gpt-omni / mini-omni

An open-source multimodal large language model that can hear and talk while it thinks, featuring real-time end-to-end speech input and streaming audio output for conversation.
https://arxiv.org/abs/2408.16725
MIT License

Apple silicon support #6

Open bhupesh-sf opened 1 week ago

bhupesh-sf commented 1 week ago

Hey, this project seems really interesting. There is currently hardly any competitor to ChatGPT's advanced voice mode, but this seems to be heading in the same direction.

The device currently being used is CUDA; can it run on Apple silicon? What changes would be required?

mini-omni commented 1 week ago

Thank you for your interest. Since we are not familiar with MLX, there are currently no plans for adaptation. We welcome contributions from the MLX community.

williamchai commented 1 week ago

Updated: added new lines at the end


Just figured out how to run on an M1 Mac:

  1. save the diff below to mac.patch
  2. run git apply mac.patch
  3. If you want to use the gradio webui, run cp webui/omni_gradio.py . and then python3 omni_gradio.py
diff --git a/inference.py b/inference.py
index 598c6cb..636cc08 100644
--- a/inference.py
+++ b/inference.py
@@ -26,6 +26,7 @@ from tqdm import tqdm
 from huggingface_hub import snapshot_download

+DEVICE = "cpu"
 torch.set_printoptions(sci_mode=False)

@@ -372,7 +373,7 @@ def download_model(ckpt_dir):

 class OmniInference:

-    def __init__(self, ckpt_dir='./checkpoint', device='cuda:0'):
+    def __init__(self, ckpt_dir='./checkpoint', device=DEVICE):
         self.device = device
         if not os.path.exists(ckpt_dir):
             print(f"checkpoint directory {ckpt_dir} not found, downloading from huggingface")
@@ -506,7 +507,7 @@ class OmniInference:

 def test_infer():
-    device = "cuda:0"
+    device = DEVICE
     out_dir = f"./output/{get_time_str()}"
     ckpt_dir = f"./checkpoint"
     if not os.path.exists(ckpt_dir):
diff --git a/litgpt/model.py b/litgpt/model.py
index dc5cbd0..25e726e 100644
--- a/litgpt/model.py
+++ b/litgpt/model.py
@@ -220,7 +220,7 @@ class GPT(nn.Module):
         self,
         batch_size: int,
         rope_cache_length: Optional[int] = None,
-        device: Optional[torch.device] = None,
+        device: Optional[torch.device] = "cpu",
         dtype: Optional[torch.dtype] = None,
     ) -> None:
         if rope_cache_length is None:
diff --git a/server.py b/server.py
index 5740613..02ec12b 100644
--- a/server.py
+++ b/server.py
@@ -3,12 +3,12 @@ import base64
 import tempfile
 import traceback
 from flask import Flask, Response, stream_with_context
-from inference import OmniInference
+from inference import OmniInference, DEVICE

 class OmniChatServer(object):
     def __init__(self, ip='0.0.0.0', port=60808, run_app=True,
-                 ckpt_dir='./checkpoint', device='cuda:0') -> None:
+                 ckpt_dir='./checkpoint', device=DEVICE) -> None:
         server = Flask(__name__)
         # CORS(server, resources=r"/*")
         # server.config["JSON_AS_ASCII"] = False
diff --git a/utils/snac_utils.py b/utils/snac_utils.py
index a2a4a6a..ccee4c9 100644
--- a/utils/snac_utils.py
+++ b/utils/snac_utils.py
@@ -107,9 +107,9 @@ def reconstruct_tensors(flattened_output):
             tensor3.append(flattened_output[i + 6])
             tensor3.append(flattened_output[i + 7])
             codes = [
-                list_to_torch_tensor(tensor1).cuda(),
-                list_to_torch_tensor(tensor2).cuda(),
-                list_to_torch_tensor(tensor3).cuda(),
+                list_to_torch_tensor(tensor1),
+                list_to_torch_tensor(tensor2),
+                list_to_torch_tensor(tensor3),
             ]

     if n_tensors == 15:
@@ -133,10 +133,10 @@ def reconstruct_tensors(flattened_output):
             tensor4.append(flattened_output[i + 15])

             codes = [
-                list_to_torch_tensor(tensor1).cuda(),
-                list_to_torch_tensor(tensor2).cuda(),
-                list_to_torch_tensor(tensor3).cuda(),
-                list_to_torch_tensor(tensor4).cuda(),
+                list_to_torch_tensor(tensor1),
+                list_to_torch_tensor(tensor2),
+                list_to_torch_tensor(tensor3),
+                list_to_torch_tensor(tensor4),
             ]

     return codes
diff --git a/webui/omni_gradio.py b/webui/omni_gradio.py
index e3a60b9..bd8da8d 100644
--- a/webui/omni_gradio.py
+++ b/webui/omni_gradio.py
@@ -12,8 +12,8 @@ API_URL = os.getenv("API_URL", None)
 client = None

 if API_URL is None:
-    from inference import OmniInference
-    omni_client = OmniInference('./checkpoint', 'cuda:0')
+    from inference import OmniInference, DEVICE
+    omni_client = OmniInference('./checkpoint', DEVICE)
     omni_client.warm_up()
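
For anyone applying the patch, a quick CPU smoke test (a sketch that uses only names present in the patched inference.py; the checkpoint should be downloaded on first use per the download logic shown above) could be:

```python
# Sketch: exercise the patched CPU path end to end via the repo's own test helper.
# Assumes mac.patch has been applied and the working directory is the repo root.
from inference import test_infer

test_infer()
```
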
reachsak commented 1 week ago

(quoted williamchai's mac.patch instructions above)

Hi, can you please elaborate more? Do I need to create mac.patch and save it with this content? And should I run git apply mac.patch after cloning this project? Thank you

yukiarimo commented 1 week ago

+1 for the full tutorial

bhupesh-sf commented 1 week ago

(quoted williamchai's mac.patch instructions above)

It might be a dumb question, but why set the device to CPU? I think the device could be MPS on Apple silicon, since that also gives the option to use the GPU.
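
Something along these lines, perhaps (a hedged sketch, assuming the installed PyTorch build ships the MPS backend checks):

```python
import torch

# Prefer the Apple GPU (MPS) when available, otherwise fall back to CUDA or CPU.
if torch.backends.mps.is_available():
    DEVICE = "mps"
elif torch.cuda.is_available():
    DEVICE = "cuda:0"
else:
    DEVICE = "cpu"
```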

scalar27 commented 1 week ago

I have an M1 Mac. After following the installation procedure in this GitHub repo, I copied the code above into a new file called mac.patch. However, when I do git apply mac.patch, I get this error: error: corrupt patch at line 108 (which is at the end of the file). Any suggestions?

vshelke commented 1 week ago

@scalar27 the issue is that the diff file needs three or more newlines at the end. I think GitHub must have stripped the trailing newlines when the diff above was copied. Try the diff below.

(the same mac.patch as above, reposted with the trailing newlines intact)

Goekdeniz-Guelmez commented 1 week ago

(quoted vshelke's reposted mac.patch above)

Works for me! Thanks!

scalar27 commented 1 week ago

This works for me too. There is a lot of stuttering in the output voice. Could something be changed to improve this? Someone asked about supporting mps as DEVICE instead of cpu; it would be great if someone could try to add that support. Also, in omni_streamlit.py I tried playing with the output chunk size, but it didn't seem to help with the stuttering.

mini-omni commented 1 week ago

Hi guys, I just fixed the device problem, so one can now run inference on the 'cpu' device by setting device='cpu' at this line: https://github.com/gpt-omni/mini-omni/blob/main/inference.py#L375
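
For example (a minimal sketch using the constructor shown in the patch above; paths are the repo defaults):

```python
from inference import OmniInference

# Instantiate the model on CPU, matching the fixed default device.
client = OmniInference(ckpt_dir='./checkpoint', device='cpu')
client.warm_up()
```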

mini-omni commented 1 week ago

(quoted scalar27's stuttering report above)

Hi @scalar27, we have noticed that the stuttering sometimes occurs, and we think the reason is that the predicted audio tokens are not good enough.

scalar27 commented 1 week ago

On my M1 Mac (Max, 64 GB) I get stuttering all the time; I never get a smooth sentence spoken without stuttering. With CUDA you do not have that situation, right? I'm guessing that if the code could use the Mac GPU it would make a big improvement.

mini-omni commented 1 week ago

(quoted scalar27's previous comment)

Can you give an example? When using CUDA, the output is just what the demo video shows.

scalar27 commented 1 week ago

Archive.zip: I zipped up the two small audio files. One is the streamed version (very choppy/stuttering) and the other is the downloaded audio (not perfect, but almost).

kunci115 commented 1 week ago

Clone from the forked repo; I don't have it stuttering. I have run it on my Mac, but it seems Whisper is not doing the transcription fluently. At least it works end to end: https://github.com/kunci115/mini-omni-mac-support

scalar27 commented 1 week ago

I cloned the fork in a separate conda env and local folder to avoid conflicts with the original mini-omni. When I ran the fork, the first error I got was: NotImplementedError: The operator 'aten::isin.Tensor_Tensor_out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

When I added the fallback, things worked and I got speech output, but with the same stuttering/choppy audio. I'm not sure it's "latency", because the delay before it starts speaking is not bad; rather the "chunks" seem too short (not sure I am using that term properly).
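
For reference, the fallback mentioned above is typically enabled before torch is imported (a hedged sketch; exporting the variable in the shell before launching works as well):

```python
import os

# Enable CPU fallback for ops not yet implemented on MPS
# (e.g. aten::isin.Tensor_Tensor_out). Set before importing torch.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch  # imported after setting the variable on purpose
```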

gantuo commented 5 days ago

I have tried to use device='mps' on my M2 MacBook, but some ops are not implemented on the MPS platform; this is a torch issue.

kunci115 commented 5 days ago

(quoted scalar27's MPS fallback report above)

On a MacBook Air M3 I didn't have this issue. If you can edit the README in the fork with step-by-step instructions for the Mac implementation on the M2 chip, it will be helpful for others.