Closed ugm2 closed 2 years ago
We would need to see the code, with exact instructions on how to run it.
Here is a minimal failure example:
jina-minimum-failure-example.zip
Remember to execute this on a MacBook with an M1 chip (any variant).
After some testing, the problem seems related to the new PyTorch MPS backend, which enables use of the GPU on M1 chips.
In the class `CustomTransformerTorchEncoder`, if instead of doing this:
```python
if device is None:
    if torch.backends.mps.is_available():
        device = 'mps'
    elif torch.cuda.is_available():
        device = 'cuda'
    else:
        device = 'cpu'
self.device = torch.device(device)
self.embedding_fn_name = embedding_fn_name
self.tokenizer = AutoTokenizer.from_pretrained(base_tokenizer_model)
self.model = AutoModel.from_pretrained(
    pretrained_model_name_or_path, output_hidden_states=True
)
self.model.to(device).eval()
```
which assigns `mps` to `device` (because I have the latest PyTorch build with the M1 backend installed), I do this:
```python
device = 'cpu'
self.device = torch.device(device)
self.embedding_fn_name = embedding_fn_name
self.tokenizer = AutoTokenizer.from_pretrained(base_tokenizer_model)
self.model = AutoModel.from_pretrained(
    pretrained_model_name_or_path, output_hidden_states=True
)
self.model.to(device).eval()
```
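Rather than hard-coding `'cpu'`, one could keep the auto-detection but expose an explicit override. This is only a sketch, not the original Executor's code: the helper name `pick_device` and the flags `mps_ok`/`cuda_ok` are mine, with the torch availability checks passed in as plain booleans so the selection logic itself is easy to test.

```python
def pick_device(override=None, mps_ok=False, cuda_ok=False):
    """Return a device string, preferring an explicit override.

    In real code the flags would come from
    torch.backends.mps.is_available() and torch.cuda.is_available().
    """
    if override is not None:
        return override  # e.g. force 'cpu' to sidestep MPS bugs
    if mps_ok:
        return 'mps'
    if cuda_ok:
        return 'cuda'
    return 'cpu'

# Forcing 'cpu' even on a machine where MPS is available:
print(pick_device(override='cpu', mps_ok=True))  # -> cpu
```

This way the workaround becomes a constructor argument instead of an edit to the Executor's source.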
It works.
Then, can we consider this fixed? It does not seem to be a Jina-related problem.
Yeah, probably. Although it seems Jina could give a more insightful hint about what's going on? Just wondering.
If you run this code without Jina, what exception is raised? The logs of the Executor should show that. @JohannesMessner any hint here?
It seems that the Encoder I was using (which I got from Jina Hub) is the one that is failing, and the failure does not seem to be caused by other libraries like Transformers. It happens in the following line:
But in the `__init__()` function, instead of doing:

```python
self.device = device
```

I am doing the following to select the M1 Torch backend:
```python
if device is None:
    if torch.backends.mps.is_available():
        device = 'mps'
    elif torch.cuda.is_available():
        device = 'cuda'
    else:
        device = 'cpu'
self.device = torch.device(device)
```
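As a side note, from PyTorch 1.12 onward the MPS backend exposes both `is_built()` (was this PyTorch build compiled with MPS support?) and `is_available()` (can the running machine actually use it?). Checking both, and guarding against older builds, makes detection like the above more robust. A hedged sketch that degrades gracefully when torch is missing or too old:

```python
# Hedged sketch: robust MPS detection.
# torch.backends.mps.is_built() reports whether this PyTorch build was
# compiled with MPS support; is_available() additionally checks the
# running machine. Both exist in PyTorch >= 1.12.
try:
    import torch
    mps_backend = getattr(torch.backends, "mps", None)
    has_mps = (
        mps_backend is not None
        and torch.backends.mps.is_built()
        and torch.backends.mps.is_available()
    )
except ImportError:  # torch not installed at all
    has_mps = False

print("MPS usable:", has_mps)
```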
The error reporting is indeed not ideal, and not expected either. If a Python exception is being raised, it should be propagated back to the client and raised there. @ugm2 could you help create a minimal example to reproduce this? Could it be that PyTorch fails at the C++ layer without even raising a Python exception? From the error message it looks like the Executor runtime is entirely dead and not responding at all. In that case it would be tough for us to report the error back to the user.
@JohannesMessner Here you have a minimal example (again, for Mac M1):
The error that I get is:
```
/AppleInternal/Library/BuildRoots/8d3bda53-8d9c-11ec-abd7-fa6a1964e34e/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:782: failed assertion `[MPSNDArray, initWithBuffer:descriptor:] Error: buffer is not large enough. Must be 36864 bytes`
```
The error seems related to the Metal installation rather than to PyTorch itself.
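Separately, PyTorch (1.12+) offers an opt-in CPU fallback for operators the MPS backend doesn't implement, via the `PYTORCH_ENABLE_MPS_FALLBACK` environment variable. It may be worth trying, though as a per-operator fallback it may well not help with a low-level Metal buffer assertion like the one above:

```shell
# Opt in to CPU fallback for ops missing on the MPS backend
# (PyTorch >= 1.12; a per-op fallback, not a fix for Metal-level asserts)
export PYTORCH_ENABLE_MPS_FALLBACK=1
echo "$PYTORCH_ENABLE_MPS_FALLBACK"
```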
Unfortunately I don't have an M1 Mac to run this on, but after looking into it, it appears the error occurs at the C code level, terminating the entire process, including the Python environment. So unfortunately we have no way of propagating this error back through our network stack, which also lives in Python land.
The best we can do is report which microservice failed (remember that any solution here would also have to work in a cloud-native environment, so if a service simply doesn't respond because its process is dead, we are basically out of luck).
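To illustrate why nothing can be propagated: a failed C-level assertion aborts the whole process, so no Python `except` clause ever runs in the Executor. The only signal left for a supervisor is the child's exit status. A small self-contained simulation, using `os.abort()` to stand in for the Metal assertion:

```python
import subprocess
import sys

# Simulate a C-level crash: os.abort() raises SIGABRT, just like a
# failed C assertion, so no Python exception can be caught in the child.
result = subprocess.run(
    [sys.executable, "-c", "import os; os.abort()"],
    capture_output=True,
)

# On POSIX the returncode is the negated signal number (-6 for SIGABRT);
# a supervisor can only observe that the child died, not why.
crashed = result.returncode != 0
print("child crashed:", crashed)  # -> child crashed: True
```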
Describe the bug
On my MacBook M1 Pro, I'm getting the following error when calling the Flow through an API:
Environment
I installed `jina==3.4.7` on my MacBook M1 Pro, along with:

And then I start my FastAPI app with Jina using the following command:
I use `sudo` because otherwise the program can't access the folders on my Mac. The flow YAML file used is the following:
Where CustomTransformerTorchEncoder is in the following link and the CustomIndexer is in this other link
Screenshots
In the screenshot taken when starting the service, you can see there are some warnings hidden there, but they are not actually shown.
After calling the Flow I get this: