jina-ai / serve


ConnectionError: failed to connect to all addresses on Mac M1 #4828

Closed · ugm2 closed this issue 2 years ago

ugm2 commented 2 years ago

Describe the bug

On my MacBook with an M1 Pro chip, I'm getting the following error when calling the Flow through an API:

ConnectionError: failed to connect to all addresses |Gateway: Communication error with deployment CustomTransformerTorchEncoder at address(es) {'0.0.0.0:50016'}. Head or worker(s) may be down.

Environment

I installed jina==3.4.7 on my MacBook M1 Pro, along with:

elasticsearch==8.2.0
spacy==3.3.0
torch==1.13.0
transformers==4.19.2

I then start my FastAPI app (which uses Jina) with the following command:

sudo JINA_MP_START_METHOD=spawn python -m uvicorn neural_search.api.server:app --reload --host 0.0.0.0 --port 5001

I use sudo because otherwise the program can't access the folders on my Mac.

The flow yaml file used is the following:

jtype: Flow
version: '1'

executors:
  - name: CustomTransformerTorchEncoder
    uses: 'CustomTransformerTorchEncoder'
    volumes: '~/.cache/huggingface:/root/.cache/huggingface'
    py_modules: 'neural_search/core/executors/encoder.py'
    uses_with:
      pretrained_model_name_or_path: 'sentence-transformers/all-MiniLM-L6-v2'
  - name: CustomIndexer
    uses: 'CustomIndexer'
    py_modules: 'neural_search/core/executors/indexer.py'
    install_requirements: True
    uses_with:
      traversal_right: '@c'
      traversal_left: '@r'
      n_dim: 512
    workspace: workspace
  - name: ranker
    uses: 'jinahub://SimpleRanker'
    install_requirements: True
    uses_with:
      metric: 'cosine'
      ranking: 'max'
      traversal_paths: '@c'
    volumes: 'workspace'

CustomTransformerTorchEncoder is defined in the following link and CustomIndexer in this other link.
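
For reference, a minimal sketch of how a Flow like the one above might be wrapped inside the FastAPI app that the uvicorn command starts (the module layout, endpoint name and response shape are assumptions for illustration, not the actual neural_search/api/server.py):

# Hypothetical sketch of neural_search/api/server.py, assuming the Flow YAML above is saved as flow.yml
from docarray import Document, DocumentArray
from fastapi import FastAPI
from jina import Flow

app = FastAPI()
flow = Flow.load_config('flow.yml')


@app.on_event('startup')
def start_flow():
    # spawn the Gateway and all Executor processes when uvicorn boots
    flow.start()


@app.on_event('shutdown')
def stop_flow():
    flow.close()


@app.get('/search')
def search(query: str):
    # send the query through encoder -> indexer -> ranker
    results = flow.post(on='/search', inputs=DocumentArray([Document(text=query)]))
    return {'matches': [m.text for m in results[0].matches]}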

Screenshots

[screenshot: service startup logs with truncated warning messages]

In the screenshot of the service starting up, you can see that some warnings are present, but their contents are not actually shown.

After calling the Flow I get this:

[screenshot: ConnectionError traceback after calling the Flow]

JoanFM commented 2 years ago

We would need to see the code, with exact instructions on how to run it.

ugm2 commented 2 years ago

Here is a minimal failure example:

jina-minimum-failure-example.zip

Remember to execute this on a MacBook with an M1 chip (any version).

After some testing, the problem seems to be related to the new PyTorch mps backend that enables GPU usage on M1 chips.

In the CustomTransformerTorchEncoder class, if instead of doing this:

if device is None:
    if torch.backends.mps.is_available():
        device = 'mps'
    elif torch.cuda.is_available():
        device = 'cuda'
    else:
        device = 'cpu'
self.device = torch.device(device)
self.embedding_fn_name = embedding_fn_name

self.tokenizer = AutoTokenizer.from_pretrained(base_tokenizer_model)
self.model = AutoModel.from_pretrained(
    pretrained_model_name_or_path, output_hidden_states=True
)
self.model.to(device).eval()

which assigns device to 'mps' because I have the latest PyTorch version with the M1 backend installed, I do this:

device = 'cpu'
self.device = torch.device(device)
self.embedding_fn_name = embedding_fn_name

self.tokenizer = AutoTokenizer.from_pretrained(base_tokenizer_model)
self.model = AutoModel.from_pretrained(
    pretrained_model_name_or_path, output_hidden_states=True
)
self.model.to(device).eval()

It works.
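
The same workaround can be kept switchable; below is a small sketch of the device-selection logic with an opt-out (the FORCE_CPU_DEVICE variable is made up for illustration, not something Jina or PyTorch define):

import os
from typing import Optional

import torch


def pick_device(device: Optional[str] = None) -> torch.device:
    # made-up escape hatch: force CPU when the Metal/mps backend misbehaves
    if os.environ.get('FORCE_CPU_DEVICE') == '1':
        return torch.device('cpu')
    if device is None:
        if torch.backends.mps.is_available():
            device = 'mps'
        elif torch.cuda.is_available():
            device = 'cuda'
        else:
            device = 'cpu'
    return torch.device(device)

In the encoder's __init__() this would replace the if/elif block above with self.device = pick_device(device).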

JoanFM commented 2 years ago

Then, can we consider this to be fixed? It does not seem to be a Jina-related problem?

ugm2 commented 2 years ago

Yeah, probably. Although it seems Jina could give a more insightful hint about what's going on? Just wondering.

JoanFM commented 2 years ago

If you run this code without Jina, what exception is given? The logs of the Executor should show that. @JohannesMessner, any hint here?
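
For reference, the encoder's torch/transformers path can be exercised without any Jina code roughly like this (the model name is taken from the Flow YAML above; on an affected M1 setup the crash would be expected at the forward pass):

# Standalone check of the encoder's torch/transformers path, no Jina involved
import torch
from transformers import AutoModel, AutoTokenizer

device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained(
    'sentence-transformers/all-MiniLM-L6-v2', output_hidden_states=True
)
model.to(device).eval()

inputs = tokenizer(['hello world'], return_tensors='pt', padding=True, truncation=True)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.inference_mode():
    outputs = model(**inputs)  # on a broken Metal setup, the process dies here

print(outputs.last_hidden_state.shape)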

ugm2 commented 2 years ago

It seems that the Encoder I was using (which I got from Jina Hub) is the one that is failing, and it does not seem to be caused by other libraries like Transformers. It happens on the following line:

https://github.com/jina-ai/executor-text-transformers-torch-encoder/blob/61c2a0550f942fb54f539342e971501a605257a5/transform_encoder.py#L120

But in the __init__() function, instead of doing:

self.device = device

I am doing the following to load the M1 Torch backend:

if device is None:
    if torch.backends.mps.is_available():
        device = 'mps'
    elif torch.cuda.is_available():
        device = 'cuda'
    else:
        device = 'cpu'
self.device = torch.device(device)

JohannesMessner commented 2 years ago

The error reporting is indeed not ideal, and not expected either. If there is a Python exception being raised, it should be propagated back to the client and raised there. @ugm2 could you help create a minimal example to reproduce this? Could it be that PyTorch fails on the C++ layer without even raising a Python exception? From the error message it looks like the Executor runtime is entirely dead and not responding at all. In that case it would be tough for us to report the error back to the user.
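
A quick way to sanity-check the propagation path for ordinary Python exceptions (as opposed to a C-level crash) is a toy Flow whose Executor simply raises; the exact client-side error type may vary by Jina version:

# Toy check: a plain Python exception raised inside an Executor should surface at the client
from docarray import DocumentArray
from jina import Executor, Flow, requests


class FailingEncoder(Executor):
    @requests
    def encode(self, docs: DocumentArray, **kwargs):
        raise ValueError('boom')  # ordinary Python exception, not a C-level abort


with Flow().add(uses=FailingEncoder) as f:
    try:
        f.post(on='/', inputs=DocumentArray.empty(1))
    except Exception as e:
        # the worker stays alive, so the error can be reported back;
        # the Metal assertion instead kills the whole worker process
        print(type(e).__name__, e)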

ugm2 commented 2 years ago

@JohannesMessner Here you have a minimal example (again, for Mac M1):

minimal-failure-example.zip

The error that I get is:

/AppleInternal/Library/BuildRoots/8d3bda53-8d9c-11ec-abd7-fa6a1964e34e/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:782: failed assertion `[MPSNDArray, initWithBuffer:descriptor:] Error: buffer is not large enough. Must be 36864 bytes`

The error seems to be related to the Metal installation rather than PyTorch itself.

JohannesMessner commented 2 years ago

Unfortunately I don't have an M1 Mac to run this on, but after looking into it, it seems the error occurs at the C code level, terminating the entire process, including the Python environment. So we have no way of propagating this error back through our network stack, which also lives in Python land.

The best we can do is report which microservice failed (remember that any solution here would also have to work in a cloud-native environment, so if a service just doesn't respond because its process is dead, we are basically out of luck).
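
To illustrate why only the process status is left to inspect after a C-level abort, here is a small standard-library sketch (nothing Jina-specific): the parent never sees a Python traceback, only a negative exit code corresponding to the signal.

# A C-level abort (like the Metal assertion) cannot be caught in Python;
# the only evidence the parent process gets is the child's exit code.
import multiprocessing
import os
import signal


def worker():
    os.abort()  # stand-in for the MPSNDArray assertion failure


if __name__ == '__main__':
    p = multiprocessing.Process(target=worker)
    p.start()
    p.join()
    print('worker exit code:', p.exitcode, '== -SIGABRT:', p.exitcode == -signal.SIGABRT)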