Open mbelouso opened 6 days ago
I had this issue, and it seems to be an error related to memory usage. I solved it by truncating the protein and MSA. This should be fixed, since the try/except block in featurization doesn't shed any light on where the error is actually happening.
Will investigate this; off the top of my head it could be related to the MSA and input query having different lengths. I'll add a check for this, and will enable a stack trace dump in the featurizer so we can better diagnose people's issues.
@mbelouso could you share the input data you ran? Including your MSA?
Ran into the exact same issue, and it does seem to be related to input proteins not matching the length of the MSA (for example, the input is longer than the MSA entry, and the MSA entry doesn't use ----- for deletions). I would definitely benefit from a diagnosis and a solution/workaround. The option to limit VRAM usage will probably help.
I'll make sure to add a check that the MSA is consistent with the input sequence! Thanks for flagging.
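For anyone hitting this in the meantime, here is a minimal sketch of the kind of consistency check described above (a hypothetical helper, not the actual Boltz code): it counts uppercase residues and '-' gaps in each a3m entry, ignoring lowercase insertions, and compares that aligned length against the query.

def check_msa_matches_query(a3m_path):
    """Hypothetical sanity check: the aligned length of every entry in an
    .a3m file (uppercase residues plus '-' gaps, lowercase insertions
    ignored) should equal the query length."""
    entries = []
    current = []
    with open(a3m_path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if line.startswith(">"):
                if current:
                    entries.append("".join(current))
                    current = []
            else:
                current.append(line)
    if current:
        entries.append("".join(current))

    def aligned_len(seq):
        # lowercase letters are insertions and do not consume query columns
        return sum(1 for c in seq if c == "-" or c.isupper())

    query_len = aligned_len(entries[0])
    for i, seq in enumerate(entries[1:], start=1):
        if aligned_len(seq) != query_len:
            raise ValueError(
                f"MSA entry {i} has aligned length {aligned_len(seq)}, "
                f"expected {query_len} (the query length)"
            )

Running something like this over an a3m before calling boltz predict would catch both the length mismatches and the truncated-last-sequence case reported further down in this thread.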
@mbelouso could you share the input data you ran? Including your MSA?
Happy to, but the MSA is 21 MB; should I upload it somewhere?
Maybe somewhere on google drive, so I can download? Thanks!
The FASTA and a3m files are here:
https://drive.google.com/drive/folders/1N9iW9_FmWL8yJItNnfdxv_974vAAIPm1?usp=drive_link
@mbelouso I just tried running this on an A100 GPU and it ran smoothly. Are you positive that this is the example that crashes? It's unclear to me why a memory error would give you this kind of stack trace. What type of hardware are you running this on?
Hi @mbelouso, I noticed that there are many sequences in your MSA file that are longer than the query sequence. You should remove the lowercase parts from these sequences, and after this, the processed sequences should be the same length as the query. This is how I resolved the issue, and I hope it can help you :D
Hi @jwohlwend,
I also encountered the same error. This is the data I used: https://drive.google.com/drive/folders/13WXIZ8oBDL8jhq0J3H9OA73R25etmZdO?usp=drive_link. I tested it on an A100.
This is the log output:
You are using a CUDA device ('NVIDIA A100 80GB PCIe') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high')
which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Predicting: | | 0/? [00:00<?, ?it/s]Featurizer failed on 6b5m with error index 555 is out of bounds for axis 0 with size 555. Skipping.
Featurizer failed on 6b5m with error index 555 is out of bounds for axis 0 with size 555. Skipping.
Featurizer failed on 6b5m with error index 555 is out of bounds for axis 0 with size 555. Skipping.
Featurizer failed on 6b5m with error index 555 is out of bounds for axis 0 with size 555. Skipping.
Featurizer failed on 6b5m with error index 555 is out of bounds for axis 0 with size 555. Skipping.
Featurizer failed on 6b5m with error index 555 is out of bounds for axis 0 with size 555. Skipping.
Featurizer failed on 6b5m with error index 555 is out of bounds for axis 0 with size 555. Skipping.
Hi @mbelouso, I noticed that there are many sequences in your MSA file that are longer than the query sequence. You should remove the lowercase parts from these sequences, and after this, the processed sequences should be the same length as the query. This is how I resolved the issue, and I hope it can help you :D
Actually this is not correct, the lowercase letters are used to compute the deletion matrix and should be kept! I suspect the issue is elsewhere in this case
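To illustrate the convention (a generic a3m sketch, not the exact Boltz parser): lowercase letters are insertions relative to the query, and the deletion matrix records how many lowercase letters precede each aligned column, so stripping them changes the features.

def a3m_row_to_alignment(row):
    """Generic a3m convention: uppercase and '-' are aligned columns,
    lowercase letters are insertions counted against the next aligned
    column (this is what the deletion matrix is built from)."""
    aligned = []
    deletions = []
    run = 0
    for c in row:
        if c.islower():
            run += 1          # insertion relative to the query
        else:
            aligned.append(c)
            deletions.append(run)
            run = 0
    return "".join(aligned), deletions

# toy example: two inserted residues before the third aligned column
print(a3m_row_to_alignment("ACggDE"))  # -> ('ACDE', [0, 0, 2, 0])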
@bestz123 The download link is still private. I've requested access!
I'm getting the same problems (just with much bigger numbers) for the attached test case:
Featurizer failed on 7drt_test_processed_no_ligand with error index 1357771 is out of bounds for axis 0 with size 1357771. Skipping.
There seems to be something nondeterministic going on - I've had a few cases where the same input has worked once, and failed with the above behaviour on other runs.
@jwohlwend, I have turned on permissions.
I've done some digging around, forcing the IndexError to actually raise and going from there... the actual error is happening at https://github.com/jwohlwend/boltz/blob/e049f84004ff9296d632976ee5f3efd0e7700566/src/boltz/data/feature/featurizer.py#L342. As far as I can tell, it's arising from some mismatch between how lower-case characters are handled in the a3m parser (https://github.com/jwohlwend/boltz/blob/e049f84004ff9296d632976ee5f3efd0e7700566/src/boltz/data/parse/a3m.py#L11) and how they're counted in BoltzTokenizer (https://github.com/jwohlwend/boltz/blob/e049f84004ff9296d632976ee5f3efd0e7700566/src/boltz/data/tokenize/boltz.py#L31).
If I hack https://github.com/jwohlwend/boltz/blob/e049f84004ff9296d632976ee5f3efd0e7700566/src/boltz/data/parse/a3m.py#L69-L70 to:
for c in line:
    c = c.upper()
    if c != "-" and c.islower():
... then I get successful runs every time. Haven't dug deeply enough into the logic to see exactly where the mismatch is happening, though.
The reason it's causing a runaway memory drain is that it's arising when trying to fetch entry 0 at https://github.com/jwohlwend/boltz/blob/e049f84004ff9296d632976ee5f3efd0e7700566/src/boltz/data/module/inference.py#L153 - on the exception it falls back to trying to fetch entry 0 again, leading to an infinite loop.
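As a side note, a bounded-retry pattern (a hypothetical rewrite, not the actual inference.py code) would avoid that infinite loop: surface the real exception and give up after a few attempts instead of recursing on index 0 forever.

import traceback

class SafeInferenceDataset:  # hypothetical stand-in for the real dataset class
    def __init__(self, records, featurize, max_retries=3):
        self.records = records
        self.featurize = featurize
        self.max_retries = max_retries

    def __getitem__(self, idx):
        last_err = None
        for _ in range(self.max_retries):
            try:
                return self.featurize(self.records[idx])
            except Exception as err:
                last_err = err
                traceback.print_exc()  # print the real stack trace instead of hiding it
                idx = 0                # fall back to the first record, but only a few times
        raise RuntimeError("featurization kept failing; giving up") from last_err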
@tristanic thanks for doing some digging! I agree there must be some weird handling of lowercase characters, but making them uppercase is certainly not what we want to do. I'm also unclear about why this is stochastic...
I agree there must be some weird handling of lowercase characters, but making them uppercase is certainly not what we want to do.
Oh, definitely. That was meant more as just a demonstration of where the problem is arising. It helped to limit max_msa_seqs to 8 to make visual analysis a bit easier... while adding some print statements for logging I noticed that the distance it wanted to read past the end of the array matched the number of lowercase letters in the first 8 sequences.
I'm also unclear about why this is stochastic.
Yeah, that's still a complete mystery to me as well.
@mbelouso I just tried running this on an A100 GPU and it ran smoothly. Are you positive that this is the example that crashes? It's unclear to me why a memory error would give you this kind of stack trace. What type of hardware are you running this on?
Hardware: dual-socket Xeon workstation, RTX 3080, Linux Mint 21.2, 64 GB main memory.
Hi @jwohlwend, I also encountered the same error. This is the data I used: https://drive.google.com/drive/folders/13WXIZ8oBDL8jhq0J3H9OA73R25etmZdO?usp=drive_link. I tested it on an A100. This is the log output: You are using a CUDA device ('NVIDIA A100 80GB PCIe') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1] Predicting: | | 0/? [00:00<?, ?it/s] Featurizer failed on 6b5m with error index 555 is out of bounds for axis 0 with size 555. Skipping. (repeated six times)
I used this data to test, and this problem no longer exists in version 0.2.1.
Ah! I think I've found the actual problem now. Nothing to do with lowercase characters at all (sorry for the red herring)... it looks like it happens if the count of uppercase and '-' characters in the last used sequence in the MSA is smaller than the input sequence. boltz_msa_test_case.tar.gz
A test case using a two-sequence MSA is attached. success.a3m has the second sequence with the correct length; fail.a3m has a single character deleted from the second sequence. I've tried with a larger MSA file, setting max_msa_seqs to different values in main.py, and then tinkering with different edits to the sequences. Only the length of the final used sequence seems to matter, and the failure only appears when it's shorter than the input.
In my case it looks like this came from me making a mistake when writing a script to concatenate MSAs from our internal MMSeqs2 server (probably leaving a blank line?) then making things worse when trying to "fix" it due to my unfamiliarity with the .a3m format. After some minor retooling, things are now working correctly. Mea culpa... but at least I hope this will help others avoid the same pitfall!
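For anyone who wants to reproduce this locally, here is a sketch of how an equivalent two-sequence test case could be generated (toy sequences, not the attached archive):

# Build a minimal two-sequence test case of the kind described above.
query = "MAHHHHHHVAVDAVSFTLLQDQLQSVLDTLSEREAGV"   # any query sequence works
hit_ok = "-" * 8 + query[8:]                     # same aligned length as the query
hit_bad = hit_ok[:-1]                            # one aligned character deleted -> should trigger the failure

with open("success.a3m", "w") as fh:
    fh.write(f">query\n{query}\n>hit\n{hit_ok}\n")

with open("fail.a3m", "w") as fh:
    fh.write(f">query\n{query}\n>hit\n{hit_bad}\n")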
@tristanic Thanks for digging in further. Glad you found the issue with your MSAs. I did verify the MSA lengths for the other users and did not find inconsistencies, so I'm not sure. Maybe there was a bug that got solved in the new releases. Will wait to hear from others before closing this issue.
Might it be worth adding a little sanity checking in the parser to ensure each entry in the MSA meets Boltz's expectations?
It sure would :)
I'm getting this error with the following inputs (worth mentioning that it appears to still be running):
boltzenv boltz predict /opt/run_output/boltz/multimer.yaml --cache /databases/colabfold/boltz/weights/ --out_dir /opt/run_output/boltz/outputs/ --devices 1
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Predicting: | | 0/? [00:00<?, ?it/s]
Featurizer failed on multimer with error index 458752 is out of bounds for axis 0 with size 458752. Skipping.
Featurizer failed on multimer with error index 458752 is out of bounds for axis 0 with size 458752. Skipping.
Featurizer failed on multimer with error index 458752 is out of bounds for axis 0 with size 458752. Skipping.
Featurizer failed on multimer with error index 458752 is out of bounds for axis 0 with size 458752. Skipping.
...
sequences:
  - protein:
      id: A
      msa: /opt/run_output/colabfold/outputs/0.a3m
      sequence: MAHHHHHHVAVDAVSFTLLQDQLQSVLDTLSEREAGVVRLRFGLTDGQPRTLDEIGQVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLDGSSGSGTPEERLLRAIFGEKA
  - protein:
      id: B
      msa: /opt/run_output/colabfold/outputs/1.a3m
      sequence: MRYAFAAEATTCNAFWRNVDMTVTALYEVPLGVCTQDPDRWTTTPDDEAKTLCRACPRRWLCARDAVESAGAEGLWAGVVIPESGRARAFALGQLRSLAERNGYPVRDHRVSAQSA
version: 1
I got the MSAs by running ColabFold locally.
gpu:
Fri Nov 22 14:23:35 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:07:00.0 Off | 0 |
| N/A 33C P0 77W / 400W | 4115MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:BD:00.0 Off | 0 |
| N/A 35C P0 66W / 400W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
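If it helps anyone debug, here is a rough pre-flight check (a hypothetical script, assuming PyYAML is installed and the YAML layout shown above, with the path taken from the command above): it verifies that each protein sequence in the YAML matches the first (query) entry of its .a3m.

import yaml  # pip install pyyaml

def first_a3m_entry(path):
    """Return the first (query) sequence of an .a3m file."""
    seq = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if seq:
                    break  # only the first entry is needed
                continue
            if line:
                seq.append(line)
    return "".join(seq)

with open("/opt/run_output/boltz/multimer.yaml") as fh:
    config = yaml.safe_load(fh)

for entry in config["sequences"]:
    protein = entry["protein"]
    query = first_a3m_entry(protein["msa"])
    if query != protein["sequence"]:
        print(f"chain {protein['id']}: YAML sequence ({len(protein['sequence'])} aa) "
              f"does not match MSA query ({len(query)} aa)")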
@danpf Are you on the latest boltz release?
That was on commit 64be4d4351da47a14703f15ad3c361c054ed6cb1, which was not the latest release.
On the newest release I get this error:
(boltzenv) root@boltzeval004-n82zh:/opt/boltz2_dist# boltz predict /opt/run_output/boltz/multimer.yaml --cache /databases/colabfold/boltz/weights/ --out_dir /opt/run_output/boltz/outputs/ --devices 1
Checking input data.
Running predictions for 1 structure
Processing input data.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 25.21it/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Predicting: | | 0/? [00:00<?, ?it/s]ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1243, in _try_get_data
data = self._data_queue.get(timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/queue.py", line 180, in get
self.not_empty.wait(remaining)
File "/opt/conda/envs/boltzenv/lib/python3.12/threading.py", line 359, in wait
gotit = waiter.acquire(True, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/torch/utils/data/_utils/signal_handling.py", line 73, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 8329) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/envs/boltzenv/bin/boltz", line 8, in <module>
sys.exit(cli())
^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/boltz/main.py", line 529, in predict
trainer.predict(
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 858, in predict
return call._call_and_handle_interrupt(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 47, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 897, in _predict_impl
results = self._run(model, ckpt_path=ckpt_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 981, in _run
results = self._run_stage()
^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 1020, in _run_stage
return self.predict_loop.run()
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/loops/utilities.py", line 178, in _decorator
return loop_run(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/loops/prediction_loop.py", line 121, in run
batch, batch_idx, dataloader_idx = next(data_fetcher)
^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/loops/fetchers.py", line 133, in __next__
batch = super().__next__()
^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/loops/fetchers.py", line 60, in __next__
batch = next(self.iterator)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
out = next(self._iterator)
^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/utilities/combined_loader.py", line 142, in __next__
out = next(self.iterators[0])
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1448, in _next_data
idx, data = self._get_data()
^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1402, in _get_data
success, data = self._try_get_data()
^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1256, in _try_get_data
raise RuntimeError(
RuntimeError: DataLoader worker (pid(s) 8329) exited unexpectedly
Predicting: | | 0/? [00:21<?, ?it/s]
It is worth mentioning that I have 2 A100s provisioned along with 2024Gi of RAM.
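The shared-memory bus error usually points at the size of /dev/shm inside the container rather than at total host RAM; a quick generic check from Python (not Boltz-specific):

import shutil

# PyTorch DataLoader workers pass tensors through /dev/shm; in containers it
# often defaults to 64 MB no matter how much RAM the host has (Linux only).
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {total / 2**20:.0f} MiB total, {free / 2**20:.0f} MiB free")

If it turns out to be tiny, enlarging it (for Docker, the --shm-size run option) or running the prediction dataloader with fewer workers should make the bus errors go away.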
I'm using the latest version (actually the bleeding-edge version, reinstalled from the GitHub repo a couple of hours ago). I had no problem with @danpf's example on the first run, but repeated runs with boltz predict ./multimer.yaml --override sometimes run and sometimes fail. Something wrong with reading/writing the cached .npz files?
Thanks for the info @tristanic. I ran it 10x on the newest version (I was previously on 64be4d4351da47a14703f15ad3c361c054ed6cb1) and my errors were:
1. Featurizer failed on multimer with error index 458752 is out of bounds for axis 0 with size 458752. Skipping.
2. ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
3. ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
4. ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
5. ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
6. ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
7. ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
8. ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
9. ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
10. ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
My first successful run had max_msa_seqs set to some fairly small number (currently you have to edit main.py to change that). I don't remember what the number was (probably only 8 - tiny, but the prediction still looked pretty good). Running with standard settings gives me a "Ran out of memory"...
Running
boltz predict ./examples/multimer.yaml --cache /databases/colabfold/boltz/weights/ --out_dir /opt/run_boltz_example --use_msa_server
works with no problem.
Peeking at the MSAs from the ColabFold server, they are very different:
(boltzenv) root@boltzeval004-vvgm9:/opt/boltz# tail -n +1 /opt/run_boltz_example/boltz_results_multimer/msa/multimer_*
==> /opt/run_boltz_example/boltz_results_multimer/msa/multimer_0.a3m <==
>101
MAHHHHHHVAVDAVSFTLLQDQLQSVLDTLSEREAGVVRLRFGLTDGQPRTLDEIGQVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLDGSSGSGTPEERLLRAIFGEKA
>UniRef100_UPI0020946B8C
--------VPADAVSFTLLQEQLHSVLDTLSEREAGVVAMRFGLTDGQPKTLDEIGKVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLD---------------------
>UniRef100_A0A7G1IHR1
------------------------------------MVRLRFGLTDGQPRTLDEIGQVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLD---------------------
>UniRef100_UPI00227730C9
--------VAVDAVSFTLLQDQLQSVLETLSEREAGVVRLRFGLTDGQPRTLDEIGQVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLD---------------------
>UniRef100_UPI001C20C49A
MAHHHHHHVAVDAVSFTLLQDQLQSVLDTLSEREAGVVRLRFGLTDGQPRTLDEIGQVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLDGSSGSGTPEERLLRAIFGEKA
>UniRef100_A0A1A6BGF2
----------MDAVSFTLLQDQLQSVLDTLSEREAGVVRLRFGLTDGQPRTLDEIGQVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLD---------------------
>UniRef100_A0A5C7WQ46
--------VAVDAVSFTLLQDQLQSVLETLSEREAGVVRLRFGLTDGQPRTLDEIGQVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLD---------------------
>UniRef100_A0A7K0YHP6
--------VPADAVSFTLLQEQLHSVLDTLSEREAGVVAMRFGLTDGQPKTLDEIGKVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLD---------------------
>UniRef100_A0A965INX0
-----------EAVTRIMLSQQIEQLLHNLPEREAGVIRMRFGLDDGQIHTLDDIGKRYNVTRERIRQIESKTMSKLRHPSRSQVLRDFFD---------------------
==> /opt/run_boltz_example/boltz_results_multimer/msa/multimer_1.a3m <==
>102
MRYAFAAEATTCNAFWRNVDMTVTALYEVPLGVCTQDPDRWTTTPDDEAKTLCRACPRRWLCARDAVESAGAEGLWAGVVIPESGRARAFALGQLRSLAERNGYPVRDHRVSAQSA
>UniRef100_A0A961DSR9
---------------------------------CISDPDRWAAGGEDpELKALCRGCPRRWQCAKDALDTPGAEGMWSGVHIPKEGRGRNFALRQLRSLATHGG-------------
>UniRef100_X7YC61
--------------------MTATALYEVPLGVCTQDPDRWTTTPDNEAKAMCRACPRRWACARDAVESPGAEGLWAGVVIPESGRARAFALGQLRSLAERNGYPVRE-RVVAQSA
>UniRef100_UPI0009FDEDE2
---------------------------------CISDPDRWAAGGEDpELKALCRGCPRRWQCAKDALDTPGAEGMWSGVNIPKEGRGRKFALRQLRSLAAHGGFTVAD--------
>UniRef100_UPI00214E9A98
MRYAFAAESTTCNAFWRNVDMTVTALYEVPLGVCTQDPDRWTTTPDDEAKTLCRACPRRWLCARDAVESAGAEGLWAGVVIPESGRARAFALGQLRSLAERNGYPVRDHRVSAQSA
>UniRef100_UPI00197F9F9C
--------------------MTATTLYEIPqLGVCTQDPDRWTTTPDDEAKTLCRACPRRWLCAKEAVESPGAEGLWAGVVIPDSGRPRAFALAQLRSLAERNGFAVRE-RVTAQSA
>UniRef100_A0A3S0RV96
--------------------MTATTLYEVPqLGVCTQDPDRWTTTPDDEAKTLCRACPRRWLCAKEAVESPGAEGLWAGVVIPDSGRPRAFALAQLRSLAERNGFAVRE-RVTAQSA
>UniRef100_A0A7V3I0A1
-------------------------------GACARDPERWTTAPDNEAKALCRACPRRWPCARDACELPGAEGLWAGVVIPEAGRPRAFALRQLRSLAERHGYPVRDPKVPAQPA
>UniRef100_A0A941Y9P8
--------------------MSAVTYLDIPIGACTRDPERWTTAADDDAKAICRACPRRWLCARDACELPRAEGLWAGIVIPEAGRGRTFALRQLRSLAERNGYPVRaTRRVFPESA
whereas my local run produces 2.8 MB and 1.3 MB a3m files.
I assume the server must do some trimming/clustering that the local one does not do... I will look into options to reduce the MSA.
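One crude way to shrink a local a3m before prediction is to keep only the first N entries (a hypothetical helper; note it ignores pairing between chains for multimer MSAs, and Boltz also caps what it reads internally via max_msa_seqs in main.py):

def truncate_a3m(src, dst, max_seqs=512):
    """Keep only the first max_seqs entries of an .a3m file."""
    kept = 0
    keep = False
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                kept += 1
                keep = kept <= max_seqs
            if keep:
                fout.write(line)

truncate_a3m("0.a3m", "0.small.a3m", max_seqs=512)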
We get a similar error for some inputs.
Lots of
Featurizer failed on XXX with error index 232820 is out of bounds for axis 0 with size 232820. Skipping.
eventually followed by
RecursionError: maximum recursion depth exceeded
RecursionError: maximum recursion depth exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 351, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/opt/venv/lib/python3.12/site-packages/boltz/data/module/inference.py", line 169, in __getitem__
    return self.__getitem__(0)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/boltz/data/module/inference.py", line 169, in __getitem__
    return self.__getitem__(0)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/boltz/data/module/inference.py", line 169, in __getitem__
    return self.__getitem__(0)
           ^^^^^^^^^^^^^^^^^^^
  [Previous line repeated 959 more times]
  File "/opt/venv/lib/python3.12/site-packages/boltz/data/module/inference.py", line 146, in __getitem__
    return self.__getitem__(0)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/boltz/data/module/inference.py", line 146, in __getitem__
    return self.__getitem__(0)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/boltz/data/module/inference.py", line 146, in __getitem__
    return self.__getitem__(0)
           ^^^^^^^^^^^^^^^^^^^
  [Previous line repeated 8 more times]
I believe this is actually what we have to run to recreate the server results: https://github.com/soedinglab/MMseqs2-App/blob/master/backend/worker.go#L953, not colabfold_search.
Actually, it is easier if you just fork ColabFold and comment out the unlink commands around this location: https://github.com/sokrypton/ColabFold/blob/main/colabfold/mmseqs/search.py#L507
The files you need are the x.paired.a3m files.
Hey guys,
Great stuff, I've got it running on my system, but it seems to have some problems with proteins above a certain size. It runs fine with proteins of size < 100 aa, but anything bigger gives the following error:
Featurizer failed on gpr3 with error index 330 is out of bounds for axis 0 with size 330. Skipping.
At this point the program eats up all my main memory and then crashes... Super happy to share my input files if that helps.
cheers
matt B