Open mbelouso opened 6 days ago
I had this issue, and it seems to be an error related to memory usage. I solved it by truncating the protein and MSA. This should be fixed, since the try/except block in featurization doesn't shed any light on where the error is actually happening.
Will investigate this; off the top of my head it could be related to the MSA and input query having different lengths. I'll add a check for this, and will enable a stack trace dump in the featurizer so we can better diagnose people's issues.
@mbelouso could you share the input data you ran? Including your MSA?
Ran into the exact same issue, and it does seem to be related to input proteins not matching the length of the MSA (for example, the input is longer than the MSA entry, and the MSA entry doesn't use ----- for deletions). I would definitely benefit from a diagnosis and a solution/workaround. The option to limit VRAM usage will probably help.
I'll make sure to add a check that the MSA is consistent with the input sequence! Thanks for flagging.
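For anyone hitting this in the meantime, here is a minimal sketch of the kind of consistency check described above (a hypothetical helper, not the actual Boltz code): it counts uppercase residues and '-' gaps in each a3m entry, ignoring lowercase insertions, and compares that aligned length against the query.

def check_msa_matches_query(a3m_path):
    """Hypothetical sanity check: the aligned length of every entry in an
    .a3m file (uppercase residues plus '-' gaps, lowercase insertions
    ignored) should equal the query length."""
    entries = []
    current = []
    with open(a3m_path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if line.startswith(">"):
                if current:
                    entries.append("".join(current))
                    current = []
            else:
                current.append(line)
    if current:
        entries.append("".join(current))

    def aligned_len(seq):
        # lowercase letters are insertions and do not consume query columns
        return sum(1 for c in seq if c == "-" or c.isupper())

    query_len = aligned_len(entries[0])
    for i, seq in enumerate(entries[1:], start=1):
        if aligned_len(seq) != query_len:
            raise ValueError(
                f"MSA entry {i} has aligned length {aligned_len(seq)}, "
                f"expected {query_len} (the query length)"
            )

Running something like this over an a3m before calling boltz predict would catch both the length mismatches and the truncated-last-sequence case reported further down in this thread.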
@mbelouso could you share the input data you ran? Including your MSA?
Happy to, but the MSA is 21 MB; should I upload it somewhere?
Maybe somewhere on google drive, so I can download? Thanks!
The FASTA and a3m files are here:
https://drive.google.com/drive/folders/1N9iW9_FmWL8yJItNnfdxv_974vAAIPm1?usp=drive_link
@mbelouso I just tried running this on an A100 GPU and it ran smoothly. Are you positive that this is the example that crashes? It's unclear to me why a memory error would give you this kind of stack trace. What type of hardware are you running this on?
Hi @mbelouso, I noticed that there are many sequences in your MSA file that are longer than the query sequence. You should remove the lowercase parts from these sequences, and after this, the processed sequences should be the same length as the query. This is how I resolved the issue, and I hope it can help you :D
Hi @jwohlwend,
I also encountered the same error. This is the data I used: https://drive.google.com/drive/folders/13WXIZ8oBDL8jhq0J3H9OA73R25etmZdO?usp=drive_link. I tested it on an A100.
This is the log output:
You are using a CUDA device ('NVIDIA A100 80GB PCIe') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high')
which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Predicting: | | 0/? [00:00<?, ?it/s]Featurizer failed on 6b5m with error index 555 is out of bounds for axis 0 with size 555. Skipping.
Featurizer failed on 6b5m with error index 555 is out of bounds for axis 0 with size 555. Skipping.
Featurizer failed on 6b5m with error index 555 is out of bounds for axis 0 with size 555. Skipping.
Featurizer failed on 6b5m with error index 555 is out of bounds for axis 0 with size 555. Skipping.
Featurizer failed on 6b5m with error index 555 is out of bounds for axis 0 with size 555. Skipping.
Featurizer failed on 6b5m with error index 555 is out of bounds for axis 0 with size 555. Skipping.
Featurizer failed on 6b5m with error index 555 is out of bounds for axis 0 with size 555. Skipping.
Hi @mbelouso, I noticed that there are many sequences in your MSA file that are longer than the query sequence. You should remove the lowercase parts from these sequences, and after this, the processed sequences should be the same length as the query. This is how I resolved the issue, and I hope it can help you :D
Actually this is not correct, the lowercase letters are used to compute the deletion matrix and should be kept! I suspect the issue is elsewhere in this case
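To illustrate the convention (a generic a3m sketch, not the exact Boltz parser): lowercase letters are insertions relative to the query, and the deletion matrix records how many lowercase letters precede each aligned column, so stripping them changes the features.

def a3m_row_to_alignment(row):
    """Generic a3m convention: uppercase and '-' are aligned columns,
    lowercase letters are insertions counted against the next aligned
    column (this is what the deletion matrix is built from)."""
    aligned = []
    deletions = []
    run = 0
    for c in row:
        if c.islower():
            run += 1          # insertion relative to the query
        else:
            aligned.append(c)
            deletions.append(run)
            run = 0
    return "".join(aligned), deletions

# toy example: two inserted residues before the third aligned column
print(a3m_row_to_alignment("ACggDE"))  # -> ('ACDE', [0, 0, 2, 0])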
@bestz123 The download link is still private. I've requested access!
I'm getting the same problems (just with much bigger numbers) for the attached test case:
Featurizer failed on 7drt_test_processed_no_ligand with error index 1357771 is out of bounds for axis 0 with size 1357771. Skipping.
There seems to be something nondeterministic going on - I've had a few cases where the same input has worked once, and failed with the above behaviour on other runs.
@jwohlwend, I have turned on permissions.
I've done some digging around, forcing the IndexError to actually raise and going from there... the actual error is happening at https://github.com/jwohlwend/boltz/blob/e049f84004ff9296d632976ee5f3efd0e7700566/src/boltz/data/feature/featurizer.py#L342. As far as I can tell, it's arising from some mismatch between how lower-case characters are handled in the a3m parser (https://github.com/jwohlwend/boltz/blob/e049f84004ff9296d632976ee5f3efd0e7700566/src/boltz/data/parse/a3m.py#L11) and how they're counted in BoltzTokenizer (https://github.com/jwohlwend/boltz/blob/e049f84004ff9296d632976ee5f3efd0e7700566/src/boltz/data/tokenize/boltz.py#L31).
If I hack https://github.com/jwohlwend/boltz/blob/e049f84004ff9296d632976ee5f3efd0e7700566/src/boltz/data/parse/a3m.py#L69-L70 to:
for c in line:
    c = c.upper()
    if c != "-" and c.islower():
... then I get successful runs every time. Haven't dug deeply enough into the logic to see exactly where the mismatch is happening, though.
The reason it's causing a runaway memory drain is that it's arising when trying to fetch entry 0 at https://github.com/jwohlwend/boltz/blob/e049f84004ff9296d632976ee5f3efd0e7700566/src/boltz/data/module/inference.py#L153 - on the exception it falls back to trying to fetch entry 0 again, leading to an infinite loop.
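As a side note, a bounded-retry pattern (a hypothetical rewrite, not the actual inference.py code) would avoid that infinite loop: surface the real exception and give up after a few attempts instead of recursing on index 0 forever.

import traceback

class SafeInferenceDataset:  # hypothetical stand-in for the real dataset class
    def __init__(self, records, featurize, max_retries=3):
        self.records = records
        self.featurize = featurize
        self.max_retries = max_retries

    def __getitem__(self, idx):
        last_err = None
        for _ in range(self.max_retries):
            try:
                return self.featurize(self.records[idx])
            except Exception as err:
                last_err = err
                traceback.print_exc()  # print the real stack trace instead of hiding it
                idx = 0                # fall back to the first record, but only a few times
        raise RuntimeError("featurization kept failing; giving up") from last_err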
@tristanic thanks for doing some digging! I agree there must be some weird handling of lowercase characters, but making them uppercase is certainly not what we want to do. I'm also unclear about why this is stochastic...
I agree there must be some weird handling of lowercase characters, but making them uppercase is certainly not what we want to do.
Oh, definitely. That was meant more as just a demonstration of where the problem is arising. It helped to limit max_msa_seqs to 8 to make visual analysis a bit easier... while adding some print statements for logging I noticed that the distance it wanted to read past the end of the array matched the number of lowercase letters in the first 8 sequences.
I'm also unclear about why this is stochastic.
Yeah, that's still a complete mystery to me as well.
@mbelouso I just tried running this on an A100 GPU and it ran smoothly. Are you positive that this is the example that crashes? It's unclear to me why a memory error would give you this kind of stack trace. What type of hardware are you running this on?
Hardware: dual-socket Xeon workstation, RTX 3080, Linux Mint 21.2, 64 GB main memory.
Hi @jwohlwend, I also encountered the same error. This is the data I used: https://drive.google.com/drive/folders/13WXIZ8oBDL8jhq0J3H9OA73R25etmZdO?usp=drive_link. I tested it on an A100. This is the log output: You are using a CUDA device ('NVIDIA A100 80GB PCIe') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1] Predicting: | | 0/? [00:00<?, ?it/s] Featurizer failed on 6b5m with error index 555 is out of bounds for axis 0 with size 555. Skipping. (repeated six times)
I used this data to test, and this problem no longer exists in version 0.2.1.
Ah! I think I've found the actual problem now. Nothing to do with lowercase characters at all (sorry for the red herring)... it looks like it happens if the count of uppercase and '-' characters in the last used sequence in the MSA is smaller than the input sequence. boltz_msa_test_case.tar.gz
A test case using a two-sequence MSA is attached. success.a3m has the second sequence with the correct length; fail.a3m has a single character deleted from the second sequence. I've tried with a larger MSA file, setting max_msa_seqs to different values in main.py, and then tinkering with different edits to the sequences. Only the length of the final used sequence seems to matter, and the failure only appears when it's shorter than the input.
In my case it looks like this came from me making a mistake when writing a script to concatenate MSAs from our internal MMSeqs2 server (probably leaving a blank line?) then making things worse when trying to "fix" it due to my unfamiliarity with the .a3m format. After some minor retooling, things are now working correctly. Mea culpa... but at least I hope this will help others avoid the same pitfall!
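For anyone who wants to reproduce this locally, here is a sketch of how an equivalent two-sequence test case could be generated (toy sequences, not the attached archive):

# Build a minimal two-sequence test case of the kind described above.
query = "MAHHHHHHVAVDAVSFTLLQDQLQSVLDTLSEREAGV"   # any query sequence works
hit_ok = "-" * 8 + query[8:]                     # same aligned length as the query
hit_bad = hit_ok[:-1]                            # one aligned character deleted -> should trigger the failure

with open("success.a3m", "w") as fh:
    fh.write(f">query\n{query}\n>hit\n{hit_ok}\n")

with open("fail.a3m", "w") as fh:
    fh.write(f">query\n{query}\n>hit\n{hit_bad}\n")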
@tristanic Thanks for digging in further. Glad you found the issue with your MSAs. I did verify the MSA lengths for the other users and did not find inconsistencies, so I'm not sure. Maybe there was a bug that got solved in the new releases. Will wait to hear from others before closing this issue.
Might it be worth adding a little sanity checking in the parser to ensure each entry in the MSA meets Boltz's expectations?
It sure would :)
I'm getting this error with the following inputs (worth mentioning that it appears to still be running):
boltzenv boltz predict /opt/run_output/boltz/multimer.yaml --cache /databases/colabfold/boltz/weights/ --out_dir /opt/run_output/boltz/outputs/ --devices 1
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Predicting: | | 0/? [00:00<?, ?it/s]
Featurizer failed on multimer with error index 458752 is out of bounds for axis 0 with size 458752. Skipping.
Featurizer failed on multimer with error index 458752 is out of bounds for axis 0 with size 458752. Skipping.
Featurizer failed on multimer with error index 458752 is out of bounds for axis 0 with size 458752. Skipping.
Featurizer failed on multimer with error index 458752 is out of bounds for axis 0 with size 458752. Skipping.
...
sequences:
  - protein:
      id: A
      msa: /opt/run_output/colabfold/outputs/0.a3m
      sequence: MAHHHHHHVAVDAVSFTLLQDQLQSVLDTLSEREAGVVRLRFGLTDGQPRTLDEIGQVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLDGSSGSGTPEERLLRAIFGEKA
  - protein:
      id: B
      msa: /opt/run_output/colabfold/outputs/1.a3m
      sequence: MRYAFAAEATTCNAFWRNVDMTVTALYEVPLGVCTQDPDRWTTTPDDEAKTLCRACPRRWLCARDAVESAGAEGLWAGVVIPESGRARAFALGQLRSLAERNGYPVRDHRVSAQSA
version: 1
I got the MSAs by running ColabFold locally.
gpu:
Fri Nov 22 14:23:35 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:07:00.0 Off | 0 |
| N/A 33C P0 77W / 400W | 4115MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:BD:00.0 Off | 0 |
| N/A 35C P0 66W / 400W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
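If it helps anyone debug, here is a rough pre-flight check (a hypothetical script, assuming PyYAML is installed and the YAML layout shown above, with the path taken from the command above): it verifies that each protein sequence in the YAML matches the first (query) entry of its .a3m.

import yaml  # pip install pyyaml

def first_a3m_entry(path):
    """Return the first (query) sequence of an .a3m file."""
    seq = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if seq:
                    break  # only the first entry is needed
                continue
            if line:
                seq.append(line)
    return "".join(seq)

with open("/opt/run_output/boltz/multimer.yaml") as fh:
    config = yaml.safe_load(fh)

for entry in config["sequences"]:
    protein = entry["protein"]
    query = first_a3m_entry(protein["msa"])
    if query != protein["sequence"]:
        print(f"chain {protein['id']}: YAML sequence ({len(protein['sequence'])} aa) "
              f"does not match MSA query ({len(query)} aa)")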
@danpf Are you on the latest boltz release?
That was on commit 64be4d4351da47a14703f15ad3c361c054ed6cb1, which was not the latest release.
On the newest release I get this error:
(boltzenv) root@boltzeval004-n82zh:/opt/boltz2_dist# boltz predict /opt/run_output/boltz/multimer.yaml --cache /databases/colabfold/boltz/weights/ --out_dir /opt/run_output/boltz/outputs/ --devices 1
Checking input data.
Running predictions for 1 structure
Processing input data.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 25.21it/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
You are using a CUDA device ('NVIDIA A100-SXM4-80GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Predicting: | | 0/? [00:00<?, ?it/s]ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1243, in _try_get_data
data = self._data_queue.get(timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/queue.py", line 180, in get
self.not_empty.wait(remaining)
File "/opt/conda/envs/boltzenv/lib/python3.12/threading.py", line 359, in wait
gotit = waiter.acquire(True, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/torch/utils/data/_utils/signal_handling.py", line 73, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 8329) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/envs/boltzenv/bin/boltz", line 8, in <module>
sys.exit(cli())
^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/boltz/main.py", line 529, in predict
trainer.predict(
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 858, in predict
return call._call_and_handle_interrupt(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 47, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 897, in _predict_impl
results = self._run(model, ckpt_path=ckpt_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 981, in _run
results = self._run_stage()
^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 1020, in _run_stage
return self.predict_loop.run()
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/loops/utilities.py", line 178, in _decorator
return loop_run(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/loops/prediction_loop.py", line 121, in run
batch, batch_idx, dataloader_idx = next(data_fetcher)
^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/loops/fetchers.py", line 133, in __next__
batch = super().__next__()
^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/loops/fetchers.py", line 60, in __next__
batch = next(self.iterator)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
out = next(self._iterator)
^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/pytorch_lightning/utilities/combined_loader.py", line 142, in __next__
out = next(self.iterators[0])
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1448, in _next_data
idx, data = self._get_data()
^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1402, in _get_data
success, data = self._try_get_data()
^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/boltzenv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1256, in _try_get_data
raise RuntimeError(
RuntimeError: DataLoader worker (pid(s) 8329) exited unexpectedly
Predicting: | | 0/? [00:21<?, ?it/s]
It is worth mentioning that I have 2 A100s provisioned along with 2024Gi of RAM.
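The shared-memory bus error usually points at the size of /dev/shm inside the container rather than at total host RAM; a quick generic check from Python (not Boltz-specific):

import shutil

# PyTorch DataLoader workers pass tensors through /dev/shm; in containers it
# often defaults to 64 MB no matter how much RAM the host has (Linux only).
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {total / 2**20:.0f} MiB total, {free / 2**20:.0f} MiB free")

If it turns out to be tiny, enlarging it (for Docker, the --shm-size run option) or running the prediction dataloader with fewer workers should make the bus errors go away.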
I'm using the latest version (actually the bleeding-edge version, reinstalled from the GitHub repo a couple of hours ago). I had no problem with @danpf's example on the first run, but repeated runs with boltz predict ./multimer.yaml --override sometimes run and sometimes fail. Something wrong with reading/writing the cached .npz files?
Thanks for the info @tristanic. I ran it 10x on the newest version (I was previously on 64be4d4351da47a14703f15ad3c361c054ed6cb1) and my errors were:
1. Featurizer failed on multimer with error index 458752 is out of bounds for axis 0 with size 458752. Skipping.
2. ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
3. ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
4. ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
5. ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
6. ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
7. ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
8. ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
9. ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
10. ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
My first successful run had max_msa_seqs set to some fairly small number (currently you have to edit main.py to change that). I don't remember what the number was (probably only 8 - tiny, but the prediction still looked pretty good). Running with standard settings gives me a "Ran out of memory"...
Running
boltz predict ./examples/multimer.yaml --cache /databases/colabfold/boltz/weights/ --out_dir /opt/run_boltz_example --use_msa_server
works with no problem.
Peeking at the MSAs from the ColabFold server, they are very different:
(boltzenv) root@boltzeval004-vvgm9:/opt/boltz# tail -n +1 /opt/run_boltz_example/boltz_results_multimer/msa/multimer_*
==> /opt/run_boltz_example/boltz_results_multimer/msa/multimer_0.a3m <==
>101
MAHHHHHHVAVDAVSFTLLQDQLQSVLDTLSEREAGVVRLRFGLTDGQPRTLDEIGQVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLDGSSGSGTPEERLLRAIFGEKA
>UniRef100_UPI0020946B8C
--------VPADAVSFTLLQEQLHSVLDTLSEREAGVVAMRFGLTDGQPKTLDEIGKVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLD---------------------
>UniRef100_A0A7G1IHR1
------------------------------------MVRLRFGLTDGQPRTLDEIGQVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLD---------------------
>UniRef100_UPI00227730C9
--------VAVDAVSFTLLQDQLQSVLETLSEREAGVVRLRFGLTDGQPRTLDEIGQVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLD---------------------
>UniRef100_UPI001C20C49A
MAHHHHHHVAVDAVSFTLLQDQLQSVLDTLSEREAGVVRLRFGLTDGQPRTLDEIGQVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLDGSSGSGTPEERLLRAIFGEKA
>UniRef100_A0A1A6BGF2
----------MDAVSFTLLQDQLQSVLDTLSEREAGVVRLRFGLTDGQPRTLDEIGQVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLD---------------------
>UniRef100_A0A5C7WQ46
--------VAVDAVSFTLLQDQLQSVLETLSEREAGVVRLRFGLTDGQPRTLDEIGQVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLD---------------------
>UniRef100_A0A7K0YHP6
--------VPADAVSFTLLQEQLHSVLDTLSEREAGVVAMRFGLTDGQPKTLDEIGKVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLD---------------------
>UniRef100_A0A965INX0
-----------EAVTRIMLSQQIEQLLHNLPEREAGVIRMRFGLDDGQIHTLDDIGKRYNVTRERIRQIESKTMSKLRHPSRSQVLRDFFD---------------------
==> /opt/run_boltz_example/boltz_results_multimer/msa/multimer_1.a3m <==
>102
MRYAFAAEATTCNAFWRNVDMTVTALYEVPLGVCTQDPDRWTTTPDDEAKTLCRACPRRWLCARDAVESAGAEGLWAGVVIPESGRARAFALGQLRSLAERNGYPVRDHRVSAQSA
>UniRef100_A0A961DSR9
---------------------------------CISDPDRWAAGGEDpELKALCRGCPRRWQCAKDALDTPGAEGMWSGVHIPKEGRGRNFALRQLRSLATHGG-------------
>UniRef100_X7YC61
--------------------MTATALYEVPLGVCTQDPDRWTTTPDNEAKAMCRACPRRWACARDAVESPGAEGLWAGVVIPESGRARAFALGQLRSLAERNGYPVRE-RVVAQSA
>UniRef100_UPI0009FDEDE2
---------------------------------CISDPDRWAAGGEDpELKALCRGCPRRWQCAKDALDTPGAEGMWSGVNIPKEGRGRKFALRQLRSLAAHGGFTVAD--------
>UniRef100_UPI00214E9A98
MRYAFAAESTTCNAFWRNVDMTVTALYEVPLGVCTQDPDRWTTTPDDEAKTLCRACPRRWLCARDAVESAGAEGLWAGVVIPESGRARAFALGQLRSLAERNGYPVRDHRVSAQSA
>UniRef100_UPI00197F9F9C
--------------------MTATTLYEIPqLGVCTQDPDRWTTTPDDEAKTLCRACPRRWLCAKEAVESPGAEGLWAGVVIPDSGRPRAFALAQLRSLAERNGFAVRE-RVTAQSA
>UniRef100_A0A3S0RV96
--------------------MTATTLYEVPqLGVCTQDPDRWTTTPDDEAKTLCRACPRRWLCAKEAVESPGAEGLWAGVVIPDSGRPRAFALAQLRSLAERNGFAVRE-RVTAQSA
>UniRef100_A0A7V3I0A1
-------------------------------GACARDPERWTTAPDNEAKALCRACPRRWPCARDACELPGAEGLWAGVVIPEAGRPRAFALRQLRSLAERHGYPVRDPKVPAQPA
>UniRef100_A0A941Y9P8
--------------------MSAVTYLDIPIGACTRDPERWTTAADDDAKAICRACPRRWLCARDACELPRAEGLWAGIVIPEAGRGRTFALRQLRSLAERNGYPVRaTRRVFPESA
whereas my local run produces 2.8 MB and 1.3 MB a3m files.
I assume the server must do some trimming/clustering that the local one does not do... I will look into options to reduce the MSA.
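One crude way to shrink a local a3m before prediction is to keep only the first N entries (a hypothetical helper; note it ignores pairing between chains for multimer MSAs, and Boltz also caps what it reads internally via max_msa_seqs in main.py):

def truncate_a3m(src, dst, max_seqs=512):
    """Keep only the first max_seqs entries of an .a3m file."""
    kept = 0
    keep = False
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                kept += 1
                keep = kept <= max_seqs
            if keep:
                fout.write(line)

truncate_a3m("0.a3m", "0.small.a3m", max_seqs=512)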
We get a similar error for some inputs.
Lots of
Featurizer failed on XXX with error index 232820 is out of bounds for axis 0 with size 232820. Skipping.
eventually followed by
RecursionError: maximum recursion depth exceeded
RecursionError: maximum recursion depth exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 351, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/opt/venv/lib/python3.12/site-packages/boltz/data/module/inference.py", line 169, in __getitem__
    return self.__getitem__(0)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/boltz/data/module/inference.py", line 169, in __getitem__
    return self.__getitem__(0)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/boltz/data/module/inference.py", line 169, in __getitem__
    return self.__getitem__(0)
           ^^^^^^^^^^^^^^^^^^^
  [Previous line repeated 959 more times]
  File "/opt/venv/lib/python3.12/site-packages/boltz/data/module/inference.py", line 146, in __getitem__
    return self.__getitem__(0)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/boltz/data/module/inference.py", line 146, in __getitem__
    return self.__getitem__(0)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/boltz/data/module/inference.py", line 146, in __getitem__
    return self.__getitem__(0)
           ^^^^^^^^^^^^^^^^^^^
  [Previous line repeated 8 more times]
I believe this is actually what we have to run to recreate the server results: https://github.com/soedinglab/MMseqs2-App/blob/master/backend/worker.go#L953, not colabfold_search.
Actually, it is easier if you just fork ColabFold and comment out the unlink commands around this location: https://github.com/sokrypton/ColabFold/blob/main/colabfold/mmseqs/search.py#L507
The files you need are the x.paired.a3m files.
Hey guys,
Great stuff, I've got it running on my system, but it seems to have some problems with proteins above a certain size. It runs fine with proteins of size < 100 aa, but anything bigger gives the following error:
Featurizer failed on gpr3 with error index 330 is out of bounds for axis 0 with size 330. Skipping.
At this point the program eats up all my main memory and then crashes... Super happy to share my input files if that helps.
cheers
matt B