google-deepmind / alphafold

Open source code for AlphaFold 2.
Apache License 2.0

Jackhmmer / stockholm.c "No space left on device" #778

Open neoformit opened 1 year ago

neoformit commented 1 year ago

Related to https://github.com/deepmind/alphafold/issues/280.

We are getting a disk write error after Jackhmmer completes:

I0613 00:13:04.128991 139887425492800 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I0613 01:04:43.969136 139887425492800 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 3099.840 seconds
Traceback (most recent call last):
  File "/app/alphafold/run_alphafold.py", line 432, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/app/alphafold/run_alphafold.py", line 408, in main
    predict_structure(
  File "/app/alphafold/run_alphafold.py", line 172, in predict_structure
    feature_dict = data_pipeline.process(
  File "/app/alphafold/alphafold/data/pipeline.py", line 163, in process
    jackhmmer_uniref90_result = run_msa_tool(
  File "/app/alphafold/alphafold/data/pipeline.py", line 94, in run_msa_tool
    result = msa_runner.query(input_fasta_path, max_sto_sequences)[0]  # pytype: disable=wrong-arg-count
  File "/app/alphafold/alphafold/data/tools/jackhmmer.py", line 170, in query
    return self.query_multiple([input_fasta_path], max_sequences)[0]
  File "/app/alphafold/alphafold/data/tools/jackhmmer.py", line 181, in query_multiple
    single_chunk_results.append([self._query_chunk(
  File "/app/alphafold/alphafold/data/tools/jackhmmer.py", line 142, in _query_chunk
    raise RuntimeError(
RuntimeError: Jackhmmer failed
stderr:
Fatal exception (source file esl_msafile_stockholm.c, line 1278):
stockholm msa write failed
system error: No space left on device

After thousands of successful AF2 runs on our infrastructure, this has only occurred with the following protein input (2273AA):

>sequence_0
MGFVRQIQLLLWKNWTLRKRQKIRFVVELVWPLSLFLVLIWLRNANPLYSHHECHFPNKAMPSAGMLPWLQGIFCNVNNPCFQSPTPGESPGIVSNYNNSILARVYRDFQELLMNAPESQHLGRIWTELHILSQFMDTLRTHPERIAGRGIRIRDILKDEETLTLFLIKNIGLSDSVVYLLINSQVRPEQFAHGVPDLALKDIACSEALLERFIIFSQRRGAKTVRYALCSLSQGTLQWIEDTLYANVDFFKLFRVLPTLLDSRSQGINLRSWGGILSDMSPRIQEFIHRPSMQDLLWVTRPLMQNGGPETFTKLMGILSDLLCGYPEGGGSRVLSFNWYEDNNYKAFLGIDSTRKDPIYSYDRRTTSFCNALIQSLESNPLTKIAWRAAKPLLMGKILYTPDSPAARRILKNANSTFEELEHVRKLVKAWEEVGPQIWYFFDNSTQMNMIRDTLGNPTVKDFLNRQLGEEGITAEAILNFLYKGPRESQADDMANFDWRDIFNITDRTLRLVNQYLECLVLDKFESYNDETQLTQRALSLLEENMFWAGVVFPDMYPWTSSLPPHVKYKIRMDIDVVEKTNKIKDRYWDSGPRADPVEDFRYIWGGFAYLQDMVEQGITRSQVQAEAPVGIYLQQMPYPCFVDDSFMIILNRCFPIFMVLAWIYSVSMTVKSIVLEKELRLKETLKNQGVSNAVIWCTWFLDSFSIMSMSIFLLTIFIMHGRILHYSDPFILFLFLLAFSTATIMLCFLLSTFFSKASLAAACSGVIYFTLYLPHILCFAWQDRMTAELKKAVSLLSPVAFGFGTEYLVRFEEQGLGLQWSNIGNSPTEGDEFSFLLSMQMMLLDAAVYGLLAWYLDQVFPGDYGTPLPWYFLLQESYWLGGEGCSTREERALEKTEPLTEETEDPEHPEGIHDSFFEREHPGWVPGVCVKNLVKIFEPCGRPAVDRLNITFYENQITAFLGHNGAGKTTTLSILTGLLPPTSGTVLVGGRDIETSLDAVRQSLGMCPQHNILFHHLTVAEHMLFYAQLKGKSQEEAQLEMEAMLEDTGLHHKRNEEAQDLSGGMQRKLSVAIAFVGDAKVVILDEPTSGVDPYSRRSIWDLLLKYRSGRTIIMSTHHMDEADLLGDRIAIIAQGRLYCSGTPLFLKNCFGTGLYLTLVRKMKNIQSQRKGSEGTCSCSSKGFSTTCPAHVDDLTPEQVLDGDVNELMDVVLHHVPEAKLVECIGQELIFLLPNKNFKHRAYASLFRELEETLADLGLSSFGISDTPLEEIFLKVTEDSDSGPLFAGGAQQKRENVNPRHPCLGPREKAGQTPQDSNVCSPGAPAAHPEGQPPPEPECPGPQLNTGTQLVLQHVQALLVKRFQHTIRSHKDFLAQIVLPATFVFLALMLSIVIPPFGEYPALTLHPWIYGQQYTFFSMDEPGSEQFTVLADVLLNKPGFGNRCLKEGWLPEYPCGNSTPWKTPSVSPNITQLFQKQKWTQVNPSPSCRCSTREKLTMLPECPEGAGGLPPPQRTQRSTEILQDLTDRNISDFLVKTYPALIRSSLKSKFWVNEQRYGGISIGGKLPVVPITGEALVGFLSDLGRIMNVSGGPITREASKEIPDFLKHLETEDNIKVWFNNKGWHALVSFLNVAHNAILRASLPKDRSPEEYGITVISQPLNLTKEQLSEITVLTTSVDAVVAICVIFSMSFVPASFVLYLIQERVNKSKHLQFISGVSPTTYWVTNFLWDIMNYSVSAGLVVGIFIGFQKKAYTSPENLPALVALLLLYGWAVIPMMYPASFLFDVPSTAYVALSCANLFIGINSSAITFILELFENNRTLLRFNAVLRKLLIVFPHFCLGRGLIDLALSQAVTDVYARFGEEHSANPFHWDLIGKNLFAMVVEGVVYFLLTLLVQRHFFLSQWIAEPTKEPIVDEDDDVAEERQRIITGGNKTDILRLHELTKIYPGTSSPAVDRLCVGVRPGECFGLLGVNGAGKTTTFKMLTGDTTVTSGDATVAG
KSILTNISEVHQNMGYCPQFDAIDELLTGREHLYLYARLRGVPAEEIEKVANWSIKSLGLTVYADCLAGTYSGGNKRKLSTAIALIGCPPLVLLDEPTTGMDPQARRMLWNVIVSIIREGRAVVLTSHSMEECEALCTRLAIMVKGAFRCMGTIQHLKSKFGDGYIVTMKIKSPKDDLLPDLNPVEQFFQGNFPGSVQRERHYNMLQFQVSSSSLARIFQLLLSHKDSLLIEEYSVTQTTLDQVFVNFAKQQTESHDLPLHPRAAGASRQAQD

According to the docker container, we have 86G of disk available at runtime for the tmp directory:

ubuntu@ip-0A000007:~$ docker exec -it 99eaf4da5086 /bin/bash
I have no name!@99eaf4da5086:/mnt/pulsar/files/staging/6489372/working$ df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
overlay         124G   39G   86G  32% /
I have no name!@99eaf4da5086:/mnt/pulsar/files/staging/6489372/working$ exit
exit
ubuntu@ip-0A000007:~$ df -h /mnt/scratch/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1        63G   53M   60G   1% /mnt

Furthermore, I don't see any sign of disk filling up when I run watch df -h on the host. There is plenty of disk available on the root partition, where both /var/lib/docker and /tmp are located. It is possible that this could be a bug in AlphaFold. Any help would be greatly appreciated!
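To rule out a mismatch between the directory being watched and the one actually used for temporary files, something like the following could be run inside the container. This is only a minimal sketch: it assumes the MSA tool's temporary output lands in Python's default temp directory (which honours the TMPDIR environment variable), and `describe_tmp_space` is a hypothetical helper name, not part of AlphaFold.

```python
import os
import shutil
import tempfile


def describe_tmp_space(path=None):
    """Return (directory, free_bytes) for the given directory.

    If no path is given, use the directory Python's tempfile module would
    pick (assumption: this is where the temporary MSA output is written).
    """
    directory = path or tempfile.gettempdir()
    usage = shutil.disk_usage(directory)
    return directory, usage.free


if __name__ == "__main__":
    directory, free_bytes = describe_tmp_space()
    print(f"TMPDIR env: {os.environ.get('TMPDIR')!r}")
    print(f"temp dir:   {directory}")
    print(f"free space: {free_bytes / 1e9:.1f} GB")
```

If the reported directory sits on a different filesystem than the one shown by df -h /tmp, that filesystem is the one to monitor.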

TurnerCD commented 1 year ago

I am getting this as well and have been unable to resolve it.

georgkempf commented 1 year ago

I have seen that for some protein sequences (zinc fingers) the raw MSA size (before truncation/filtering) can be > 100 GB. The "No space left on device" error probably occurs when the sequences from an HHblits/Jackhmmer job are written from RAM to the (temporary) file. Disk usage probably increases very quickly (depending on the transfer rate from RAM to disk) and might be difficult to track with watch df. If the MSA size is the cause, then the RAM usage of the HHblits/Jackhmmer process should already be very large (up to 100 GB) before the write starts.
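One way to test the short-lived-spike hypothesis is to poll free disk space much more often than watch does (its default interval is 2 seconds) and keep only the minimum observed. The following is a rough sketch under that assumption; `track_min_free` is a hypothetical helper, and the path and duration in the usage example are placeholders to adjust to the filesystem that actually holds the temporary MSA file.

```python
import shutil
import time


def track_min_free(path, duration_s, interval_s=0.1):
    """Poll free space on the filesystem holding `path`.

    Returns (min_free_bytes, samples): the lowest free-space reading seen
    during `duration_s` seconds and the number of readings taken.
    """
    min_free = shutil.disk_usage(path).free
    samples = 1
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        time.sleep(interval_s)
        min_free = min(min_free, shutil.disk_usage(path).free)
        samples += 1
    return min_free, samples


if __name__ == "__main__":
    # Placeholder values: watch /tmp for one hour at 10 samples per second.
    lowest, count = track_min_free("/tmp", duration_s=3600, interval_s=0.1)
    print(f"lowest free space over {count} samples: {lowest / 1e9:.1f} GB")
```

Running this alongside the Jackhmmer job and comparing the lowest reading with the free space at the start would show whether the disk briefly fills up while the Stockholm file is being written.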