google-deepmind / alphafold

Open source code for AlphaFold.
Apache License 2.0

AlphaFold (or hmmsearch) is not able to parse some pdb_seqres.txt due to unusual residue naming #591

Closed TeletcheaLab closed 2 years ago

TeletcheaLab commented 2 years ago

Dear all, I had trouble running a prediction with an updated pdb_seqres.txt file because some entries contain unusual DNA residue names (PDB entries 7ooo, 7oos and 7ozz). These nucleic acids contain modified residues that do not follow the DNA alphabet, so the parser fails with an error on the character "0" (zero).

Traceback here and details below:

Traceback (most recent call last):
  File "/app/alphafold/run_alphafold.py", line 422, in <module>
    app.run(main)
  File "/opt/alphafoldenv/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/opt/alphafoldenv/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/app/alphafold/run_alphafold.py", line 398, in main
    predict_structure(
  File "/app/alphafold/run_alphafold.py", line 172, in predict_structure
    feature_dict = data_pipeline.process(
  File "/app/alphafold/alphafold/data/pipeline_multimer.py", line 264, in process
    chain_features = self._process_single_chain(
  File "/app/alphafold/alphafold/data/pipeline_multimer.py", line 212, in _process_single_chain
    chain_features = self._monomer_data_pipeline.process(
  File "/app/alphafold/alphafold/data/pipeline.py", line 185, in process
    pdb_templates_result = self.template_searcher.query(msa_for_templates)
  File "/app/alphafold/alphafold/data/tools/hmmsearch.py", line 79, in query
    return self.query_with_hmm(hmm)
  File "/app/alphafold/alphafold/data/tools/hmmsearch.py", line 112, in query_with_hmm
    raise RuntimeError(
RuntimeError: hmmsearch failed: stdout:

hmmsearch :: search profile(s) against a sequence database
HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Copyright (C) 2020 Howard Hughes Medical Institute.
Freely distributed under the BSD open source license.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
query HMM file: /tmp/tmp2i0w1r3m/query.hmm
target sequence database: /scratch/shared/dataset/alphafold_data/pdb_seqres/pdb_seqres.txt
MSA of all hits saved to file: /tmp/tmp2i0w1r3m/output.sto
show alignments in output: no
sequence reporting threshold: E-value <= 100
domain reporting threshold: E-value <= 100
sequence inclusion threshold: E-value <= 100
domain inclusion threshold: E-value <= 100
MSV filter P threshold: <= 0.1
Vit filter P threshold: <= 0.1
Fwd filter P threshold: <= 0.1
number of worker threads: 8
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query: query [M=242]

stderr: Parse failed (sequence file /scratch/shared/dataset/alphafold_data/pdb_seqres/pdb_seqres.txt): Line 1364756: illegal character 0

After manually editing the file to remove the offending codes such as "05H" (the modified DNA nucleotides), the error is gone. Here is the full diff:

diff -Naup pdb_seqres/pdb_seqres.txt-orig pdb_seqres/pdb_seqres.txt
--- pdb_seqres/pdb_seqres.txt-orig 2022-09-13 00:19:53.000000000 +0200
+++ pdb_seqres/pdb_seqres.txt 2022-09-13 00:36:37.000000000 +0200
@@ -1360655,9 +1360655,9 @@ CAAAGAAAAG
 7ooo_D mol:na length:10 RNA (5'-R(CPAPAPAPGPAPAPAPAPG)-3')
 CAAAGAAAAG
 7ooo_B mol:na length:11 DNA (5'-D(CPTP(RWQ)PTPCPTPTPTPG)-3')
-CT05ATCTTTG
+CTATCTTTG
 7ooo_E mol:na length:11 DNA (5'-D(CPTP(RWQ)PTPCPTPTPTPG)-3')
-CT05ATCTTTG
+CTATCTTTG
 7oop_A mol:protein length:1970 DNA-directed RNA polymerase II subunit RPB1
 MHGGGPPSGDSACPLRTIKRVQFGVLSPDELKRMSVTEGGIKYPETTEGGRPKLGGLMDPRQGVIERTGRCQTCAGNMTECPGHFGHIELAKPVFHVGFLVKTMKVLRCVCFFCSKLLVDSNNPKIKDILAKSKGQPKKRLTHVYDLCKGKNICEGGEEMDNKFGVEQPEGDEDLTKEKGHGGCGRYQPRIRRSGLELYAEWKHVNEDSQEKKILLSPERVHEIFKRISDEECFVLGMEPRYARPEWMIVTVLPVPPLSVRPAVVMQGSARNQDDLTHKLADIVKINNQLRRNEQNGAAAHVIAEDVKLLQFHVATMVDNELPGLPRAMQKSGRPLKSLKQRLKGKEGRVRGNLMGKRVDFSARTVITPDPNLSIDQVGVPRSIAANMTFAEIVTPFNIDRLQELVRRGNSQYPGAKYIIRDNGDRIDLRFHPKPSDLHLQTGYKVERHMCDGDIVIFNRQPTLHKMSMMGHRVRILPWSTFRLNLSVTTPYNADFDGDEMNLHLPQSLETRAEIQELAMVPRMIVTPQSNRPVMGIVQDTLTAVRKFTKRDVFLERGEVMNLLMFLSTWDGKVPQPAILKPRPLWTGKQIFSLIIPGHINCIRTHSTHPDDEDSGPYKHISPGDTKVVVENGELIMGILCKKSLGTSAGSLVHISYLEMGHDITRLFYSNIQTVINNWLLIEGHTIGIGDSIADSKTYQDIQNTIKKAKQDVIEVIEKAHNNELEPTPGNTLRQTFENQVNRILNDARDKTGSSAQKSLSEYNNFKSMVVSGAKGSKINISQVIAVVGQQNVEGKRIPFGFKHRTLPHFIKDDYGPESRGFVENSYLAGLTPTEFFFHAMGGREGLIDTAVKTAETGYIQRRLIKSMESVMVKYDATVRNSINQVVQLRYGEDGLAGESVEFQNLATLKPSNKAFEKKFRFDYTNERALRRTLQEDLVKDVLSNAHIQNELEREFERMREDREVLRVIFPTGDSKVVLPCNLLRMIWNAQKIFHINPRLPSDLHPIKVVEGVKELSKKLVIVNGDDPLSRQAQENATLLFNIHLRSTLCSRRMAEEFRLSGEAFDWLLGEIESKFNQAIAHPGEMVGALAAQSLGEPATQMTLNTFHYAGVSAKNVTLGVPRLKELINISKKPKTPSLTVFLLGQSARDAERAKDILCRLEHTTLRKVTANTAIYYDPNPQSTVVAEDQEWVNVYYEMPDFDVARISPWLLRVELDRKHMTDRKLTMEQIAEKINAGFGDDLNCIFNDDNAEKLVLRIRIMNSDENKMQEEEEVVDKMDDDVFLRCIESNMLTDMTLQGIEQISKVYMHLPQTDNKKKIIITEDGEFKALQEWILETDGVSLMRVLSEKDVDPVRTTSNDIVEIFTVLGIEAVRKALERELYHVISFDGSYVNYRHLALLCDTMTCRGHLMAITRHGVNRQDTGPLMKCSFEETVDVLMEAAAHGESDPMKGVSENIMLGQLAPAGTGCFDLLLDAEKCKYGMEIPTNIPGLGAAGPTGMFFGSAPSPMGGISPAMTPWNQGATPAYGAWSPSVGSGMTPGAAGFSPSAASDASGFSPGYSPAWSPTPGSPGSPGPSSPYIPSPGGAMSPSYSPTSPAYEPRSPGGYTPQSPSYSPTSPSYSPTSPSYSPTSPNYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPNYSPTSPNYTPTSPSYSPTSPSYSPTSPNYTPTSPNYSPTSPSYSPTSPSYSPTSPSYSPSSPRYTPQSPTYTPSSPSYSPSSPSYSPTSPKYTPTSPSYSPSSPEYTPTSPKYSPTSPKYSPTSPKYSPTSPTYSPTTPKYSPTSPTYSPTSPVYTPTSPKYSPTSPTYSPTSPKYSPTSPTYSPTSPKGSTYSPTSPGYSPTSPTYSLTSPAISPDDSDEEN
 7oop_J mol:protein length:67 DNA-directed RNA polymerases I, II, and III subunit RPABC5
@@ -1360717,7 +1360717,7 @@ MWKDKEFQVLFVLTILTLISGTIFYSTVEGLRPIDALYFS
 7oos_A mol:na length:10 RNA (5'-R(CPAPAPAPGPAPAPAPAPG)-3')
 CAAAGAAAAG
 7oos_B mol:na length:11 DNA (5'-D(CPTP(RWT)PTPCPTPTPTPG)-3')
-CT05KTCTTTG
+CTTCTTTG
 7oot_A mol:protein length:141 Interferon regulatory factor 4
 MGSHHHHHHSAALEVLFQGPGGNGKLRQWLIDQIDSGKYPGLVWENEEKSIFRIPWKHAGKQDYNREEDAALFKAWALFKGKFREGIDKPDPPTWKTRLRCALNKSNDFEELVERSQLDISDPYKVYRIVPEGAKKGAKQL
 7oot_B mol:protein length:141 Interferon regulatory factor 4
@@ -1364753,7 +1364753,7 @@ GSHMEYELPEDPKWEFPRDKLTLGKPLGEGCFGQVVMAEA
 7ozz_A mol:na length:10 RNA (5'-R(CPAPAPAPGPAPAPAPAPG)-3')
 CAAAGAAAAG
 7ozz_B mol:na length:11 DNA (5'-D(CPTP(RWR)PTPCPTPTPTPG)-3')
-CT05HTCTTTG
+CTTCTTTG
 7p00_H mol:protein length:298 Antibody fragment scFv16
 MKFLVNVALVFMVVYISYIYADYKDDDDKHHHHHHHHHHLEVLFQGPDVQLVESGGGLVQPGGSRKLSCSASGFAFSSFGMHWVRQAPEKGLEWVAYISSGSGTIYYADTVKGRFTISRDDPKNTLFLQMTSLRSEDTAMYYCVRSIYYYGSSPFDFWGQGTTLTVSSGGGGSGGGGSGGGGSDIVMTQATSSVPVTPGESVSISCRSSKSLLHSNGNTYLYWFLQRPGQSPQLLIYRMSNLASGVPDRFSGSGSGTAFTLTISRLEAEDVGVYYCMQHLEYPLTFGAGTKLELKAAA
 7p00_B mol:protein length:354 Guanine nucleotide-binding protein G(I)/G(S)/G(T) subunit beta-1
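For anyone who would rather not patch the file by hand, the same cleanup could probably be scripted. The following is only a minimal sketch, not part of AlphaFold or HMMER. It assumes that pdb_seqres.txt is FASTA-like (a header line starting with ">" followed by the sequence) and that a sequence is acceptable only if it consists entirely of ASCII letters. Unlike the manual diff above, it drops the whole offending record instead of editing the sequence. The function names are hypothetical.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumption: only ASCII letters are acceptable in a sequence. Anything else
# (for example the digit "0" in "CT05HTCTTTG") marks the record as unparseable.
_VALID_SEQ = re.compile(r"^[A-Za-z]+$")


def iter_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-like lines."""
    header = None
    seq_parts = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq_parts)
            header, seq_parts = line, []
        elif header is not None and line:
            seq_parts.append(line)
    if header is not None:
        yield header, "".join(seq_parts)


def filter_seqres(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Keep only records whose sequence is made of letters alone."""
    for header, seq in iter_records(lines):
        if _VALID_SEQ.match(seq):
            yield header, seq


def clean_file(src_path: str, dst_path: str) -> int:
    """Copy src_path to dst_path without bad records; return records kept."""
    kept = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for header, seq in filter_seqres(src):
            dst.write(f"{header}\n{seq}\n")
            kept += 1
    return kept
```

It could be used as `clean_file("pdb_seqres.txt-orig", "pdb_seqres.txt")`, with file names chosen to match the diff above. Dropping the record loses the nucleic acid chain as a template candidate, which should not matter for a protein template search.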

I do not think this error (the "Parse failed" message) should be attributed to hmmsearch, but rather to AlphaFold. Maybe an exception should be raised without halting the whole process?
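To make the suggestion concrete: the traceback shows that the failure surfaces as a RuntimeError raised from the template searcher's query call. A minimal sketch of "log the error but keep going" could look like the following. The helper name `query_or_none` is hypothetical and this is not how AlphaFold currently behaves; the caller would still have to handle a missing template result.

```python
import logging
from typing import Callable, Optional


def query_or_none(query: Callable[[str], str], msa: str) -> Optional[str]:
    """Run a template search; return None instead of aborting on RuntimeError."""
    try:
        return query(msa)
    except RuntimeError as err:
        # Hypothetical policy: report the failure and continue without templates.
        logging.warning("Template search failed, continuing without templates: %s", err)
        return None
```

Here `query` stands in for something like the searcher's query method seen in the traceback. Whether the rest of the pipeline can run without a template result is an open question that this sketch does not address.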

Thanks a lot for your time. I'll report this to the hmmsearch developers too (linking this issue).

YoshitakaMo commented 2 years ago

I've also encountered this issue. I hope pdb_seqres.txt itself or the AlphaFold pipeline will be improved.

tfgg commented 2 years ago

Duplicate of #569 - please follow updates on that issue.