mheinzinger / ProstT5

Bilingual Language Model for Protein Sequence and Structure
MIT License
147 stars 13 forks source link

Update predict_3Di_encoderOnly.py #9

Closed pskvins closed 4 months ago

pskvins commented 5 months ago

Fix the code so that it can deal with any non-standard amino acids.

The script didn't work with the fasta file containing non-standard amino acids other than 'U', 'Z', and 'O', occurring consecutively.

The sequence that made an error is:

>ENSP00000434417
MSAVGLVLLVLALRLRATTVKPEEGSFCSNSQVAFRDACYEFVPLGRTFRDAQSWCEGQG
GHLVFIQDEGTQWFLQKHISQDREWWIGLTWNLARNGTTEGPGTWLDTSNVTYSNWHGGQ
AAAAPDTCGHIGRGPSSEWVTSDCAQTFAFMCEFRVGQSLACEGLNATVHCGLGQVIQVQ
DAVYGRQNPHFCTQDAGRPSDLEQGCSWANVKEEVAGQCQELQSCQVAADETYFGNLCPT
QGSYLWVQYQCREALQLMVSSESFIFDNVTISLTWLLSPYIGNLSCIISTGDSHTFDPYN
PPSVSSNVTHQFTSPGEFTVFAECTTSEWHVTAQRQVTVRDKMETLSVTACSGLSQSGAG
PLCQAVFGDPLWIQVELDGGTGVTYTVLLGDITLAESTTQKGSLPYNLILDRETQKLMGP
GRHRLEIQATGNTTTSTISRNITVHLVELLSGLQASWASDHLELGQDLLITISLAQGTPE
ELTFEVAGLNATFSHEQVSFGEPFGICRLAVPVEGTFLVTMLVRNAFSNLSLEIGNITIT
APSGLQEPSGMNAEGKSKDKGDMEVYIQPGPYVDPFTTVTLGWPDNDKELRFQWSCGSCW
ALWSSCVERQLLRTDQRELVVPASCLPPPDSAVTLRLAVLRGQELENRAEQCLYVSAPWE
LRPRVSCERNCRPVNASKDILLRVTMGEDSPVAMFSWYLDNTPTEQAEPLLDACRLRGFW
PRSLTLLQSNTSTLLLNSSFLQSRGEVIRIRATALTRHAYGEDTYVISTVPPREVPACTI
APEEGTVLTSFAIFCNASTALGPLEFCFCLESGSCLHCGPEPALPSVYLPLGEENNDFVL
TVVISATNRAGDTQQTQAMAKVALGDTCVEDVAFQAAVSEKIPTALQGEGGPEQLLQLAK
AVSSMLNQEHESQGSGQSLSIDVRQKVREHVLGSLSAVTTGLEDVQRVQELAEVLREVTC
RSKELTPSAQWEASLALQHASEALLTVSAKARPEDQRRQAATRDLFQAVGSVLEASLSNR
PEEPAEASSSQIATVLRLLRVMEHVQTTLLLGKLPGGLPAMLATPSISVYTNRIQPWSWQ
GSSLRPDAADSATFMLPAASSLSSLEGGQEPVDIKIMSFPKSPFPARSHFDVSGTVGGLR
VTSPSGQLIPVKNLSENIEILLPRHSQRHSQPTVLNLTSPEALWVNVTSGEATLGIQLHW
RPDIALTLSLGYGYHPNKSSYDAQTHLVPMVAPDELPTWILSPQDLRFGEGVYYLTVVPE
SDLEPAPGRDLTVGITTFLSHCVFWDEVQETWDDSGCQVGPRTSPYQTHCLCNHLTFFGS
TFLVMSNAINIHQTAELFATFEDNPVVVTTVGCLCVVYVLVVIWARRKDAQDQAKVKVTV
LEDNDPFAQYHYLVTVYTGHRRGAATSSKVTVTLYGLDGEREPHHLADPDTPVFERGAVD
AFLLSTLFPLGELRSLRLWHDNSGDRPSWYVSRVLVYDLVMDRKWYFLCNSWLSINVGDC
VLDKVFPVATEQDRKQFSHLFFMKTSAGFQDGHIWYSIFSRCARSSFTRVQRVSCCFSLL
LCTMLTSIMFWGVPKDPAEQKMDLGKIEFTWQEVMIGLESSILMFPINLLIVQIFQNTRP
RVAKEQNTGKWDRGSPNLTPSPQPMEDGLLTPEAVTKDVSRIVSSLFKALKVPSPALGWD
SVNLMDINSLLALVEDVIYPQNTSGQVFWEEAKKREDPVTLTLGSSEMKEKSQCPKPKAA
RSGPWKDSAYRQCLYLQLEHVEQELRLVGPRGFSQPHSHAQALRQLQTLKGGLGVQPGTW
APAHASALQVSKPPQGLPWWCILVGWLLVAATSGVAAFFTMLYGLHYGRASSLRWLISMA
VSFVESMFVTQPLKVLGFAAFFALVLKRVDDEEDTVAPLPGHLLGPDPYALFRARRNSSR
DVYQPPLTAAIEKMKTTHLKEQKAFALIREILAYLGFLWMLLLVAYGQRDPSAYHLNRHL
QHSFTRGFSGVLGFREFFKWANTTLVSNLYGHPPGFITDGNSKLVGSAQIRQVRVQESSC
PLAQQPQAYLNGCRAPYSLDAEDMADYGEGWNATTLSNGSTRARTNVKGIPSGANSLCTG
EEATWSPWGLIAKARQEFSAISLTTPGWTP*PELCLWSPLSTTPTSTCSALSR*R*RPAL
WAPFLRTRPCRASACTPSPTAGTPSW*RQSSSTSSSSSTTWWCRASA*VKRRGAISAASG
TFWSWPSSWPAGAPWRCL*RGLSWPKGTSSAAGTTGRKASASVRQQQPMPPLATSLPSWY
SCPQ*SFGICSG*IPK*T*SRQPYAVPGATFQAL*LSSLPCSWLTPSRQT*YLVGNSVPT
KPSLMRRRRWSAFSWESSTTRRSWTIAQCLAPSSLDPALFL*HLWC*TCLSLSSWWPSVR
SKNTISCRRKGRS*ICC**KYSVSWALSLRERSLEAAGSSLGLCPRLATLDQHKLCPRT

which is from https://github.com/davidemms/Open_Orthobench/raw/master/BENCHMARKS/Input/Homo_sapiens.GRCh38.pep.all.fa

Please look into the change and merge it if you think it is okay to incorporate this code.

mheinzinger commented 5 months ago

Good point, true. We did not encounter those which is why we did not run into this issue so far. I also thought that the tokenizer maps input tokens that are not part of the vocabulary automatically to the unknown token. But better to do it explicitly to make things more clear. Before merging just one Q: can you briefly elaborate on your specific mapping of ambiguous amino acid, i.e., standard_aa_dict.update({"B": "D", "Z": "E", "J": "L"}) ?

pskvins commented 5 months ago

Thank you for looking over it! For standard_aa_dict.update({"B": "D", "Z": "E", "J": "L"}), it was a suggestion from Milot Mirdita since it is how mmseqs deals with it, but after asking him about it one more time, it might be better to remove that line since foldseek does a more complicated job when there are non-canonical amino acids (map non-canonical three-letter residues to different amino acids).

And if you didn't deal with non-canonical amino acids during training (i.e. changing them into or X), then I believe it should be better to change all of them into 'X'. If this is the case, then I will remove the line you mentioned.

mheinzinger commented 5 months ago

Thanks for clarification :) During training, any "unknown" token was mapped to "X", so if there is no good reason to safely assume that your initial mapping is correct, I would vote for the safer option and rather tell the model "we don't know" aka mapping to "X".

pskvins commented 5 months ago

I've removed the line standard_aa_dict.update({"B": "D", "Z": "E", "J": "L"}) from the code. :)

Additionally, I have a question regarding a specific part of the file. In predict_3Di_encoderOnly.py, specifically at line 78, there's a line of code that replaces "/" and "." with "_" in uniprot_id (uniprot_id = uniprot_id.replace("/", "_").replace(".", "_")). I understand that this may be necessary for embed.py, but I'm wondering if it's redundant in predict_3Di_encoderOnly.py. This script outputs a fasta file, and it seems that generate_foldseek_db.py is the subsequent step after predict_3Di_encoderOnly.py. The inconsistency in uniprot_id formatting between the amino acid fasta file and the 3di fasta file could potentially bother the process for users (which I actually experienced once with https://github.com/davidemms/Open_Orthobench/tree/master/BENCHMARKS/Input/Caenorhabditis_elegans.WBcel235.pep.all.fa).

Would it be okay for me to also erase the line I mentioned above if it doesn't introduce any potential errors by its removal?

mheinzinger commented 5 months ago

Yes, please remove this line! - The sole reason for this being there is historical ... I used this code-base at one point to write embeddings as H5 files which did not like those special characters as it interpreted it as file structure. You are completely right: much better not to change headers if not urgently needed to avoid consistency problems later. Feel free to adjust :)

pskvins commented 4 months ago

I'm sorry for being late... I deleted the code I mentioned above for two files, except embed.py.

Please look over and let me know if anything looks wrong :)

mheinzinger commented 4 months ago

Great, thanks!