Closed pskvins closed 4 months ago
Good point, true. We did not encounter those which is why we did not run into this issue so far. I also thought that the tokenizer maps input tokens that are not part of the vocabulary automatically to the unknown token. But better to do it explicitly to make things more clear. Before merging just one Q: can you briefly elaborate on your specific mapping of ambiguous amino acid, i.e., standard_aa_dict.update({"B": "D", "Z": "E", "J": "L"})
?
Thank you for looking over it!
For standard_aa_dict.update({"B": "D", "Z": "E", "J": "L"})
, it was a suggestion from Milot Mirdita since it is how mmseqs deals with it,
but after asking him about it one more time, it might be better to remove that line since foldseek does a more complicated job when there are non-canonical amino acids (map non-canonical three-letter residues to different amino acids).
And if you didn't deal with non-canonical amino acids during training (i.e. changing them into
Thanks for clarification :) During training, any "unknown" token was mapped to "X", so if there is no good reason to safely assume that your initial mapping is correct, I would vote for the safer option and rather tell the model "we don't know" aka mapping to "X".
I've removed the line standard_aa_dict.update({"B": "D", "Z": "E", "J": "L"})
from the code. :)
Additionally, I have a question regarding a specific part of the file.
In predict_3Di_encoderOnly.py
, specifically at line 78, there's a line of code that replaces "/" and "." with "_" in uniprot_id
(uniprot_id = uniprot_id.replace("/", "_").replace(".", "_")
).
I understand that this may be necessary for embed.py
, but I'm wondering if it's redundant in predict_3Di_encoderOnly.py
.
This script outputs a fasta file, and it seems that generate_foldseek_db.py
is the subsequent step after predict_3Di_encoderOnly.py
.
The inconsistency in uniprot_id
formatting between the amino acid fasta file and the 3di fasta file could potentially bother the process for users (which I actually experienced once with https://github.com/davidemms/Open_Orthobench/tree/master/BENCHMARKS/Input/Caenorhabditis_elegans.WBcel235.pep.all.fa).
Would it be okay for me to also erase the line I mentioned above if it doesn't introduce any potential errors by its removal?
Yes, please remove this line! - The sole reason for this being there is historical ... I used this code-base at one point to write embeddings as H5 files which did not like those special characters as it interpreted it as file structure. You are completely right: much better not to change headers if not urgently needed to avoid consistency problems later. Feel free to adjust :)
I'm sorry for being late...
I deleted the code I mentioned above for two files, except embed.py
.
Please look over and let me know if anything looks wrong :)
Great, thanks!
Fix the code so that it can deal with any non-standard amino acids.
The script didn't work with the fasta file containing non-standard amino acids other than 'U', 'Z', and 'O', occurring consecutively.
The sequence that made an error is:
which is from https://github.com/davidemms/Open_Orthobench/raw/master/BENCHMARKS/Input/Homo_sapiens.GRCh38.pep.all.fa
Please look into the change and merge it if you think it is okay to incorporate this code.