mheinzinger / ProstT5

Bilingual Language Model for Protein Sequence and Structure
MIT License

Entry error between aa and 3di string #26

Open rakeshr10 opened 3 months ago

rakeshr10 commented 3 months ago

@mheinzinger I get this error for some of my FASTA files when I use the generate foldseek db script. Is it because of a mismatch in the FASTA headers, or because the 3Di sequence was not predicted for a particular sequence?

Error: entry id in amino-acid FASTA file has no corresponding 3Di string

mheinzinger commented 3 months ago

Hm, would you mind sharing an example that triggers this specific error? I have not encountered it so far. The only causes I can imagine are very unusual characters in your headers, or IDs that appear more than once. You could check for both by mapping each sequence ID to a unique hash string and seeing whether the problem persists when you re-run the script with those unique headers.
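A quick way to test both suspicions before renaming anything is to scan the headers directly. The sketch below is not part of ProstT5; the function names `read_ids` and `check_ids` are made up for illustration, and it assumes the ID is the first whitespace-delimited token of each header, which may differ from how the script actually parses headers. The set of "safe" characters is likewise an arbitrary choice.

```python
import re
from collections import Counter

def read_ids(lines):
    """Return the ID of every FASTA header line.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    """
    ids = []
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            ids.append(tokens[0] if tokens else "")
    return ids

def check_ids(aa_ids, tdi_ids):
    """Report duplicate IDs, IDs with unusual characters, and IDs lacking a 3Di entry."""
    counts = Counter(aa_ids)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    # Anything outside letters, digits, '_', '.', '|' and '-' is flagged (arbitrary choice).
    unusual = sorted({i for i in aa_ids if not re.fullmatch(r"[A-Za-z0-9_.|-]+", i)})
    missing = sorted(set(aa_ids) - set(tdi_ids))
    return {"duplicates": duplicates, "unusual": unusual, "missing_3di": missing}
```

To use it, pass the lines of the amino-acid FASTA and of the 3Di FASTA through `read_ids` (for example via `open(path)`), then hand both ID lists to `check_ids`. Any ID listed under `missing_3di` is a candidate for the entry named in the error message.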

rakeshr10 commented 3 months ago

@mheinzinger If this happens, does that mean the entire file will not be converted to a Foldseek database, or does it only affect a particular sequence in the file?

Does the tool expect the fasta headers to be in a specific format?

mheinzinger commented 3 months ago

No, the tool does not expect the FASTA headers to be in a specific format, but if they contain very exotic content, it might lead to weird/unforeseeable downstream effects. Therefore (mostly for debugging), I recommended replacing the headers with something that has to work (e.g. just a string of letters/numbers, nothing else) and seeing whether the problem persists.
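For the header replacement itself, a minimal sketch could look like the following. It is an illustration only: `rename_headers` and the `seq` prefix are hypothetical names, and it uses a running counter rather than a hash so that two identical original headers still end up with different IDs.

```python
def rename_headers(lines, prefix="seq"):
    """Replace each FASTA header with a plain alphanumeric ID.

    Returns the rewritten lines and a dict mapping new ID -> original header,
    so results can be traced back to the original entries afterwards.
    """
    mapping = {}
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Counter-based ID: unique even when original headers repeat.
            new_id = f"{prefix}{len(mapping) + 1:06d}"
            mapping[new_id] = line[1:]
            out.append(">" + new_id)
        else:
            out.append(line)
    return out, mapping
```

Presumably the renamed FASTA would then need to be the input for both the 3Di prediction and the database generation, so that the two files share the same IDs. If the error disappears with these headers, the original headers are the likely cause.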