mheinzinger / ProstT5

Bilingual Language Model for Protein Sequence and Structure
MIT License

Foldseek not liking generate_foldseek_db.py output TSVs #15

Closed martinez-zacharya closed 2 months ago

martinez-zacharya commented 2 months ago

I'm trying to convert 3Di tokens from ProstT5 and the associated amino acids into foldseek databases using generate_foldseek_db.py. However, when I try this, I receive an error like this: Cannot open index file testdb.index.0. Do you happen to know if I need a specific foldseek release to do this? For reference, I'm using version 2.8bd520. The test files I've been using are attached as well, with superfluous .txt endings so GitHub allows me to upload them.

test_3di.fasta.txt query.fasta.txt

kWeissenow commented 2 months ago

Unfortunately, I cannot open your query.fasta.txt file (it says 'Not found'). I generated some dummy sequences (all alanine) to run with your 3Di sequences and successfully generated foldseek databases with both our local installation and version 2.8bd520. Could you double-check that this isn't a file permission error? Is the directory writable for the user you're running the script as?

martinez-zacharya commented 2 months ago

Sorry that the FASTA file can't be found; I'm not sure why that is. I don't think I have a file permission problem, but I may just not know enough about it. I'm able to make subdirectories, files, etc. in the directory, though. Do you have a way to check if this is indeed the problem?
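One way to test the permission question directly is to create, write, and delete a probe file in the directory where the database is being built. The following is a minimal sketch, not part of ProstT5 or foldseek: `check_dir_writable` is a hypothetical helper name, and the `.index.0` suffix is only chosen to resemble the file named in the error message. Passing this check does not rule out every permission or filesystem issue.

```python
import os
import tempfile


def check_dir_writable(path="."):
    """Try to create, write, and delete a probe file in `path`.

    Returns (ok, message). Hypothetical helper for illustration only.
    """
    path = os.path.abspath(path)
    if not os.path.isdir(path):
        return False, f"{path} is not a directory"
    try:
        # The temporary file is removed automatically when the block exits.
        with tempfile.NamedTemporaryFile(
            dir=path, prefix="probe_", suffix=".index.0"
        ) as fh:
            fh.write(b"test")
            fh.flush()
    except OSError as exc:
        return False, f"cannot write in {path}: {exc}"
    return True, f"{path} is writable"


if __name__ == "__main__":
    ok, message = check_dir_writable(".")
    print(message)
```

Run it from the directory in which generate_foldseek_db.py is executed, as the same user.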

kWeissenow commented 2 months ago

I came across a similar error another user had with database creation (https://github.com/sokrypton/ColabFold/issues/589). Are you running the script in a shell with very strict limits on memory or runtime?
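To see which limits apply to the current session, the soft and hard resource limits can be listed from Python, much like `ulimit -a` does in the shell. This is a rough sketch using the standard `resource` module (Unix only); the selection of limits in `LIMITS` is an assumption about what might matter here, not a list taken from foldseek.

```python
import resource

# Limits that could plausibly affect database creation (an assumption).
LIMITS = ["RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_CPU", "RLIMIT_FSIZE", "RLIMIT_NOFILE"]


def describe(value):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {limit name: (soft, hard)} for the limits this platform defines."""
    limits = {}
    for name in LIMITS:
        const = getattr(resource, name, None)
        if const is None:
            continue  # not every platform defines every limit
        soft, hard = resource.getrlimit(const)
        limits[name] = (describe(soft), describe(hard))
    return limits


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name}: soft={soft} hard={hard}")
```

Limits imposed by a batch scheduler or container are not necessarily visible this way, so an all-"unlimited" result does not fully exclude this cause.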

If that's not the issue, you could try commenting out the last three lines of the script (by putting a '#' as the first character) and running it again. You should then find the files 'aa.tsv', '3di.tsv' and 'header.tsv' in your directory. If so, please attach them for me to take a look.
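Before attaching the three files, they can be sanity-checked locally. The sketch below assumes each TSV holds one entry per line in the form `key<TAB>value`, with the same keys in the same order in all three files; that layout is an assumption and should be compared against what generate_foldseek_db.py actually writes. The helper names `load_tsv` and `check_tsvs` are hypothetical. The length comparison relies on 3Di strings having one state per residue.

```python
def load_tsv(path):
    """Return a list of (key, value) pairs; raise ValueError on a malformed line."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.rstrip("\r\n")
            if not line:
                continue  # ignore blank lines
            key, sep, value = line.partition("\t")
            if not sep or not value:
                raise ValueError(f"{path}:{lineno}: expected 'key<TAB>value'")
            rows.append((key, value))
    return rows


def check_tsvs(aa="aa.tsv", tdi="3di.tsv", header="header.tsv"):
    """Return a list of human-readable problems; an empty list means none found."""
    aa_rows, tdi_rows, hdr_rows = load_tsv(aa), load_tsv(tdi), load_tsv(header)
    aa_keys = [key for key, _ in aa_rows]
    tdi_keys = [key for key, _ in tdi_rows]
    hdr_keys = [key for key, _ in hdr_rows]
    if not (aa_keys == tdi_keys == hdr_keys):
        return ["keys differ between the files (or appear in a different order)"]
    problems = []
    for (key, aa_seq), (_, tdi_seq) in zip(aa_rows, tdi_rows):
        # One 3Di state is expected per amino acid residue.
        if len(aa_seq) != len(tdi_seq):
            problems.append(
                f"entry {key}: {len(aa_seq)} residues vs {len(tdi_seq)} 3Di states"
            )
    return problems


if __name__ == "__main__":
    issues = check_tsvs()
    print("\n".join(issues) if issues else "no problems found")
```

An empty result only says the files are internally consistent under the assumed layout; it does not prove foldseek will accept them.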

martinez-zacharya commented 2 months ago

Thank you for troubleshooting this with me! The shell I'm using doesn't have any limits on runtime or memory, and the machine has 64 GB of RAM. I've attached the three .tsv files that the script generated: test_TSVs.zip

martinez-zacharya commented 2 months ago

Actually, I think it does have something to do with permissions, since I was just able to successfully generate the DBs on a different computer with only 32 GB of RAM.

kWeissenow commented 2 months ago

Good to hear! Thanks for letting us know.