steineggerlab / foldcomp

Compressing protein structures effectively with torsion angles
GNU General Public License v3.0

Subsetting databases #39

Open patrickbryant1 opened 11 months ago

patrickbryant1 commented 11 months ago

Hi,

Thank you for the great resource!

I am having trouble subsetting databases and decompressing subsets of the databases you provide here: https://foldcomp.steineggerlab.workers.dev

According to the instructions, I should be able to decompress a subset of a database given an "id_list.txt".

This is how I do it for e.g. A. thaliana:

head -n 1 data/a_thaliana.lookup
0 AF-A0A178UFC4-F1-model_v4.pdb 0

As I understand it, the ID here is "AF-A0A178UFC4-F1-model_v4".
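One way to check how a name is actually stored is to search the lookup file by accession rather than by the full guessed ID. The following is a minimal sketch, assuming the three-column, whitespace-separated layout shown above (index, name, file number); `find_lookup_names` is a hypothetical helper, not part of Foldcomp or MMseqs2.

```shell
# Hypothetical helper (not part of Foldcomp or MMseqs2).
# Prints the name column (column 2) of every lookup entry whose name
# contains the given substring, e.g. a UniProt accession.
# Assumes whitespace-separated columns: index, name, file number.
find_lookup_names() {
    lookup_file="$1"   # path to a .lookup file, e.g. data/a_thaliana.lookup
    query="$2"         # substring to search for, e.g. A0A178UFC4
    awk -v q="$query" 'index($2, q) > 0 { print $2 }' "$lookup_file"
}
```

For example, `find_lookup_names data/a_thaliana.lookup A0A178UFC4 > id_list.txt` would write the names exactly as they appear in the lookup, including any `.pdb` suffix. Whether a given tool expects the name with or without that suffix is not settled by this sketch.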

I write this ID into a file called id_list.txt and then run:

foldcomp decompress --id-list id_list.txt data/a_thaliana

The response is:

Decompressing files in data/a_thaliana using 1 threads
Output directory: data/a_thaliana_pdb/
[Warning] AF-A0A178UFC4-F1-model_v4 not found in database.

I have tried many different ways of naming the IDs based on what is in a_thaliana.lookup, but nothing works. The same happens when I use mmseqs to subset the database:

"""
createsubdb --subdb-mode 0 --id-mode 1 id_list.txt a_thaliana test_sel/output_foldcomp_db

MMseqs Version: ad6dfc66d7bbc4fd626fc19adf10ba587bc137c4
Subdb mode 0
Database ID mode 1
Verbosity 3

Could not find name AF-A0A178UFC4-F1-model_v4 in lookup
Time for merging to output_foldcomp_db: 0h 0m 0s 1ms
Time for processing: 0h 0m 0s 34ms
"""

Can you please explain what I am doing wrong and how to properly specify the IDs?

Best,

Patrick

patrickbryant1 commented 11 months ago

I noticed that this seems to work with afdb_rep_v4. Perhaps something is missing from the reference genomes?

khb7840 commented 11 months ago

I'm sorry, there was a bug in how the mode for database reading was assigned. Thank you for reporting this. Please check whether it is solved in the latest version.

patrickbryant1 commented 11 months ago

Hi, great, thanks. What do you mean by the latest version?

  1. Of the database from https://foldcomp.steineggerlab.workers.dev
  2. Of Foldcomp
  3. Something else(?)

khb7840 commented 11 months ago

The latest version of Foldcomp. Subsetting 'a_thaliana' should work with Foldcomp built from the latest commit.

patrickbryant1 commented 11 months ago

Ok, great. Does this include the binaries you distribute, or only the pip installation/git clone? Do you know why mmseqs2 fails on the same files? Is something missing from the subsetting instructions there as well?

khb7840 commented 11 months ago

Please use git clone to get the latest update. The Python distribution has not been updated with the latest commit. For the mmseqs2 part, I'm not sure what happened. I'll check this with the mmseqs2 developers.

patrickbryant1 commented 11 months ago

Ok, thanks for the help!