steineggerlab / foldcomp

Compressing protein structures effectively with torsion angles
GNU General Public License v3.0

How to build custom database? #14

Closed yakomaxa closed 1 year ago

yakomaxa commented 1 year ago

Dear Foldcomp developers, thank you for developing such interesting software.

I'm playing with foldcomp and have a set of fcz files created from a corresponding set of pdb files. Is there a way to create a custom database from these fcz files? I want to load them as one concatenated database instead of iterating over individual fcz files in my scripts. Any suggestions would be appreciated. Thank you in advance.

Best

milot-mirdita commented 1 year ago

This is currently a work in progress and will be integrated directly into foldcomp.

In the meantime, you can place all fcz files into one tar archive and run the MMseqs2 command tar2db on it. The foldcomp Python module will be able to read the resulting files.

This will soon be directly integrated into foldcomp.
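
For readers following along, the tar-based workaround described above might look roughly like the sketch below. The names fcz_dir, fcz_files.tar and my_fcz_db are hypothetical, the two .fcz files are empty stand-ins rather than real Foldcomp output, and the tar2db argument order (input archive, then output database) is an assumption worth checking against `mmseqs tar2db -h`.

```shell
# Sketch only: all file and directory names here are hypothetical.
workdir=$(mktemp -d)
mkdir "$workdir/fcz_dir"

# Stand-ins for real Foldcomp output; replace with your own .fcz files.
printf 'placeholder' > "$workdir/fcz_dir/a.fcz"
printf 'placeholder' > "$workdir/fcz_dir/b.fcz"

# Bundle every .fcz file into a single tar archive.
(cd "$workdir/fcz_dir" && tar -cf ../fcz_files.tar *.fcz)

# Count the archive members as a quick sanity check.
members=$(tar -tf "$workdir/fcz_files.tar" | wc -l)
echo "archived $members files"

# Convert the archive into a database, but only if MMseqs2 is installed.
# Assumed usage: mmseqs tar2db <input.tar> <output_db>
if command -v mmseqs >/dev/null 2>&1; then
  mmseqs tar2db "$workdir/fcz_files.tar" "$workdir/my_fcz_db"
fi
```

The resulting database prefix (my_fcz_db here) is what you would then point the foldcomp Python module at.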

yakomaxa commented 1 year ago

@milot-mirdita I appreciate your quick response.

I tried the protocol you suggested and successfully created a custom database that is readable from Python. This helps a lot. Thank you very much for your help and nice work!

Also looking forward to this being integrated into foldcomp.

milot-mirdita commented 1 year ago

I'll leave this open to update you when we finish integrating database building.

yakomaxa commented 1 year ago

Yes, it's better to keep this open. I'll follow this thread to catch up on updates related to database building.

milot-mirdita commented 1 year ago

The latest git version contains code to build foldcomp databases directly within foldcomp.

You can pass the --db flag to the command line util to build a foldcomp database.

yakomaxa commented 1 year ago

Is the --db option used with foldcomp compress?

milot-mirdita commented 1 year ago

Yes, you can call:

foldcomp compress directory_with_pdb_or_mmcif_files foldcomp_db_name --db
foldcomp compress archive_with_pdb_or_mmcif_files.tar foldcomp_db_name --db

yakomaxa commented 1 year ago

Thank you for the examples. It worked nicely.

The current implementation seems to handle a directory of pdb/mmcif files or a tar archive of them. Would it be possible to extend this function to handle a set of fcz files as well?

For example, my current workflow looks like this:

  1. Scan a large foldcomp database, apply a filter, and dump the "hit" structures as individual fcz files.
  2. Build a smaller foldcomp database from the dumped fcz files. The temporary fcz files can be removed after the database is built.
  3. Iterate over the new, smaller database, apply a different filter, and dump the "hit" structures.
  4. Repeat steps 2 and 3.

At steps 1 and 3, the number of dumped structures can become very large (in my case, about 10% of the full database), so the format used for temporarily dumping structures should be lightweight. Building a foldcomp database directly from a set of fcz files would therefore save temporary disk space. As you suggested previously, MMseqs2 tar2db can create a foldcomp database from a tar archive of fcz files, but I think this functionality should be part of the foldcomp application itself.

By the way, I made a simple loader function to load fcz files directly into PyMOL. fcz is a very good format for saving space, and I hope more software will support loading fcz files directly. https://github.com/yakomaxa/load_fcz_PyMOL

milot-mirdita commented 1 year ago

> Thank you for the examples. It worked nicely.

Great!

> The current implementation seems to handle a directory of pdb/mmcif files or a tar archive of them. Would it be possible to extend this function to handle a set of fcz files as well?

Not yet, but we'll keep this in mind. We were busy getting the preprint ready. Should be out soon :)

> For example, my current workflow looks like this:
>
> 1. Scan a large foldcomp database, apply a filter, and dump the "hit" structures as individual fcz files.
> 2. Build a smaller foldcomp database from the dumped fcz files. The temporary fcz files can be removed after the database is built.
> 3. Iterate over the new, smaller database, apply a different filter, and dump the "hit" structures.
> 4. Repeat steps 2 and 3.

One trick you can use now is to look up the database keys in the database's .lookup file based on the accessions you filtered, then subset only the .index file to those keys and symlink all other files.

E.g. something like this:

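# Put all accessions you want in a file accessions.txt (one per line).
# Pass 1 (accessions.txt): remember the wanted accessions.
# Pass 2 (db.lookup): remember the keys (column 1) whose accession (column 2) is wanted.
# Pass 3 (db.index): print only the index entries for those keys.
awk 'FNR == 1 { ++findex } findex == 1 { f1[$1] = 1; next; } findex == 2 && $2 in f1 { f2[$1] = 1; next; } findex == 3 && $1 in f2 { print; next; }' accessions.txt db.lookup db.index > db_subset.index
# Symlink the data, .lookup and .dbtype files under the subset name.
for i in "" .lookup .dbtype; do
  ln -s "db$i" "db_subset$i"
done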
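
To see what the awk filter does, here is a small self-contained run on toy files. The file contents are invented for illustration, and the column layouts are assumptions: a tab-separated .lookup with key, accession and file number, and a tab-separated .index with key, offset and length. Only the key and accession columns matter to the filter itself.

```shell
# Toy demonstration of the index-subsetting trick; all data is made up.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Assumed .lookup layout: key <TAB> accession <TAB> file number
printf '0\tprotA\t0\n1\tprotB\t0\n2\tprotC\t0\n' > db.lookup
# Assumed .index layout: key <TAB> offset <TAB> length
printf '0\t0\t100\n1\t100\t120\n2\t220\t90\n' > db.index
# Accessions to keep, one per line.
printf 'protA\nprotC\n' > accessions.txt

# Same three-pass awk filter as above: accessions -> keys -> index entries.
awk 'FNR == 1 { ++findex } findex == 1 { f1[$1] = 1; next; } findex == 2 && $2 in f1 { f2[$1] = 1; next; } findex == 3 && $1 in f2 { print; next; }' accessions.txt db.lookup db.index > db_subset.index

# Only the entries for keys 0 and 2 (protA and protC) remain.
cat db_subset.index
```

Because only the .index file is rewritten, the subset costs almost no extra disk space: the structure data itself stays in the original database file and is reached through the symlinks.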

> By the way, I made a simple loader function to load fcz files directly into PyMOL. fcz is a very good format for saving space, and I hope more software will support loading fcz files directly. https://github.com/yakomaxa/load_fcz_PyMOL

Very cool! Thank you, we'll link that to the main readme.

yakomaxa commented 1 year ago

> We were busy getting the preprint ready. Should be out soon :)

Wow, I'm very excited to hear that! I believe Foldcomp and fcz will become the standard software and format in the era of massive structure databases.

Trick

Thank you for sharing the trick and the sample code. It will help a lot to improve my workflow.

loader

Thank you for considering linking that repo from the main foldcomp readme. I feel honored.

Anyway, thank you very much for all your support and for improving the software. I'm looking forward to reading your preprint!

khb7840 commented 1 year ago

Database handling is now available as of the v0.0.3 release.