steineggerlab / foldcomp

Compressing protein structures effectively with torsion angles
GNU General Public License v3.0

ESMFold database header issues #51

Open valentynbez opened 6 months ago

valentynbez commented 6 months ago

When I extract FASTA from highquality_clust30 I receive the following headers.

>ESMFOLD V0 PREDICTION FOR MGYP000138429313
>ESMFOLD V0 PREDICTION FOR MGYP001595280761
...

I use FoldComp for a downstream application. Under the usual FASTA convention, the sequence identifier is the first whitespace-delimited token of the header, so every sequence here gets the identifier ESMFOLD, which is not unique. The unique ID ends up in the description (comment) part of the header instead. I could run sed on the file, but that feels hacky. The entries in highquality_clust30.lookup look appropriate:

0       MGYP002174220927        0
1       MGYP000064029927        0
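
For reference, the sed-style workaround mentioned above could be sketched in Python roughly as follows. This is only a sketch under assumptions: it assumes every header ends with an accession of the form `MGYP` followed by digits, as in the two examples shown, and the function name `fix_headers` is hypothetical, not part of FoldComp. It moves the accession to the front of the header so that it becomes the sequence identifier and keeps the original title as the description.

```python
import re
from typing import Iterable, Iterator

# Assumption: the accession is the last token of the header and looks like
# "MGYP" followed by digits, as in the headers shown above.
ACCESSION = re.compile(r"(MGYP\d+)\s*$")


def fix_headers(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with the trailing accession moved to the front of each header."""
    for line in lines:
        if line.startswith(">"):
            match = ACCESSION.search(line)
            if match:
                # Keep the original title as the description after the new ID.
                title = line[1:].strip()
                yield f">{match.group(1)} {title}\n"
                continue
        # Sequence lines and headers without a recognizable accession pass through unchanged.
        yield line
```

It can be applied to an open file handle, for example by passing the input handle to `fix_headers` and writing the result with `writelines` on an output handle. Headers that do not match the assumed pattern are left untouched, so the output should still be checked for leftover duplicate identifiers.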

Do you have recommendations on how to get proper FASTA headers?

Cheers, V

khb7840 commented 1 month ago

Sorry for the late response. In 412c7a8 I've changed the default to use the id/filename when extracting sequences, and introduced a use-title flag for cases where the title is needed.