songlab-cal / tape

Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology.
https://www.biorxiv.org/content/10.1101/676825v1
BSD 3-Clause "New" or "Revised" License

Mapping to Pfam IDs #75

Open christophfeinauer opened 4 years ago

christophfeinauer commented 4 years ago

Hi,

First, thanks for creating this repo; it's really useful.

One question: it's not clear to me how I can get back to the original Pfam ID for a sequence from the LMDB databases. I want to do this because I need species annotations for a task.

Also, I did not find information on how the data was created (which part of Pfam, whether there is preprocessing, etc.). Is this documented somewhere and I just didn't see it?

thomas-a-neil commented 4 years ago

Hello,

Thanks for your interest in our repo!

To get the original Pfam ID, you'll unfortunately have to compare the residue sequences directly. If it is helpful, the mapping from Pfam index to Pfam family is on S3 at s3://proteindata/data/pfam/pfam_fams_public.pkl, which would allow you to restrict your search.

The process for creating our dataset is as follows: we downloaded Pfam-A.fasta from the Pfam 31 release (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam31.0/), shuffled it, and then split it into train/validation/test as described in our paper. So the id field in the LMDB doesn't correspond to any index in Pfam. We probably should have kept the Pfam ID around for the type of annotation you suggest, but since we didn't use it for model training, we dropped it. The original Pfam serialization script can be found in the deprecated TensorFlow repo: https://github.com/songlab-cal/tape-neurips2019/blob/master/tape/data_utils/pfam_protein_serializer.py

christophfeinauer commented 4 years ago

Thanks!

Are you sure that you used Pfam 31? There are a lot of sequences in the dataset that are not in Pfam 31, but all appear in Pfam 32.

Also, if you are interested, I can send you the mapping in case other people need it.

thomas-a-neil commented 4 years ago

Ah yes, thank you for the correction. It should be most similar to Pfam 32. We downloaded Pfam-A.fasta from the "current release" FTP link (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release) in March 2019. Pfam 32 had already been released in August 2018, and the last modification to Pfam 33 was in March 2020. If there are sequences that don't appear in Pfam 32, I would check Pfam 33.

And thanks for offering to send the mapping, that would be helpful to share with others!

psturmfels commented 4 years ago

+1 for the mapping to original Pfam IDs - I would be very interested in them!

christophfeinauer commented 4 years ago

Here it is

The columns are id | species | uniprot_id | pfam_id | start | end. The id is just the id in the lmdb files.
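For anyone who wants to look entries up by lmdb id, here is a minimal sketch of loading a mapping with these columns. The file layout is an assumption (tab-separated, no header row, columns in the order listed above), so adjust the delimiter to match the actual file. `MappingRow` and `load_mapping` are hypothetical names used only for illustration.

```python
import csv
from typing import Dict, Iterable, NamedTuple


class MappingRow(NamedTuple):
    """One row of the mapping: id | species | uniprot_id | pfam_id | start | end."""
    id: str
    species: str
    uniprot_id: str
    pfam_id: str
    start: int
    end: int


def load_mapping(lines: Iterable[str], delimiter: str = "\t") -> Dict[str, MappingRow]:
    """Index mapping rows by lmdb id.

    Assumes one record per line with exactly six fields and no header row.
    """
    rows: Dict[str, MappingRow] = {}
    for fields in csv.reader(lines, delimiter=delimiter):
        if not fields:  # skip blank lines
            continue
        rec_id, species, uniprot_id, pfam_id, start, end = (f.strip() for f in fields)
        rows[rec_id] = MappingRow(
            rec_id, species, uniprot_id, pfam_id, int(start), int(end)
        )
    return rows
```

With a file on disk this would be called as `load_mapping(open(path))`, where `path` is wherever the mapping was saved; for a pipe-separated file, pass `delimiter="|"`.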

psturmfels commented 4 years ago

This is awesome! Thank you! Out of curiosity, how did you link back to the pfam_ids? Did you actually just compare every literal sequence string between the tape dataset and the Pfam release?

christophfeinauer commented 4 years ago

Yes. I just parsed Pfam-A.fasta and mapped the sequences back to the lmdb files. With Pfam 32 there were no missing sequences. I also checked a random subset of the mapping manually and it looks good.

The script also creates a version of the lmdb databases that contains all the information about pfam mappings, species etc. I can share them if someone is interested (however, they are trivial to make with the mappings).
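The exact-match approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script that produced the shared mapping. All function names are hypothetical. The header layout (`>NAME_SPECIES/start-end ACCESSION.version PFxxxxx.version;family_name;`) is an assumption about Pfam-A.fasta. Reading the (id, sequence) pairs out of the lmdb files is left to the caller.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple, Optional, Tuple


class PfamHit(NamedTuple):
    species: str
    uniprot_id: str
    pfam_id: str
    start: int
    end: int


def parse_header(header: str) -> PfamHit:
    """Parse one FASTA header line (the layout is an assumption, see above)."""
    name_and_range, accession, family = header.lstrip(">").split()[:3]
    name, residue_range = name_and_range.rsplit("/", 1)
    start, end = residue_range.split("-")
    return PfamHit(
        species=name.rsplit("_", 1)[-1],              # mnemonic after the last "_"
        uniprot_id=accession.split(".")[0],           # drop the version suffix
        pfam_id=family.split(";")[0].split(".")[0],   # drop version and family name
        start=int(start),
        end=int(end),
    )


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining sequences wrapped over lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def build_index(fasta_lines: Iterable[str]) -> Dict[str, List[PfamHit]]:
    """Map each literal sequence string to every Pfam entry that carries it."""
    index: Dict[str, List[PfamHit]] = defaultdict(list)
    for header, sequence in read_fasta(fasta_lines):
        index[sequence.upper()].append(parse_header(header))
    return index


def map_records(
    records: Iterable[Tuple[str, str]], index: Dict[str, List[PfamHit]]
) -> Dict[str, List[PfamHit]]:
    """Look up each (lmdb id, sequence) pair; unmatched ids map to an empty list."""
    return {rec_id: index.get(seq.upper(), []) for rec_id, seq in records}
```

Keeping a list of hits per sequence means that a sequence occurring in more than one Pfam entry stays visible as ambiguous instead of being silently overwritten, and an empty list flags a sequence missing from the release being checked. Holding the whole of Pfam-A.fasta in one dictionary may need a lot of memory.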