mheinzinger / ProstT5

Bilingual Language Model for Protein Sequence and Structure
MIT License

Dataset version with IDs? #2

Closed: lhallee closed this issue 8 months ago

lhallee commented 9 months ago

Hello!

Great work! Is there a version of the dataset (or a way to get one) with an additional column for the UniProt ID or equivalent?

lhallee commented 8 months ago

Just following up about this @mheinzinger

mheinzinger commented 8 months ago

Hi :) thanks for your interest in our work and sorry for the delayed response (vacation :)). I put the IDs from our dataset splits here: https://rostlab.org/~deepppi/prostt5_IDs.tar.gz
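
In case it helps with replication, here is a minimal sketch for fetching and unpacking that archive (the file names inside the tarball aren't listed in this thread, so the snippet just prints whatever it contains):

```python
# Minimal sketch: download and unpack the ID archive linked above.
# The file layout inside the tarball is not documented in this thread,
# so we list the members before extracting.
import tarfile
import urllib.request

url = "https://rostlab.org/~deepppi/prostt5_IDs.tar.gz"
archive = "prostt5_IDs.tar.gz"

urllib.request.urlretrieve(url, archive)

with tarfile.open(archive, "r:gz") as tar:
    print(tar.getnames())          # see which split files are included
    tar.extractall("prostt5_IDs")  # unpack into a local directory
```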

lhallee commented 8 months ago

Thanks so much!!

lhallee commented 8 months ago

Hey @mheinzinger. Do you know why there is a difference between the number of IDs on Hugging Face and the number of IDs in the tar file? [screenshot attached]

mheinzinger commented 8 months ago

Ah, well, sorry for the confusion. tl;dr: take the IDs from the link above as reference.

longer: for the validation set, the difference originates from how we prepared the dataset. We took an existing dataset that was already filtered by sequence and structural similarity. After performing some quality filtering (pLDDT, length, 3Di-token diversity), we ended up with a set that appeared relatively small. In order to hit a sweet spot between diversity and quantity, we decided to expand the structural clusters again by their 20 most diverse members (only allowing those that passed the same quality filtering as before). To avoid information leakage between test/val and train, we moved only whole clusters to the respective split. In the worst case, however, this means we could end up with relatively similar proteins within test/val (for train this effect is most likely marginal). As we did not want to artificially inflate our performance report, we decided to drop those cluster members (the most diverse 20) again for val/test.

However, the huggingface dataset was created at a point when we were still discussing whether this was actually needed for the validation set. All final decisions/numbers in the paper were made on the reduced (L=474) val and test sets, so the link above is the one to use for replicating numbers etc. We'll fix the validation set on huggingface as well - thanks for reporting!

On the train set: I am sorry, but I really cannot remember why there should be a difference there. They should be identical. Given that this only affects 34 out of ~17M proteins, though, I would hope it does not have any measurable effect on any decision/downstream task.

Best, Michael

lhallee commented 8 months ago

Thanks for the detailed response! I'm really only interested in using the train set for some training of my own, and it would be great to have known annotations for the sequences via the IDs. I will do some analysis to see whether the lengths of the sequences match up with the IDs. Hopefully it's just 34 cut off from the end; if the mismatch is somewhere in the middle, that would be more problematic. I'll report back if I find anything worth mentioning! Thanks again, Logan
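
A rough sketch of that check, with placeholder file names since the exact layout of the extracted ID lists and the local sequence dump isn't known here:

```python
# Rough sketch of the consistency check: compare the number of train
# sequences with the number of train IDs from the tar file. Both paths
# are placeholders -- adjust them to the files you actually have.
def read_lines(path):
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]

train_ids = read_lines("prostt5_IDs/train_ids.txt")  # placeholder name
train_seqs = read_lines("train_sequences.txt")       # placeholder: one sequence per line

print(len(train_ids), len(train_seqs))
print("difference:", len(train_ids) - len(train_seqs))  # ~34 expected per the comment above
```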

lhallee commented 8 months ago

Hey @mheinzinger,

The structure IDs seem off. The max token is 49, meaning they all decode to extra_ids. If you add 98 to all the tokens (the difference between 49 and the max FoldSeek token, 147), they all seem to print out correctly. So I'm not sure whether they somehow got shifted before being pushed to huggingface, but as they are, the tokenizer in the ProstT5 repo decodes all the structure IDs to extra_ids. Best, Logan
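
For reference, a sketch of that offset experiment, using the tokenizer from the ProstT5 model card; the `structure_ids` list is a made-up example, substitute a row from the dataset:

```python
# Sketch of the offset experiment: shift every structure token ID by 98
# (147, the max 3Di token ID, minus 49) and decode with the ProstT5 tokenizer.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Rostlab/ProstT5", do_lower_case=False)

structure_ids = [3, 17, 42, 49]            # hypothetical IDs taken from the dataset
print(tokenizer.decode(structure_ids))     # as reported above, these come out as <extra_id_*>

shifted = [i + 98 for i in structure_ids]  # apply the offset described above
print(tokenizer.decode(shifted))           # reportedly prints proper 3Di tokens
```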

lhallee commented 8 months ago

Ah, please disregard. I see the pretraining data is set up for adrianhenkel/lucid-prot-tokenizer.
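
For completeness, a sketch of decoding with that tokenizer instead; this assumes adrianhenkel/lucid-prot-tokenizer loads via AutoTokenizer, and the example IDs are hypothetical:

```python
# Sketch: decode structure IDs with the tokenizer the pretraining data was
# actually encoded with. Assumes the repo loads via AutoTokenizer.
from transformers import AutoTokenizer

lucid_tok = AutoTokenizer.from_pretrained("adrianhenkel/lucid-prot-tokenizer")

structure_ids = [3, 17, 42, 49]         # hypothetical IDs from the dataset
print(lucid_tok.decode(structure_ids))  # no manual offset should be needed here
```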

mheinzinger commented 8 months ago

Thanks for the update, and you are perfectly right: this is confusing. We'll try to clarify/fix it with the next iteration, because this is the opposite of straightforward. Sorry if this cost you any time.

lhallee commented 8 months ago

No worries!! Really excited to see what I can get out of the data. Take care.

mheinzinger commented 7 months ago

Brief update: thanks to the amazing @mainpyp/Adrian, we now have an updated/fixed version of the ProstT5 dataset that also uses the correct tokenizer: https://huggingface.co/datasets/Rostlab/ProstT5Dataset. I'll update the other links/pre-print with the next iteration of the manuscript, but maybe the fixed version already helps you.
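
A quick sketch for pulling the fixed dataset with the `datasets` library (the split name below is an assumption; check the dataset card for the exact layout):

```python
# Quick sketch: load the fixed dataset from the Hub and inspect its layout.
from datasets import load_dataset

ds = load_dataset("Rostlab/ProstT5Dataset")
print(ds)              # shows the available splits, columns, and sizes
print(ds["train"][0])  # first example, assuming a "train" split exists
```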