Dataset curation questions

Wangchentong commented 1 year ago

Expected Behavior

Hi guys, thanks a lot for making this amazing tool!

Here is my background:

I would like to use AFDB_Cluster to expand a structure database (the PDB dataset) and improve related tasks (structure prediction, protein scaffold generation, etc.).

Here is my purpose:

My situation is that AFDB_Cluster has 2.2 million non-singleton clusters. I want to scale this number down to 1/10 while keeping the structures that are most diverse relative to the PDB (dissimilar at both the sequence level and the structure level), but I don't know the most appropriate way to do this.

Here is my current plan:

Get high-quality clusters from the AFDB clusters, filtered by two values (nMem, avgPlddt):
- for dark clusters, I use the cutoff nMem > 3, avgPlddt > 80, which gives 77263 clusters;
- for non-dark clusters, I use the cutoff nMem > 5, avgPlddt > 85, which gives 164541 clusters.
I set different cutoffs because I observed a clear pLDDT and nMem gap between dark and non-dark proteins.
Compose a new dataset from the highest-pLDDT structure of each filtered cluster.
Use MMseqs2 with a minimum sequence identity of 0.4 to cluster this dataset into the final dataset, and sample evenly from each cluster when training the model.
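
The two-tier cutoff in the first step can be sketched as follows. This is a minimal illustration, assuming each cluster has already been loaded into a record carrying nMem, avgPlddt, and a dark flag; the `Cluster` class, its field names, and `filter_clusters` are hypothetical names, not part of Foldseek or the AFDB cluster files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    rep_id: str       # hypothetical identifier of the cluster representative
    n_mem: int        # number of cluster members (nMem)
    avg_plddt: float  # average pLDDT over the cluster (avgPlddt)
    is_dark: bool     # True for dark clusters


# Cutoffs quoted in the plan: (nMem must exceed, avgPlddt must exceed).
CUTOFFS = {
    True: (3, 80.0),   # dark clusters
    False: (5, 85.0),  # non-dark clusters
}


def passes_cutoff(cluster: Cluster) -> bool:
    """Apply the strict '>' cutoffs that match the cluster's dark flag."""
    min_mem, min_plddt = CUTOFFS[cluster.is_dark]
    return cluster.n_mem > min_mem and cluster.avg_plddt > min_plddt


def filter_clusters(clusters):
    """Keep only clusters that pass their dark or non-dark cutoff."""
    return [c for c in clusters if passes_cutoff(c)]
```

Keeping the cutoffs in one table makes it easy to try other thresholds without touching the filtering logic.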

Here are some points I hope to improve but don't know how:

With my current approach, the protein dissimilar from PDB is not biased more (it may be implicitly biased by the lower filter cutoff for dark proteins).
Sample only one structure each cluster may be not the best way, since there is some cluster have a lot of proteins.
Singleton clusters are discarded, and I don't know whether they would be useful.

I would very much appreciate any suggestions on my current plan! Love you!

yeojingi commented 1 year ago

Here are some clarifications and questions related to your proposed plan:

Your approach includes: 1) Applying specific criteria to filter AFDB clusters. 2) Identifying the proteins that are structurally and sequentially most divergent from the PDB, to use them in your downstream ML models.

I would like further clarification on:

What is the purpose of using MMseqs2 cluster in your final dataset configuration? Is this for your training model?
In your statement "the protein dissimilar from PDB is not biased more", could you please elaborate on what you mean by "biased"?
Regarding your question about singleton clusters - are you referring to those in your MMseqs2 clusters or those discarded during AFDB cluster formation?

For question 2 - "sample only one structure each cluster may be not the best way, since there is some cluster have a lot of proteins", it's worth mentioning that if representative structures are chosen from structurally clustered sets, these should already reflect the structural diversity of all set members.

If I understand correctly what you are trying to achieve, then I would propose the following:

Take all 2.2 million representatives and search against the PDB. Reject all representatives that have a match within the PDB with E-value < 0.00001. Sample up to 200k structures from this set.
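
The reject-then-sample step above can be sketched as follows. This is a minimal illustration, assuming the search hits against the PDB have already been parsed into `(query_id, target_id, evalue)` tuples; how the search is run and how its output is parsed is left out, and the function names are hypothetical.

```python
import random


def ids_with_pdb_match(hit_rows, evalue_cutoff=1e-5):
    """Return query IDs that have at least one PDB hit below the E-value cutoff."""
    return {query for query, _target, evalue in hit_rows if evalue < evalue_cutoff}


def sample_novel(rep_ids, hit_rows, max_n=200_000, evalue_cutoff=1e-5, seed=0):
    """Drop representatives with a significant PDB match, then sample up to max_n."""
    matched = ids_with_pdb_match(hit_rows, evalue_cutoff)
    # Sort so the result is reproducible for a fixed seed.
    novel = sorted(set(rep_ids) - matched)
    if len(novel) <= max_n:
        return novel
    return random.Random(seed).sample(novel, max_n)
```

A fixed seed keeps the sampled subset stable between runs, which helps when comparing models trained on the same selection.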

P.S.: We are about to update the AFDB clusters database; the current version has a minor error. I will let you know once it is uploaded. Apologies for that. You had better use the updated one.

martin-steinegger commented 1 year ago

@yeojingi thank you so much for helping with this issue.

@Wangchentong It seems you're on your way to building a diverse, non-redundant protein dataset from the AFDB_Cluster. Here are some suggestions based on a selection we did for training a model.

Expanding the clusters: Rather than sampling only the highest pLDDT structure of each filtered cluster as you currently do, consider expanding your clusters by taking the 20 most diverse members (or another suitable number based on your specific needs). This is doable with our filterresult command in foldseek using the 3Di sequences.

Removing low-quality and short predictions: Remove low-quality structure predictions (pLDDT<70), short proteins (length<30), or proteins with highly repetitive 3Di-strings. You might consider similar filtering criteria to ensure the quality of your dataset.
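
The quality filters above can be sketched as follows. This is a minimal illustration: the pLDDT and length cutoffs are the ones quoted above, but the thread does not say how "highly repetitive" 3Di strings are detected, so the measure used here (the fraction of the string taken up by its most common 3Di letter) and the 0.8 threshold are assumptions, and all function names are hypothetical.

```python
from collections import Counter


def most_common_fraction(tdi: str) -> float:
    """Fraction of the 3Di string occupied by its most frequent letter."""
    if not tdi:
        return 1.0  # treat an empty string as maximally repetitive
    return Counter(tdi).most_common(1)[0][1] / len(tdi)


def keep_structure(avg_plddt, length, tdi,
                   min_plddt=70.0, min_length=30, max_repeat_fraction=0.8):
    """Return True if a prediction passes all three filters."""
    if avg_plddt < min_plddt:   # low-quality prediction
        return False
    if length < min_length:     # short protein
        return False
    # Assumed repetitiveness test; swap in another measure if preferred.
    if most_common_fraction(tdi) > max_repeat_fraction:
        return False
    return True
```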

Singletons: Regarding your concerns about singleton clusters, you could consider including some of them in your dataset if they offer unique or valuable structural information that isn't represented in your non-singleton clusters. I assume most singletons are not structured though.

I hope any of this is helpful in your project. Good luck!

yeojingi commented 1 year ago

Please check the download link. The updated version fixed an error. Thank you!

Wangchentong commented 1 year ago

@yeojingi

Thank you so much yeojingi, I really appreciate your reply!

First, I want to respond to your clarification questions:
- Why use MMseqs2 to cluster the final dataset: it is a training strategy from the AF2 supplementary (1.2.5 Filtering). By clustering with MMseqs2, each training epoch can be seen as sampling evenly from the MMseqs2 clusters (at sequence identity 0.4), which gives proteins from small clusters more opportunity to be sampled than sampling evenly over proteins would.
- "the protein dissimilar from PDB is not biased more": what I mean by "biased" is that I use a lower cutoff for dark proteins (nMem > 3, avgPlddt > 80). If I understand correctly, a dark protein lacks annotation, which means it lacks solved structures and homologous sequences, which in turn means it is dissimilar from PDB proteins.
- The definition of singleton clusters: I mean those discarded during AFDB cluster formation. Considering the average pLDDT of the singleton structures, I guess these singletons are not suitable as an expansion of the experimentally solved structures.
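
The cluster-level sampling described in the first point can be sketched as follows. This is a minimal illustration of picking a cluster uniformly and then a member uniformly within it, not the exact AF2 training procedure; the cluster assignments are assumed to be available as `(cluster_id, member_id)` pairs and all names are hypothetical.

```python
import random
from collections import defaultdict


def build_cluster_index(assignments):
    """Group member IDs by cluster ID from (cluster_id, member_id) pairs."""
    clusters = defaultdict(list)
    for cluster_id, member_id in assignments:
        clusters[cluster_id].append(member_id)
    return dict(clusters)


def make_sampler(clusters, seed=0):
    """Return a draw() function that samples clusters evenly, then members evenly."""
    rng = random.Random(seed)
    cluster_ids = sorted(clusters)  # fixed order for reproducibility

    def draw():
        cluster_id = rng.choice(cluster_ids)
        return rng.choice(clusters[cluster_id])

    return draw
```

With this scheme a protein in a cluster of size k is drawn with probability 1 / (number of clusters × k), so members of small clusters are seen more often than under uniform sampling over proteins.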
Your proposal is quite good and I will try it; I will report my progress in this issue.

One of the things I am most concerned about is that clusters with a higher average pLDDT also seem to have more members (nMem), as in the figure in my first comment in this issue. This phenomenon reminds me of the consensus method in protein quality assessment (bad predictions vary, but there is only one right prediction). So if a cluster has a low nMem, should I drop it even though it has a high pLDDT? Or is using pLDDT as the only threshold good enough?

@martin-steinegger

Thank you, Martin! You are one of the researchers I admire most, and I finally have the opportunity to talk to you.

I think filterresult will be a great choice for selecting multiple proteins from one cluster. Would it also be a good solution for expanding the pdb40 clusters with the AF2 database? With the AF2 pdb40 clustering strategy I mentioned above, I see lots of clusters that contain only 1 or 2 members.

[Figure: pdb40_nMem_distribution]

I am working on a diffusion model for protein scaffold generation. Intuitively, diverse structures are more important than diverse sequences, since structure diffusion takes no input sequence, only structure! That's why I am so interested in exploring the AFDB cluster database to find structures that are diverse relative to the PDB.

So I am considering the same strategy (supplementary 1.2.5) mentioned above, but clustering proteins by Foldseek similarity rather than MMseqs2 sequence identity.

To summarize, I will try two things:

Reject similar proteins with Foldseek using an incremental E-value and see how things go.
Sample more sequences from each cluster using filterresult, and expand pdb40 with an AFDB Foldseek search (I guess this will catch some proteins discarded in the first step).

I will keep you posted on how things go; any suggestion will be appreciated and carefully considered.

Again, thank you @yeojingi! Thank you @martin-steinegger!

yeojingi commented 1 year ago

One of the things I am most concerned about is that clusters with a higher average pLDDT also seem to have more members (nMem), as in the figure in my first comment in this issue. This phenomenon reminds me of the consensus method in protein quality assessment (bad predictions vary, but there is only one right prediction). So if a cluster has a low nMem, should I drop it even though it has a high pLDDT? Or is using pLDDT as the only threshold good enough?

I think relying on pLDDT would be fine. The consensus method you mentioned is useful when a redundancy check doesn't exist, but here we have an innate quality checker, the pLDDT value. Other criteria can help remove redundant proteins and make your database cleaner, but how to interpret their effect is an open question and depends on the case. Bigger clusters tend to have better quality because the bigger the cluster, the more proteins AlphaFold can refer to. It is said that AF requires >30 good proteins in an MSA to predict a protein.

Wangchentong commented 1 year ago

@martin-steinegger Hi Martin, I would like to report a bug in filterresult.

I have a cluster database like this:

af_clusDB:
Size of the sequence database: 2746205
Size of the alignment database: 2746205
Number of clusters: 1011980

After clustering finished, I used foldseek filterresult to reduce the redundancy of each cluster with these commands:

foldseek filterresult ../afDB/afDB_uniprot50 ../afDB/afDB_uniprot50 af_clusDB af_clust_slimDB --diff 3
foldseek createtsv ../afDB/afDB_uniprot50 ../afDB/afDB_uniprot50 af_clust_slimDB af_clust_slim.tsv

The resulting af_clust_slim.tsv contains 1127851 proteins and 284719 clusters, far fewer than the original cluster database. This is not what I expected: I want to reduce the number of members per cluster, not the number of clusters.
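
One way to check such counts, and to list which clusters disappeared, is sketched below. This is a minimal illustration that assumes the TSV has the cluster representative in the first tab-separated column and the member in the second; that layout is an assumption about the createtsv output for a cluster database, and the function names are hypothetical.

```python
def cluster_stats(lines):
    """Return (number of clusters, number of distinct members) for a two-column TSV."""
    reps, members = set(), set()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        rep, member = line.split("\t")[:2]
        reps.add(rep)
        members.add(member)
    return len(reps), len(members)


def missing_representatives(before_lines, after_lines):
    """Return representatives present before filtering but absent afterwards."""
    def reps(lines):
        return {l.split("\t")[0] for l in lines if l.strip()}
    return reps(before_lines) - reps(after_lines)
```

Both functions accept any iterable of lines, so an open file handle for each TSV can be passed directly.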

steineggerlab / foldseek

Dataset curation questions #131
