marcellszi / rna3db

A dataset for training and benchmarking deep learning models for RNA structure prediction
MIT License

Missing RNA Families in JSON Files #5

Open Rungetf opened 2 months ago

Rungetf commented 2 months ago

Hi,

Could you please add the RNA families to the JSON files?

This would be helpful information for splitting off a validation set.

Thanks in advance!

marcellszi commented 2 months ago

Hi @Rungetf,

I haven't really considered adding the families to the JSON files, since the Tabular API can already be used to parse the Infernal homology search for the chains.

What do you mean by splitting a validation set? Is your goal to end up with structurally non-redundant training, testing and validation sets, instead of just the training and testing sets provided by RNA3DB's release? If so, there is no need to manually look at the families. The graph components can be used for this out of the box.

You can take components from cluster.json from rna3db-jsons.tar.gz (see Releases) and assign each of them to any one of the sets without risking data leakage.
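As a rough illustration of the suggestion above, here is a minimal Python sketch that assigns whole components to either the training or the validation set. The layout assumed for cluster.json (component name, then cluster name, then chain ID) is an assumption and should be checked against the actual file; `split_components` and the toy data are hypothetical names made up for this example, not part of RNA3DB.

```python
import random


def split_components(components, val_fraction=0.2, seed=0):
    """Assign whole components to train or validation, never splitting one.

    ASSUMPTION: `components` maps component name -> {cluster name -> {chain id -> ...}}.
    """
    def n_chains(clusters):
        # Count the chains in all clusters of one component.
        return sum(len(chains) for chains in clusters.values())

    names = sorted(components)
    random.Random(seed).shuffle(names)  # reproducible random order
    total = sum(n_chains(components[name]) for name in names)
    target = val_fraction * total  # upper bound on validation size

    train, val, val_size = {}, {}, 0
    for name in names:
        size = n_chains(components[name])
        if val_size + size <= target:
            val[name] = components[name]
            val_size += size
        else:
            train[name] = components[name]
    return train, val


# Toy stand-in for the real file. With the release one would instead do
# something like: components = json.load(open("cluster.json"))  (assumed layout).
toy = {
    "component_1": {"cluster_1": {"1abc_A": {}, "2def_B": {}},
                    "cluster_2": {"3ghi_A": {}}},
    "component_2": {"cluster_3": {"4jkl_A": {}}},
    "component_3": {"cluster_4": {"5mno_A": {}}},
}
train, val = split_components(toy, val_fraction=0.4)
print(sorted(train), sorted(val))
```

Because a component is only ever moved as a whole, no component can end up in both sets, which is the property that guards against leakage here.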

Rungetf commented 2 months ago

Hi @marcellszi,

First, thanks for the fast reply!

Yes, I wanted to keep the exact test set from the release (for reproducibility reasons) but split a validation set from the training data for hyperparameter optimization.

I first considered using the components for the split (which would make sense in terms of homology), but there are only a few components in the training set, and most of the sequences seem to be assigned to component_1. So, splitting by components does not seem to be the best way to get a validation set of reasonable size.

Since splitting by clusters does not make sense due to homologies, I wanted to at least consider the families for splitting.

I think the Tabular API is great, and I actually ended up using it, but it would have been nice to have all the information in one place.

However, since you mention in the paper that RNA3DB is specifically designed for training deep learning methods, it might be a good idea to think about splitting a validation set by components for the next release ;-) .

marcellszi commented 2 months ago

Hi again @Rungetf,

Sorry about taking so long to respond, and thanks again for showing interest in RNA3DB.

I would strongly caution against just splitting by families. The point of RNA3DB components is precisely to avoid the issues caused by naively dividing the families between sets. There are, for example, many Rfam families that share homology. You can see our manuscript for more details.

> So, splitting by components does not seem to be the best way to get a validation set of reasonable size.

There are 217 sequences outside of component_1 in the training set. This is roughly 13% of the entire data set. I figured this should be enough.
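A count like the one above can be reproduced with a few lines of Python. This is only a sketch on toy data: the nested layout (component, then cluster, then chain ID) is an assumption about the JSON files, and `fraction_outside` is a hypothetical helper, not an RNA3DB function.

```python
def fraction_outside(components, big="component_1"):
    """Return (number of chains outside `big`, that number as a fraction of all chains).

    ASSUMPTION: `components` maps component name -> {cluster name -> {chain id -> ...}}.
    """
    sizes = {
        name: sum(len(chains) for chains in clusters.values())
        for name, clusters in components.items()
    }
    total = sum(sizes.values())
    outside = total - sizes.get(big, 0)
    return outside, outside / total


# Toy data only; the real numbers come from the released JSON files.
toy_components = {
    "component_1": {"cluster_1": {"1abc_A": {}, "2def_B": {}},
                    "cluster_2": {"3ghi_A": {}}},
    "component_2": {"cluster_3": {"4jkl_A": {}}},
    "component_3": {"cluster_4": {"5mno_A": {}}},
}
outside, frac = fraction_outside(toy_components)
print(outside, frac)
```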

How much of the entire dataset would you like? Something like 20%? You could probably achieve that by picking a more stringent E-value threshold, which would result in more components. Although note that this will increase the chance of data leakage.

If you wanted to avoid changing the test set, you could even re-build the graph with RNA3DB for just the training set at the lower threshold. This way you may marginally increase the chance of data leakage between the validation and training sets, but not between the training/validation and test sets.
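To see why a more stringent E-value threshold tends to produce more components, consider the toy sketch below. It is not RNA3DB's graph construction: the nodes, the pairwise hits, and the helper `components_at_threshold` are all invented for illustration, and the connected components are found with a plain union-find.

```python
def components_at_threshold(nodes, hits, max_evalue):
    """Connected components when only hits with E-value <= max_evalue form edges."""
    parent = {n: n for n in nodes}

    def find(x):
        # Find the representative of x, compressing the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, evalue in hits:
        if evalue <= max_evalue:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=sorted)


# Hypothetical homology hits: (node, node, E-value).
nodes = ["A", "B", "C", "D"]
hits = [("A", "B", 1e-10), ("B", "C", 1e-2), ("C", "D", 1e-5)]

permissive = components_at_threshold(nodes, hits, max_evalue=1.0)
stringent = components_at_threshold(nodes, hits, max_evalue=1e-3)
print(len(permissive), len(stringent))
```

In this toy case the weak B-C hit is the only link between two groups, so dropping it under the stricter threshold splits one component into two. Any real homology behind that weak hit is then no longer accounted for, which is the increased leakage risk mentioned above.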

> However, since you mention in the paper that RNA3DB is specifically designed for training deep learning methods, it might be a good idea to think about splitting a validation set by components for the next release ;-) .

I agree that being able to also create validation sets out-of-the-box is a good idea. I will include it in the next release (coming very soon).

Rungetf commented 2 months ago

Hi @marcellszi,

I completely agree that splitting by families only is suboptimal.

However, with the limited amount of training data, there is a strong trade-off between keeping as many training components as possible and getting the best signal for non-homologous samples during hyperparameter optimization (HPO). Taking a lot of components out of the training set strongly reduces the diversity of the training samples (what matters here is mainly the number of components, not the number of sequences). So, for now, I would still split by families for the validation set, hoping to at least get some signal. Also, data leakage between the validation and training sets is less of an issue as long as the validation set is non-homologous with the test set.

That said, I very much appreciate that you consider building a validation set for the next release!

Actually, to keep as much training data as possible, I would probably rather reduce the size of the test set a bit in favor of a validation set than take components from the training set, but that's just my personal opinion, of course.