vagarwal87 / saluki_paper

Saluki, a method to predict mRNA half-lives from sequence
Apache License 2.0

how to get train data #10

Closed wawpaopao closed 7 months ago

wawpaopao commented 1 year ago

I have noticed the Zenodo dataset, but I don't know how to use it. I just want the RNA sequence data and the labels.

davek44 commented 1 year ago

Hi, the Zenodo dataset is a ZIP file. Download and unzip it to reveal the directory structure. The directory train_gru/ contains a subdirectory for every fold. Within each of those, there are two directories, data0/ and data1/, corresponding to human and mouse. Each of them contains a table, genes.tsv, with the gene names and half-life measurements, plus tfrecords with the exact RNA sequences that we used and their half-life measurements.
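For readers who want to script this, here is a minimal sketch of walking the unzipped archive and collecting every genes.tsv. It assumes the layout described above (train_gru/, then a fold directory, then data0/ or data1/) and that data0 is human and data1 is mouse, in that order. The function names `find_gene_tables` and `read_rows` are hypothetical helpers, not part of the Saluki code, and the sketch makes no assumption about the column layout of genes.tsv beyond it being tab-separated.

```python
import csv
from pathlib import Path

# Assumption taken from the thread: data0/ holds human, data1/ holds mouse.
SPECIES_BY_DIR = {"data0": "human", "data1": "mouse"}


def find_gene_tables(train_dir):
    """Return (fold_dir_name, species, path) for every genes.tsv under train_dir."""
    found = []
    for path in sorted(Path(train_dir).rglob("genes.tsv")):
        data_dir = path.parent.name          # e.g. "data0"
        fold_dir = path.parent.parent.name   # e.g. "f0_c0"
        species = SPECIES_BY_DIR.get(data_dir, data_dir)
        found.append((fold_dir, species, path))
    return found


def read_rows(path):
    """Read a tab-separated file into a list of rows (each a list of strings)."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))
```

Calling `find_gene_tables("train_gru")` from the unzipped directory and printing the first few rows of one table is a quick way to check which columns hold the gene name and the half-life before writing anything more specific.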

wawpaopao commented 1 year ago

Thanks! So what's the difference between f0_c0 and f0_c1? Is it the training process?

wawpaopao commented 1 year ago

And the mapping is {"A": 0, "C": 1, "G": 2, "T": 3}, right?

davek44 commented 1 year ago

The 'c' numbers are just technical replicates using the same train/test split. So f0_c0 and f0_c1 use exactly the same train/test split, but are different stochastic training runs starting from different random initializations.
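As an illustration of that naming, the following sketch groups directory names by fold so that replicates sharing a split end up together. It assumes every run directory follows the `f<fold>_c<replicate>` pattern seen in f0_c0 and f0_c1; only those two names appear in this thread, so the general pattern is a guess, and `parse_run_name` and `group_replicates` are hypothetical helpers.

```python
import re
from collections import defaultdict

# Assumed pattern, generalized from the names f0_c0 and f0_c1.
RUN_NAME = re.compile(r"^f(\d+)_c(\d+)$")


def parse_run_name(name):
    """Return (fold, replicate) as integers, or None if the name does not match."""
    match = RUN_NAME.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def group_replicates(names):
    """Map each fold index to the sorted list of replicate indices found for it."""
    groups = defaultdict(list)
    for name in names:
        parsed = parse_run_name(name)
        if parsed is not None:
            fold, replicate = parsed
            groups[fold].append(replicate)
    return {fold: sorted(reps) for fold, reps in sorted(groups.items())}
```

Since replicates of one fold share the same data, reading the sequences and labels from a single replicate per fold should be enough if only the data is needed.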

Yes, the nucleotide-integer mappings are correct.
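A small sketch of using that mapping in both directions, for anyone converting between sequence strings and integer arrays. The helper names `encode` and `decode` are hypothetical; how the tfrecords themselves store the sequence is not covered here, and this sketch simply rejects any character outside A, C, G, T rather than guessing how such characters were handled.

```python
# Mapping confirmed in the thread.
NUCLEOTIDE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}
INT_TO_NUCLEOTIDE = {value: key for key, value in NUCLEOTIDE_TO_INT.items()}


def encode(sequence):
    """Convert a nucleotide string to a list of integers."""
    try:
        return [NUCLEOTIDE_TO_INT[base] for base in sequence.upper()]
    except KeyError as error:
        raise ValueError(f"unsupported character in sequence: {error.args[0]!r}")


def decode(indices):
    """Convert a list of integers back to a nucleotide string."""
    return "".join(INT_TO_NUCLEOTIDE[index] for index in indices)


print(encode("ACGT"))        # [0, 1, 2, 3]
print(decode([3, 2, 1, 0]))  # TGCA
```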