kuleshov-group / caduceus

Bi-Directional Equivariant Long-Range DNA Sequence Modeling
Apache License 2.0

question about training lengths #24

Closed leannmlindsey closed 2 months ago

leannmlindsey commented 2 months ago

I noticed in genomic_benchmark.yaml that you have a different train_len for each task. I was just wondering why there are such wide differences in the training lengths. Is it just the maximum value in the dataset?

For example:

```yaml
dummy_mouse_enhancers_ensembl:
  train_len: 1210
  classes: 2
  max_length: 1024
demo_coding_vs_intergenomic_seqs:
  train_len: 100_000
  classes: 2
  max_length: 200
demo_human_or_worm:
  train_len: 100_000
  classes: 2
  max_length: 200
human_enhancers_cohn:
  train_len: 27791
  classes: 2
  max_length: 500
human_enhancers_ensembl:
  train_len: 154842
  classes: 2
  max_length: 512
human_ensembl_regulatory:
  train_len: 289061
  classes: 3
  max_length: 512
human_nontata_promoters:
  train_len: 36131
  classes: 2
  max_length: 251
human_ocr_ensembl:
  train_len: 174756
  classes: 2
  max_length: 512
phage_classification:
  train_len: 4000
  classes: 2
  max_length: 4000
```
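For reference, a minimal sketch of how a per-task block like the one above could be read programmatically with OmegaConf. The file path and the assumption that the per-task mapping is the top level of the file are mine; the repo's actual config structure and dataloading code may differ.

```python
from omegaconf import OmegaConf

# Assumed path to the dataset config; adjust to your checkout.
cfg = OmegaConf.load("configs/dataset/genomic_benchmark.yaml")

task = cfg.human_enhancers_cohn
print(task.train_len)   # number of training sequences (27791 above)
print(task.classes)     # number of target classes
print(task.max_length)  # per-sequence length limit for the task
```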

leannmlindsey commented 2 months ago

Sorry, I realized from looking at the config files that train_len is the number of sequences in the dataset, not a sequence length.
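A minimal sketch of how one could check that train_len matches the number of training sequences for a task. The download location and the one-sequence-per-file, per-class-folder layout are assumptions about how the Genomic Benchmarks data is stored locally, not something confirmed by this thread.

```python
from pathlib import Path

# Assumed local copy of a Genomic Benchmarks dataset; the actual
# download location and folder layout depend on how the data was fetched.
DATA_ROOT = Path.home() / ".genomic_benchmarks" / "human_enhancers_cohn" / "train"


def count_train_sequences(root: Path) -> int:
    """Count one-sequence-per-file samples across all class subfolders."""
    return sum(1 for _ in root.glob("*/*.txt"))


if __name__ == "__main__":
    n = count_train_sequences(DATA_ROOT)
    # Expect this to match train_len in the config (27791 for human_enhancers_cohn).
    print(f"train sequences found: {n}")
```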