kuleshov-group / caduceus

Bi-Directional Equivariant Long-Range DNA Sequence Modeling
Apache License 2.0

What does the length extension mean for pre-training dataset? #6

Closed KatarinaYuan closed 3 months ago

KatarinaYuan commented 3 months ago

Hi, the paper describes the pre-training data preprocessing as follows: "The training split comprises 34,021 segments that we extend to a maximum length of 1,048,576 (2^20)". Could you please provide more details on how this "extension" is done?

yair-schiff commented 3 months ago

The sequences are listed in a .bed file that gives the chromosome and the start/end indices of each interval to be retrieved from a FASTA file. We extend the end index as follows: https://github.com/kuleshov-group/caduceus/blob/6ebb434f55b2ee5dab6e6f55c41581b113786dd9/src/dataloaders/datasets/hg38_dataset.py#L144
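
For a concrete picture, here is a minimal sketch of what extending the end index of a .bed interval could look like. It is an illustration under assumptions, not the code at the linked line: the helper names `parse_bed_line` and `extend_interval` are hypothetical, and the rule used here (grow the end until the interval spans `max_length`, clamped to the chromosome length) is only one plausible reading of "extend to a maximum length". The exact rule is whatever the linked line does.

```python
MAX_LENGTH = 2 ** 20  # 1,048,576 base pairs


def parse_bed_line(line):
    """Parse the first three tab-separated .bed columns (hypothetical helper)."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return chrom, int(start), int(end)


def extend_interval(start, end, chrom_length, max_length=MAX_LENGTH):
    """Return (start, new_end) with the end index pushed to the right.

    Assumed rule: keep the start fixed and move the end so that the
    interval covers max_length positions, without running past the end
    of the chromosome. Intervals that are already long enough are
    returned unchanged.
    """
    if end - start >= max_length:
        return start, end
    new_end = min(start + max_length, chrom_length)
    return start, new_end


# Example with made-up coordinates and a made-up chromosome length.
chrom, start, end = parse_bed_line("chr1\t1000\t132072\ttrain")
start, end = extend_interval(start, end, chrom_length=5_000_000)
```

With these made-up numbers the interval grows from 131,072 positions to 1,048,576. Shorter windows can then be sliced out of the extended interval at load time.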

Please let me know if anything is still unclear.