Closed by KatarinaYuan 3 months ago
The sequences are listed in a .bed file that gives the chromosome and the start/end indices of each interval to be retrieved from a FASTA file. We extend the end index as follows: https://github.com/kuleshov-group/caduceus/blob/6ebb434f55b2ee5dab6e6f55c41581b113786dd9/src/dataloaders/datasets/hg38_dataset.py#L144
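To make the idea concrete, here is a minimal sketch of one way an interval's end index could be extended. It is not copied from the repository: the helper names `parse_bed_line` and `extend_interval`, the clipping to the chromosome length, and the choice to never shrink an interval are all assumptions for illustration. The linked line remains the authoritative reference.

```python
# Illustrative sketch only: helper names and clipping behavior are assumptions,
# not the repository's actual implementation.

MAX_LENGTH = 2 ** 20  # 1,048,576, the maximum segment length quoted from the paper


def parse_bed_line(line):
    """Read the first three BED columns: chromosome, start, end."""
    chrom, start, end = line.split()[:3]
    return chrom, int(start), int(end)


def extend_interval(start, end, chrom_length, max_length=MAX_LENGTH):
    """Push the end index out so the interval spans up to max_length bases.

    The new end is clipped to the chromosome length (assumed behavior) and
    is never moved to the left of the original end.
    """
    new_end = max(end, start + max_length)
    new_end = min(new_end, chrom_length)
    return start, new_end


# Example with a made-up BED row and a made-up chromosome length.
chrom, start, end = parse_bed_line("chr1\t1000\t5000")
start, end = extend_interval(start, end, chrom_length=10_000_000)
print(chrom, start, end, end - start)
```

In this sketch the start index is left untouched and only the end moves, which matches the description above of extending the end index.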
Please let me know if anything is still unclear.
Hi, in the paper, the pre-training data preprocessing is described as: "The training split comprises 34,021 segments that we extend to a maximum length of 1,048,576 (2^20)". Could you please provide more details on how this "extension" is done?