instadeepai / nucleotide-transformer

🧬 Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics
https://www.biorxiv.org/content/10.1101/2023.01.11.523679v2

The length of input sequence for SegmentNT #65

Closed by hezt 4 months ago

hezt commented 4 months ago

Hello Team,

I'm trying to use your model to predict splice sites on custom sequences. Could you please share whether there are any limits on the input sequence, such as length or required context? For example, SpliceAI needs 5,000 bp of context on each side. Do you have a similar requirement?

Thanks, Zitong

dallatt commented 4 months ago

Hello @hezt,

The SegmentNT model we released was trained on sequences of 30,000 bp and evaluated on sequence lengths from 10 kbp to 100 kbp in 10 kbp steps. The resulting performance is shown in Fig. 3, panel d. There is no hard limit or constraint on the input sequence length, other than accelerator memory, which will eventually run out. However, the figure shows that the model reaches its best performance when evaluated on sequences of 50,000 bp, so I would advise using this sequence length to get optimal results!

Hope this helps, Hugo
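
For custom sequences longer than the suggested 50,000 bp, one possible way to follow this advice is to cut the input into fixed-length windows before running the model. The snippet below is only a minimal sketch in plain Python: `iter_windows` and its `window` and `stride` parameters are hypothetical names introduced here for illustration, not part of the nucleotide-transformer package, and no SegmentNT call is shown. Any tokenizer or model constraints on the exact length should be checked against the repository documentation.

```python
from typing import Iterator, Optional, Tuple


def iter_windows(
    sequence: str,
    window: int = 50_000,          # length suggested in the reply above
    stride: Optional[int] = None,  # hypothetical knob; defaults to non-overlapping windows
) -> Iterator[Tuple[int, str]]:
    """Yield (start, subsequence) pairs that together cover `sequence`.

    Every window has exactly `window` bases, except when the whole input
    is shorter than `window`, in which case it is returned unchanged.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    stride = window if stride is None else stride
    if not 0 < stride <= window:
        raise ValueError("stride must be in the range 1..window")

    n = len(sequence)
    if n <= window:
        yield 0, sequence
        return

    start = 0
    while start + window < n:
        yield start, sequence[start:start + window]
        start += stride

    # Anchor the final window to the end of the sequence so it is full
    # length; it may overlap the previous window.
    yield n - window, sequence[n - window:]


if __name__ == "__main__":
    toy = "ACGT" * 30_000  # 120,000 bp toy sequence
    for start, chunk in iter_windows(toy):
        print(start, len(chunk))
```

Keeping the start offset with each window makes it possible to map per-position predictions back to coordinates in the original sequence. Where windows overlap, how to combine the predictions (for example by averaging) is a design choice this sketch leaves open.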