google-research / bigbird

Transformers for Longer Sequences
https://arxiv.org/abs/2007.14062
Apache License 2.0

Pre-trained model for genomic sequences #2


ptynecki commented 3 years ago

Good morning,

Thank you for sharing the paper, code and pre-trained model for NLP text data. Your results are impressive. Because I am developing embedding solutions for genes and proteins, the application to genomic sequences interests me the most.

Is there any chance to try the BigBird nucleotide-based pre-trained model for research purposes? I would like to include it in my benchmark and compare it with existing non-contextual embeddings (Word2Vec, FastText and GloVe).

Regards, Piotr

manzilz commented 3 years ago

Hi Piotr,

Thanks for your interest in our work. We are working on releasing the model pretrained on DNA fragments.

Thanks!

project-delphi commented 3 years ago

Might we get the code for genome pretraining, as well as the pretrained network weights themselves, please?

jonas27 commented 3 years ago

Hi, any update on this?

imanmal1k commented 2 years ago

Greetings, manzilz

I also work on nucleotide-based language models and would appreciate it if you could release a pretrained model for me to use as a benchmark.

Thanks a lot!

FAhtisham commented 2 years ago

Hi,

Any update on the release?

ItamarChinn commented 1 year ago

@manzilz any updates on the release? 😃

cbirchsy commented 1 year ago

Any update on this? It would be very useful for embedding DNA/RNA sequences.

bbpxq commented 1 year ago

I'm absolutely sure that they DON'T HAVE ANY PLANS to release DNA models.

yurakuratov commented 1 year ago

We have replicated BigBird pre-training on the more recent T2T human genome assembly. The model is available via HuggingFace: https://huggingface.co/AIRI-Institute/gena-lm-bigbird-base-t2t. Any kind of feedback is welcome!
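
For anyone wanting to try it, here is a minimal sketch of loading the checkpoint with the HuggingFace `transformers` library and extracting a fixed-size sequence embedding. Only the model name comes from the comment above; the `trust_remote_code=True` flag and the mean-pooling step are assumptions on my part (GENA-LM checkpoints may ship custom modeling code), so check the model card for the authors' recommended usage.

```python
# Minimal sketch: load gena-lm-bigbird-base-t2t and mean-pool hidden states
# into one embedding per sequence. trust_remote_code=True is an assumption;
# it is a no-op if the checkpoint uses only standard transformers classes.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "AIRI-Institute/gena-lm-bigbird-base-t2t"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

# Example nucleotide sequence (hypothetical); the tokenizer applies its own
# subword vocabulary over the raw A/C/G/T string.
sequence = "ATGGCGTACGTAGCTAGCTGATCGATCGTACGATCG"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden states into a single vector, comparable to
# non-contextual baselines such as Word2Vec, FastText or GloVe.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # (1, hidden_size)
```

Note that BigBird's sparse attention only kicks in for long inputs; for short sequences like the one above, transformers typically falls back to full attention with a warning, which is harmless for a quick test.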