jerryji1993 / DNABERT

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome
https://doi.org/10.1093/bioinformatics/btab083
Apache License 2.0

How many sequences were used for pre-training? #25

Closed: hjgwak closed this issue 3 years ago

hjgwak commented 3 years ago

Hi,

I'm trying to use this model for microbiome data. As a practice run, I pre-trained a model on a small dataset (10K sequences extracted from virus genomes). Unfortunately, a model fine-tuned from that pre-trained model performs extremely poorly, like a random model.

Can you tell me how many sequences were used for pre-training?

Zhihan1996 commented 3 years ago

Hi,

We generate the training data by randomly sampling from the human genome. The total length of the training data is roughly 5 times the length of the entire human genome.
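A minimal sketch of this kind of random sampling (the length bounds, the coverage-based stopping rule, and the "N" filter are illustrative assumptions, not details confirmed in this thread):

```python
import random

def sample_training_data(genome: str, coverage: float = 5.0,
                         min_len: int = 50, max_len: int = 510):
    """Randomly sample subsequences from a genome string until the
    total sampled length reaches `coverage` times the genome length.

    min_len/max_len are illustrative; DNABERT's input limit is 512
    tokens, so sequences much longer than ~510 bp would not fit in
    a single input after k-mer tokenization.
    """
    target = coverage * len(genome)
    total = 0
    samples = []
    while total < target:
        length = random.randint(min_len, max_len)
        start = random.randint(0, len(genome) - length)
        seq = genome[start:start + length]
        if "N" in seq:  # skip assembly gaps / ambiguous bases
            continue
        samples.append(seq)
        total += length
    return samples
```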

hjgwak commented 3 years ago

Hi,

Can I get the exact number of sequences used for training?

For instance, the provided example_6_3k.txt includes 3,000 sequences.

JY-97 commented 3 years ago

> Hi,
>
> I'm trying to use this model for microbiome data. As a practice run, I pre-trained a model on a small dataset (10K sequences extracted from virus genomes). Unfortunately, a model fine-tuned from that pre-trained model performs extremely poorly, like a random model.
>
> Can you tell me how many sequences were used for pre-training?

Which fine-tuning task did you choose? I guess the length of the fine-tuning sequences and the preprocessing steps matter a lot.

Zhihan1996 commented 3 years ago

> Hi,
>
> Can I get the exact number of sequences used for training?
>
> For instance, the provided example_6_3k.txt includes 3,000 sequences.

Sorry, I don't remember the exact number. It should be around 5 million sequences.

hjgwak commented 3 years ago

@Zhihan1996 Thanks for your answer.

I have one last question. Did you sample DNA sequences from one strand or both? In other words, were reverse-complement subsequences included in the training dataset?

jerryji1993 commented 3 years ago

Hi @hjgwak,

We have used both strands. Let me know if there are additional questions.

Thanks, Jerry
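For illustration, a minimal sketch of including both strands via reverse complement (the helper and variable names are hypothetical, not taken from the DNABERT codebase):

```python
# Map each base to its complement; N (ambiguous) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

# Doubling a sampled dataset with the opposite strand:
sequences = ["ATCGGA", "TTGACC"]  # toy examples
both_strands = [s for seq in sequences
                for s in (seq, reverse_complement(seq))]
# both_strands == ["ATCGGA", "TCCGAT", "TTGACC", "GGTCAA"]
```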

terry-r123 commented 1 year ago

Hi @hjgwak and @Zhihan1996, thanks for your interesting work on DNABERT.

Could the preprocessed pre-training data be made available in the same file format as example_6_3k.txt?
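For reference, DNABERT's pre-training files hold one sequence per line, tokenized into space-separated overlapping k-mers (k = 6 for example_6_3k.txt). A minimal sketch of writing that format, with a hypothetical output file name and toy sequences:

```python
def seq_to_kmers(seq: str, k: int = 6) -> str:
    """Tokenize a DNA sequence into space-separated overlapping k-mers,
    the per-line format used by DNABERT's pre-training data files."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

sequences = ["ATCGATTGCA", "GGATTCAGTC"]  # toy examples

with open("my_pretrain_6mer.txt", "w") as f:
    for seq in sequences:
        f.write(seq_to_kmers(seq, k=6) + "\n")

# seq_to_kmers("ATCGATTG") -> "ATCGAT TCGATT CGATTG"
```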