Closed: hjgwak closed this issue 3 years ago.
Hi,
We do random sampling on the human genome to generate the training data. The total length of the training data is roughly 5 times that of the entire human genome.
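As a rough illustration of that sampling scheme, here is a minimal sketch (the sequence length, the N-filtering, and the function name are my assumptions, not the authors' actual preprocessing script):

```python
import random

def sample_subsequences(genome, n_samples, seq_len=510):
    """Randomly sample fixed-length sub-sequences from a genome string.

    Hypothetical sketch: seq_len=510 is an assumption, not the
    authors' confirmed setting.
    """
    samples = []
    while len(samples) < n_samples:
        start = random.randrange(len(genome) - seq_len)
        subseq = genome[start:start + seq_len]
        if "N" not in subseq:  # skip regions with ambiguous bases
            samples.append(subseq)
    return samples
```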
Hi,
Can I get the exact number of sequences used for training?
For instance, the provided example_6_3k.txt includes 3,000 sequences.
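For reference, each line of the example files holds one sequence written as space-separated k-mers; a minimal sketch of the 6-mer conversion, assuming that format (the function name is hypothetical):

```python
def seq_to_kmers(seq, k=6):
    """Convert a DNA sequence into overlapping, space-separated k-mers."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

# seq_to_kmers("ATGCATGC") -> "ATGCAT TGCATG GCATGC"
```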
Hi,
I'm trying to use this model for microbiome data. As a practice run, I pre-trained a model on a small dataset (10K sequences extracted from virus genomes). Unfortunately, a model fine-tuned from that pre-trained model performs extremely poorly, like a random model.
Can you inform me how many sequences were used for pre-training?
Which fine-tuning task did you choose? I guess the length of the fine-tuning sequences and the preprocessing step matter a lot.
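In case it helps with that preprocessing step, fine-tuning data is typically prepared as a TSV of k-merized sequences plus labels; a minimal sketch under that assumption (the column layout and function name are guesses; check the repo's sample fine-tuning data for the exact format):

```python
import csv

def write_finetune_tsv(records, path, k=6):
    """Write (sequence, label) pairs as a TSV with k-merized sequences.

    Hypothetical sketch; verify the header and columns against the
    repo's sample fine-tuning data.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["sequence", "label"])
        for seq, label in records:
            kmers = " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))
            writer.writerow([kmers, label])
```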
Sorry, I forget the exact number of pre-training sequences. It should be around 5 million.
@Zhihan1996 Thanks for your answer.
I have one last question. Did you sample DNA sequences from one strand or from both? In other words, were reverse-complement sub-sequences included in the training dataset?
Hi @hjgwak,
We have used both strands. Let me know if there are additional questions.
Thanks, Jerry
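To make the both-strands point concrete, here is a minimal sketch of reverse-complement sampling (the 50/50 strand choice and the function names are my assumptions, not the authors' confirmed scheme):

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def sample_strand(seq):
    """Return the sub-sequence from the forward or reverse strand at random."""
    return seq if random.random() < 0.5 else reverse_complement(seq)

# reverse_complement("ATGC") -> "GCAT"
```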
Hi @hjgwak and @Zhihan1996, Thanks for your interesting work on DNABERT.
Could the preprocessed pre-training data be made available in the same file format as example_6_3k.txt?