gmarcais / Jellyfish

A fast multi-threaded k-mer counter

Calculation of genome size #164

Open bioramg opened 4 years ago

bioramg commented 4 years ago

Hi, I would like to estimate the genome size of a plant from whole-genome ONT nanopore reads (~8.7 GB). My aim is to assemble the mitochondrial genome from these reads. I know the approximate chloroplast genome size (165 kb), and the mitochondrial genome is ~1 Mb, but I do not know the depth of coverage or the expected genome size. I am a beginner and don't know how to calculate either of them. I used jellyfish with a k-mer length of 27 to estimate the genome size, but could not get a good result. Input command:

./jellyfish count -m 27 -s 100M -t 10 -C ONT.fastq

I have attached a histogram file. Please help me work out the expected genome size. Thank you. histogram.txt

brobr commented 4 years ago

Have you seen this tutorial? https://bioinformatics.uconn.edu/genome-size-estimation-tutorial/

bioramg commented 4 years ago

Yes. I tried it and have attached a histogram file for your reference. I was not able to get a graph. Please take a look at the file. histogram.txt

brobr commented 4 years ago

It looks like almost all of your k-mers are unique. Maybe your k-mer length (-m) is too high? Play around with that value. In my case, for a fungal genome of 19 Mb, I got a nicely distributed peak with 10- and 11-mers, but with 12-mers the distribution was already nearly asymptotic (i.e. almost everything falls at '1', meaning most 12-mers are found only once in that genome).
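One rough way to reason about the effect of -m is to compare the number of possible k-mers, 4^k, with the genome size: when 4^k is far larger than the genome, most k-mers in the genome are expected to occur only once. The sketch below is only an illustration and not part of Jellyfish; `kmer_space` is a hypothetical helper, and the 19 Mb figure is the fungal genome size mentioned above.

```python
def kmer_space(k: int, canonical: bool = False) -> int:
    """Return the number of possible DNA k-mers of length k.

    With canonical=True, a k-mer and its reverse complement count as one
    (the counting mode selected by jellyfish's -C option).
    """
    total = 4 ** k
    if not canonical:
        return total
    # A k-mer can equal its own reverse complement only when k is even;
    # there are 4^(k/2) such k-mers, and each forms a class of its own.
    self_complementary = 4 ** (k // 2) if k % 2 == 0 else 0
    return (total + self_complementary) // 2


if __name__ == "__main__":
    genome_size = 19_000_000  # 19 Mb fungal genome from the comment above
    for k in (10, 11, 12, 15, 17, 27):
        space = kmer_space(k)
        ratio = space / genome_size
        print(f"k={k:2d}  possible k-mers={space:>20,d}  ratio to genome={ratio:.3g}")
```

The ratio only describes how crowded the k-mer space is for a given genome size. It says nothing about sequencing errors, which also push k-mers toward a count of 1.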

bioramg commented 4 years ago

Yes. I am able to get a distribution with k-mer values of 11 - 17. Thank you for your suggestion.

bioramg commented 4 years ago

I used the following command for the Nanopore reads (total raw read file size is 8.7 GB):

./jellyfish count -m 15 -s 100M -t 10 -C ONT.fastq

With this command I can see an evenly distributed k-mer histogram, and the distribution was at its maximum with 15-mers. So I did the calculation according to the tutorial suggested earlier. I have some questions for better understanding:

  1. In my histogram file there is no high peak at 1, so I calculated it like this: sum(as.numeric(data15[1:10000,1]*data15[1:10000,2]))

The expected genome size from the above command with 15-mers: 1.8 Gb

Is this correct? Please check my attachment. 15mer_histogram.txt

  2. Nanopore reads are single-stranded and are not paired. So, should I remove the -C option from the command?

  3. What is the -s option for? I used 100M here (I saw this number on a SEQanswers page), but the tutorial uses 5G. Does it represent the total size of the ONT raw read file? How exactly should it be used?

  4. How do I determine the single-copy region? From which point should it be calculated? The tutorial used 2 - 28. Should I check the point at which the distribution reaches 0?
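For the question about where the single-copy region starts, one common heuristic (assumed here, not taken from the tutorial) is to treat the first local minimum of the histogram as the boundary between low-count, likely erroneous k-mers and genomic k-mers, and the highest count after that minimum as the main peak. The sketch below assumes the two-column text written by `jellyfish histo` (multiplicity, then number of distinct k-mers); `parse_histo` and `find_valley_and_peak` are hypothetical helper names.

```python
from typing import List, Optional, Tuple

Histogram = List[Tuple[int, int]]  # (multiplicity, number of distinct k-mers)


def parse_histo(text: str) -> Histogram:
    """Parse two-column histogram text into sorted (multiplicity, count) pairs."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            rows.append((int(parts[0]), int(parts[1])))
    return sorted(rows)


def find_valley_and_peak(rows: Histogram) -> Optional[Tuple[int, int]]:
    """Return (valley multiplicity, peak multiplicity), or None.

    The valley is the first multiplicity after which the count rises again.
    The peak is the multiplicity with the highest count from the valley on.
    None means the counts never rise, i.e. there is no visible peak.
    """
    for i in range(len(rows) - 1):
        if rows[i + 1][1] > rows[i][1]:
            valley = rows[i][0]
            peak = max(rows[i:], key=lambda row: row[1])[0]
            return valley, peak
    return None


if __name__ == "__main__":
    # Made-up numbers for illustration only, not data from this thread.
    example = "1 1000\n2 300\n3 100\n4 150\n5 400\n6 900\n7 700\n8 200\n"
    print(find_valley_and_peak(parse_histo(example)))
```

A histogram that only ever decreases returns `None`, which is a hint to try a different -m before estimating anything. Real histograms are noisy, so it is worth checking the detected valley and peak against a plot.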

I also tried another command, this time without the -C parameter:

./jellyfish count -m 14 -s 100M -t 10 ONT.fastq

sum(as.numeric(histo14[1:10000,1]*histo14[1:10000,2]))/41

The expected genome size is 1.13 Gb. I also tried 15-mers without -C, but the distribution was not even. 14mer_histogram_2.txt

I would like to know which of the two estimates is correct. Please advise.

Thank you.
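The R expressions quoted in this comment follow the approach of the linked tutorial: sum multiplicity times count over the histogram to get the total number of k-mers, then divide by the depth at the main peak. A minimal Python sketch of the same arithmetic follows; `estimate_genome_size` and its parameters are hypothetical names, and the rows are made-up numbers, not values from the attached histograms.

```python
from typing import Iterable, Optional, Tuple


def estimate_genome_size(
    rows: Iterable[Tuple[int, int]],
    peak_depth: int,
    min_multiplicity: int = 1,
    max_multiplicity: Optional[int] = None,
) -> float:
    """Estimate genome size as (total k-mers in range) / (peak depth).

    rows holds (multiplicity, number of distinct k-mers) pairs.
    min_multiplicity lets the caller drop the low-count k-mers that are
    usually attributed to sequencing errors; max_multiplicity caps the range.
    """
    if peak_depth <= 0:
        raise ValueError("peak_depth must be a positive integer")
    total_kmers = 0
    for multiplicity, count in rows:
        if multiplicity < min_multiplicity:
            continue
        if max_multiplicity is not None and multiplicity > max_multiplicity:
            continue
        total_kmers += multiplicity * count
    return total_kmers / peak_depth


if __name__ == "__main__":
    # Toy histogram: an error spike at 1 and a main peak at 10.
    toy = [(1, 500), (9, 100), (10, 300), (11, 100)]
    print(estimate_genome_size(toy, peak_depth=10))                      # whole range
    print(estimate_genome_size(toy, peak_depth=10, min_multiplicity=2))  # error spike dropped
```

The estimate depends on both the summed range and the chosen peak depth, so it helps to report these alongside the -m value and whether -C was used when comparing runs.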