makovalab-psu / DiscoverY

K-mer based classifier for Y-contig identification from Whole Genome Assemblies
MIT License
11 stars 5 forks source link

question about memory requirements and determining appropriate kmer size #13

Open eocampbe opened 4 years ago

eocampbe commented 4 years ago

Hi there,

I have completed an initial analysis using DiscoverY in the female+male mode, and I am wondering how I might determine whether the kmer size I used is optimal for the data I have. For the analysis I've done so far, I used the default size of 25, but I understand that this may need to be adjusted based on the specific characteristics of the genome I'm working with. I have plotted the results of my analysis (attached), and there seem to be a large number of kmers with very low similarity to the female genome, which is of quite good quality, but high depth. The organism we're working on has a neo sex chromosome system, so I suspect the Y regions are clustering in with the X regions on the bottom right of the graph (confirming this was actually my reason for using DiscoverY), however I'm less sure about why there might be so many male contigs that have very low similarity to the female, but rather high coverage. I don't know if this is a result of my kmer parameter or something else, but I'm hoping you might be able to offer some advice.

In addition, this analysis required about 720 Gb of RAM, which is about double what is estimated in the paper and is almost the maximum amount of RAM I'm allowed to ask for per node of the cluster I'm using. Can DiscoverY run in parallel so that I can spread this memory out across multiple nodes? I don't see anything in the documentation or the paper that mentions this, but it would be very helpful for subsequent analyses.

Thanks, Erin graph.pdf

sheinasim commented 4 years ago

Hello!

I am not a part of the group who wrote this software, but I was able to calculate the appropriate kmer size by applying the random boundary formula, (Eq. 2 in Fofanov et al 2004).

kmer size = log[ genome_size (1 - error_rate_of_reads) / error_rate_of_reads ] / log(4)

Hope this helps!