question about memory requirements and determining appropriate kmer size

Hi there,

I have completed an initial analysis using DiscoverY in female+male mode, and I am wondering how to determine whether the kmer size I used is optimal for my data. So far I have used the default size of 25, but I understand that this may need to be adjusted based on the specific characteristics of the genome I'm working with.

I have plotted the results of my analysis (attached). There seem to be a large number of kmers with very low similarity to the female genome but high depth, even though the female genome is of quite good quality. The organism we're working on has a neo sex chromosome system, so I suspect the Y regions are clustering with the X regions at the bottom right of the graph (confirming this was actually my reason for using DiscoverY). However, I'm less sure why so many male contigs would have very low similarity to the female but rather high coverage. I don't know whether this is a result of my kmer parameter or something else, but I'm hoping you might be able to offer some advice.

In addition, this analysis required about 720 GB of RAM, which is about double the estimate in the paper and almost the maximum I'm allowed to request per node on the cluster I'm using. Can DiscoverY run in parallel so that I can spread this memory across multiple nodes? I don't see anything in the documentation or the paper that mentions this, but it would be very helpful for subsequent analyses.

Thanks,
Erin

Attachment: graph.pdf

makovalab-psu/DiscoverY, issue #13