Best practice for large datasets?

JohnsonStev commented 3 years ago

Dear Isaac,

I would like to apply mccortex to a large-scale resequencing project (~400 individuals, 1GB genome size). I read through the wiki, and here is what I think a possible workflow might look like:

1. Build graphs for each sample and the reference with one chosen kmer size
2. Clean each of the graphs
3. Merge the clean graphs
4. Thread reads to produce link files
5. Clean the link files
6. Merge the clean link files
7. Call the variants

Do you have any suggestions about the workflow, or are there any pitfalls I need to be aware of? Thank you so much.

winni2k commented 3 years ago

That should work in principle. If you don't hear back from Isaac, you might try asking @kvg.

JohnsonStev commented 3 years ago

Thanks for the answer. I am trying to merge all the clean graphs together in one single command, and it took a lot of time. Would it save time to merge a few graphs in parallel first, then merge those merged graphs? Thank you again.

kvg commented 3 years ago

McCortex loads all the graphs into memory before joining them, and yes, this can be a bit slow. I think what you've outlined would be faster, but it's not clear that the improvement would be particularly significant (I'd imagine it depends on the contents of the graphs - particularly the number of shared k-mers between each sample).
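To make the "merge in batches, then merge the merged graphs" idea concrete, here is a minimal Python sketch that only plans the rounds; it does not run any tool. The function name `plan_merge_rounds`, the batch size of 20, and all file names are hypothetical and chosen purely for illustration. Merges within a round are independent of each other, so they could be submitted as parallel jobs.

```python
from typing import List, Tuple

Merge = Tuple[List[str], str]  # (input graph paths, output graph path)


def plan_merge_rounds(graphs: List[str], batch_size: int) -> List[List[Merge]]:
    """Plan a hierarchical merge: each round lists merges that can run in parallel."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    rounds: List[List[Merge]] = []
    current = list(graphs)
    round_no = 0
    while len(current) > 1:
        merges: List[Merge] = []
        next_inputs: List[str] = []
        for j, start in enumerate(range(0, len(current), batch_size)):
            batch = current[start:start + batch_size]
            if len(batch) == 1:
                # A lone leftover graph needs no merge; carry it to the next round.
                next_inputs.append(batch[0])
            else:
                out = f"round{round_no}_batch{j}.ctx"  # hypothetical intermediate name
                merges.append((batch, out))
                next_inputs.append(out)
        rounds.append(merges)
        current = next_inputs
        round_no += 1
    return rounds


# Hypothetical input names for ~400 cleaned sample graphs.
samples = [f"sample{i}.clean.ctx" for i in range(400)]
for n, merges in enumerate(plan_merge_rounds(samples, batch_size=20)):
    print(f"round {n}: {len(merges)} independent merge(s)")
```

Smaller batches also mean fewer graphs held in memory by any one job, but every extra round rewrites the intermediate graphs to disk, so whether this is a net win depends on the data, as noted above.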

An alternate strategy that might help you is the "Join" command we wrote in a companion tool, Corticall. This assumes your graphs are stored in sorted order (with the '-s' option in mccortex commands), and then the graphs are merged linearly. This tends to be much faster than the built-in McCortex join command; I've used this to merge a couple hundred microbial genomes. The resulting joined graphs will remain compatible with all of mccortex's subcommands.
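To illustrate why sorted graphs can be merged linearly, here is a small Python sketch of a streaming k-way merge. It is not Corticall's actual implementation, and the record layout is a simplifying assumption: each graph is modelled as an iterable of `(kmer, coverage)` pairs in ascending k-mer order, whereas real graph files are binary and also store edges. The names `merge_sorted_graphs` and `_tag` are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, int]  # (kmer, coverage): a simplified stand-in for a graph record


def _tag(idx: int, records: Iterable[Record]) -> Iterator[Tuple[str, int, int]]:
    """Attach the index of the source graph to every record."""
    for kmer, cov in records:
        yield kmer, idx, cov


def merge_sorted_graphs(graphs: List[Iterable[Record]]) -> Iterator[Tuple[str, List[int]]]:
    """Merge k-mer-sorted graphs in one pass, yielding (kmer, coverage per input graph)."""
    streams = [_tag(i, g) for i, g in enumerate(graphs)]
    # heapq.merge holds only one pending record per input, so memory use stays small.
    merged = heapq.merge(*streams)
    for kmer, group in groupby(merged, key=itemgetter(0)):
        covs = [0] * len(graphs)
        for _, idx, cov in group:
            covs[idx] += cov
        yield kmer, covs


a = [("AAC", 3), ("ACG", 1), ("TTG", 5)]
b = [("AAC", 2), ("CGT", 4)]
c = [("ACG", 7), ("TTG", 1)]
for kmer, covs in merge_sorted_graphs([a, b, c]):
    print(kmer, covs)
```

Because every input is read once from front to back, nothing needs to be loaded into memory in full, which is the contrast with the load-everything-then-join behaviour described above.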

After downloading and building Corticall, the command-line for this would be:

$ java -jar build/jars/corticall.jar Join -g -g ... -g -o joined.ctx
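With a few hundred graphs, typing that command by hand gets unwieldy. Below is a minimal Python sketch that assembles the argument list. It assumes that each `-g` is followed by exactly one graph path, which the command above suggests but does not state outright; the helper name `build_join_command` and the input file names are hypothetical. The sketch only prints the command and does not run it.

```python
import shlex
from typing import List


def build_join_command(graphs: List[str], output: str,
                       jar: str = "build/jars/corticall.jar") -> List[str]:
    """Build the argument list for the Join command, one -g per graph (assumed)."""
    cmd = ["java", "-jar", jar, "Join"]
    for path in graphs:
        cmd += ["-g", path]  # assumption: each -g takes a single graph path
    cmd += ["-o", output]
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_join_command(
    ["sample0.clean.ctx", "sample1.clean.ctx", "sample2.clean.ctx"],
    "joined.ctx",
)
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it without any shell quoting issues.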

Please let me know if that does or doesn't work for you.

JohnsonStev commented 3 years ago

Dear KVG,

Thanks for your response, I will try it.

Meanwhile, I am still working through the whole workflow using a subset of the data. I've done link threading plus link error cleaning for each sample. Now I am trying to merge the link files, but I found that I don't know how to generate a "ref.ctp.gz" or "refAndSamples.ctp.gz" file. All I got after running "thread" were "sample.ctp.gz" files.

Thank you so much for the help!!

mcveanlab / mccortex

Best practice for large datasets? #92