Running mashtree with large number of genomes

seansolari commented 3 years ago

Hi there!

I'm struggling to get mashtree to finish on a dataset of 29611 genomes (371 Archaea, 19238 Bacteria, 10002 Virus) totalling just shy of 80 Gb of sequence data. I'm running mashtree with 8 threads on an HPC, with 600 GB of RAM allocated to the job. The sketching and distance database steps complete in a reasonable amount of time (between 24 and 36 hours), but I've not managed to get it past the following stage:

mashtree: mashDistance: Converting to phylip format into /tmp/MASHTREE.9kWFcb/distances.phylip

For the rest of the run (up to the 96-hour time limit), it doesn't get past this step. Do you have any advice on how to get mashtree working on this dataset? Is this expected behaviour, or do I need to allocate more resources to the job?
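For a sense of scale, here is a back-of-the-envelope sketch of how large the distance matrix becomes for this many genomes. It is only an estimate under stated assumptions: it assumes the PHYLIP file is a full square matrix, and the `bytes_per_entry` value (about 11 characters per distance, including a separator) is a guess rather than anything taken from the mashtree source. The function name is hypothetical.

```python
def phylip_matrix_estimate(n_genomes: int, bytes_per_entry: int = 11) -> dict:
    """Rough size of a pairwise distance matrix for n_genomes.

    bytes_per_entry is an assumed average width of one distance
    written as text; it is not taken from mashtree itself.
    """
    # Unique unordered pairs that need a distance.
    pairs = n_genomes * (n_genomes - 1) // 2
    # Cells in a full square matrix, which is the assumed output layout.
    cells = n_genomes * n_genomes
    approx_bytes = cells * bytes_per_entry
    return {
        "pairs": pairs,
        "cells": cells,
        "approx_gib": approx_bytes / 2**30,
    }


est = phylip_matrix_estimate(29611)
print(f"unique pairs:  {est['pairs']:,}")
print(f"matrix cells:  {est['cells']:,}")
print(f"approx. size:  {est['approx_gib']:.1f} GiB")
```

With 29611 genomes this gives about 438 million unique pairs and about 877 million matrix cells, so any conversion step that handles the matrix one cell at a time, or holds it all in memory, has a lot of work to do.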

I'm running mashtree v1.2.0, installed on the Linux HPC by cloning the GitHub repo. Any help would be greatly appreciated!

hmontenegro commented 3 years ago

Probably a duplicate of #55.

lskatz commented 3 years ago

Yep, probably a duplicate. I'll start looking into the issue on #55.

lskatz / mashtree

Running mashtree with large number of genomes #58