tseemann / snp-dists

Pairwise SNP distance matrix from a FASTA sequence alignment
GNU General Public License v3.0

running SNP-dists on large sample set #51

Open DorothyTamYiLing opened 1 year ago

DorothyTamYiLing commented 1 year ago

Hi tseemann,

First of all, thanks for writing this piece of software.

I am trying to run snp-dists on a large sample set (an alignment of 3745 samples, each 4,988,504 bases long). It has been running for more than 24 hours, and I wonder whether that is normal. How long do you think it will take to finish for an input of this size? I have stopped the run for now, as I would like to get a rough estimate of the run time first.

Thanks, Dorothy

kloetzl commented 1 year ago

Hi Dorothy, the runtime of snp-dists scales quadratically with the number of samples. Say c is the time for a single pairwise comparison. snp-dists makes O(n^2) comparisons, so for your data set the time is about 3745^2 * c. If c is 10 ms, that is still about 39 hours! To get a good estimate for c, I recommend running the analysis on just 37 samples. Since 3745 is roughly 100 times 37, multiply the resulting time by about 10,000 (100^2) to get the runtime for the whole data set.
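The extrapolation above can be sketched in a few lines of Python. This is only a rough illustration, assuming the runtime is dominated by the n^2 pairwise comparisons and that the pilot run uses the same alignment length as the full run. The function name `extrapolate_runtime` and the pilot timing of 14 seconds are hypothetical placeholders, not measurements of snp-dists.

```python
def extrapolate_runtime(pilot_seconds: float, pilot_samples: int, full_samples: int) -> float:
    """Scale a measured pilot runtime to the full data set.

    Assumes runtime grows with the square of the number of samples
    and ignores fixed costs such as reading the alignment file.
    """
    if pilot_samples <= 0 or full_samples <= 0:
        raise ValueError("sample counts must be positive")
    scale = (full_samples / pilot_samples) ** 2
    return pilot_seconds * scale


# Hypothetical pilot timing: replace 14.0 with the time you actually measure
# for the 37-sample subset.
pilot_seconds = 14.0
estimate = extrapolate_runtime(pilot_seconds, pilot_samples=37, full_samples=3745)
print(f"estimated full runtime: {estimate / 3600:.1f} hours")

# The same arithmetic from a per-comparison cost c (here the 10 ms guess):
c_seconds = 0.010
print(f"3745^2 * c = {3745 ** 2 * c_seconds / 3600:.1f} hours")
```

Because a small pilot run also includes fixed start-up costs, the extrapolated figure is likely to err on the high side, which is fine for deciding whether the full run is feasible.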

If the resulting estimate is way too large you can compute approximate solutions using mash or phylonium.

Hope this helps, Fabian

DorothyTamYiLing commented 1 year ago

Hi Fabian,

Thanks for the useful tips! I will give the calculation a go and maybe try to reduce the sample set too.

Thanks, Dorothy