Different results on repeated runs

donkirkby commented 4 years ago

Thanks for publishing IVA, it's been really helpful in our project. However, when I try to assemble the same reads multiple times, I sometimes get different results. Here's a small example that tries to assemble 200 read pairs that were randomly generated from the first 2000 bases of HIV's pol gene. The test script assembles them 20 times, and prints the location of each contig within the pol gene. You can see that the contig sizes and locations change from run to run.

$ sudo docker run -it --rm sangerpathogens/iva
root@8d1ab295e484:/# wget -qq https://raw.githubusercontent.com/cfe-lab/MiCall/55f622e236286f2d003b255bc11a8b1080168335/micall/tests/microtest/2180A-HIV_S22_L001_R1_001.fastq https://raw.githubusercontent.com/cfe-lab/MiCall/55f622e236286f2d003b255bc11a8b1080168335/micall/tests/microtest/2180A-HIV_S22_L001_R2_001.fastq https://raw.githubusercontent.com/cfe-lab/MiCall/55f622e236286f2d003b255bc11a8b1080168335/micall/tests/repeat_iva.py
root@8d1ab295e484:/# python3 repeat_iva.py 
contig.00001: 125-1080, contig.00002: 1383-1917, contig.00003: 1031-1474
contig.00001: 125-1071, contig.00002: 1031-1474, contig.00003: 1447-1806
contig.00001: 125-1080, contig.00002: 1031-1474, contig.00003: 1447-1806
contig.00001: 125-1080, contig.00002: 1383-1917, contig.00003: 1031-1474
contig.00001: 125-1071, contig.00002: 1383-1917, contig.00003: 1031-1474
contig.00001: 125-1080, contig.00002: 1383-1917, contig.00003: 1031-1474
contig.00001: 125-1071, contig.00002: 1031-1474, contig.00003: 1447-1806
contig.00001: 125-1080, contig.00002: 1031-1474, contig.00003: 1447-1806
contig.00001: 125-1080, contig.00002: 1383-1917, contig.00003: 1031-1474
contig.00001: 125-1071, contig.00002: 1383-1917, contig.00003: 1031-1474
contig.00001: 125-1080, contig.00002: 1031-1474, contig.00003: 1447-1806
contig.00001: 125-1071, contig.00002: 1031-1474, contig.00003: 1447-1806
contig.00001: 125-1071, contig.00002: 1031-1474, contig.00003: 1447-1806
contig.00001: 125-1080, contig.00002: 1383-1917, contig.00003: 1031-1474
contig.00001: 125-1071, contig.00002: 1383-1917, contig.00003: 1031-1474
contig.00001: 125-1071, contig.00002: 1031-1474, contig.00003: 1447-1806
contig.00001: 125-1080, contig.00002: 1031-1474, contig.00003: 1447-1806
contig.00001: 125-1071, contig.00002: 1383-1917, contig.00003: 1031-1474
contig.00001: 125-1080, contig.00002: 1383-1917, contig.00003: 1031-1474
contig.00001: 125-1071, contig.00002: 1383-1917, contig.00003: 1031-1474

I traced through the IVA code, and it looks like the problem is inconsistent results from the smalt mapper. I reported the problem to the smalt developers, but I haven't heard back yet.

In our project, we've worked around the problem by switching from smalt to bowtie2, so I'll create a pull request with our bowtie2 version.

martinghunt commented 4 years ago

Thanks for flagging this and the pull request. I'm a bit wary of just swapping out the default mapper, because it's such a fundamental part of the assembler.

Did you try the -r option of smalt to set the random seed to the same value for each run (like -r 42 added to the smalt command)? I think that should give the same results every time.

I'm guessing you don't get worse assemblies using bowtie2? Do you know how it affected the run time? I vaguely remember trying bowtie2 when I was developing IVA and finding smalt's speed was enough to make me use it instead of bowtie2, but bowtie2 will no doubt have got faster since then.

donkirkby commented 4 years ago

I understand that it's a big change to swap out the mapper, and I had not tried the -r option. My first tests with -r 42 and -r 1 both worked, but I was a bit surprised that -r 0 didn't. Looking at smalt map -H, though, I see this:

-r <seed [INT]> If <seed> >= 0 report an alignment selected at random where there are multiple mappings with the same best alignment score. With <seed> = 0 (default) a seed is derived from the current calendar time. If <seed> < 0 reads with multiple best mappings are reported as 'not mapped'.

So thanks for suggesting the -r option, that looks like a much smaller change to solve the core stability problem. I'm still curious about bowtie2, though, because it seems to assemble longer contigs. It is slower, though not necessarily because of the mapping performance. I think it may just be running many more iterations than smalt did.

I'll close the bowtie2 pull request for now, and create a new one for the -r1 change. I'll let you know if I get more definitive results for bowtie2 or another mapper like bwa.

martinghunt commented 4 years ago

Thanks for all this - it's very useful! Interesting that bowtie2 is producing longer contigs. I think a command line option to choose bowtie2 instead of smalt might be a good idea. Sounds like it's a speed vs accuracy trade off.

donkirkby commented 4 years ago

After more testing, the -r 1 option is still working well in most scenarios. I was a little worried that using multiple threads would make the random numbers unpredictable again, but I only see a problem with messy reads where I didn't trim the adapters. If I have a small sample with 800 read pairs, and I run IVA with two threads on the original reads, I get a variety of different contigs on repeated runs of IVA. However, if I run it with one thread, I get the same results every time. If I trim the adapters, then I get better results, and the same results every time, even when I use two threads.

I guess the conclusion is to use one thread to make sure it's repeatable, but two threads haven't been a problem yet for trimmed samples.

sanger-pathogens / iva

Different results on repeated runs #93