Strange result when trying to reproduce article data

Prunoideae commented 4 years ago

Dear Meng, I'm testing the performance and result quality of MitoZ, and I recently got a strange result for an SRA dataset (SRR955308): the assembly is only about 3 kbp long and is missing many of the PCGs that should be annotated (see the SRR955308 circos plot). The run summary is summary.txt, and the run log is SRR955308.runtime.log.

Strangely, this works in my workflow:

SRR955308.log

In that run, all genes were annotated.

I guess this is because I used 80 threads in the run, which triggered a data race in the de novo assembly. This could be dangerous for servers that run MitoZ on highly parallel systems to speed things up. It also makes it hard for me to do a "fair" benchmark between MitoZ and my workflow, since run time is also considered and discussed...
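For a like-for-like timing comparison, one option is to run each tool through the same small harness at a series of fixed thread counts. The following is only a minimal sketch: `time_command` and the `{threads}` placeholder are hypothetical names introduced here, and the actual MitoZ or workflow command line has to be supplied by the user.

```python
import subprocess
import time


def time_command(command_template, thread_counts):
    """Run one command per thread count and record its exit code and wall time.

    command_template is a list of argument strings. Any "{threads}" placeholder
    (a convention assumed for this sketch) is replaced by the thread count.
    Arguments that contain other literal braces would need escaping.
    """
    results = {}
    for threads in thread_counts:
        command = [part.format(threads=threads) for part in command_template]
        start = time.perf_counter()
        completed = subprocess.run(command, capture_output=True, text=True)
        elapsed = time.perf_counter() - start
        results[threads] = (completed.returncode, elapsed)
    return results
```

Calling it with the same template for thread counts such as `[8, 16, 80]` gives one `(exit code, seconds)` pair per setting, so run time and output quality can be reported for the same thread count.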

Also, I have a question about the run mode 3: How am I supposed to make use of multi-Kmer mode if I'm assembling a new species but with poor result in quick mode? If I don't have such a sequence of the missing gene, does it mean MitoZ can't do anything to improve the result?

Prunoideae commented 4 years ago

It turned out that the high thread count was causing the assembly process to get stuck or to lose too many k-mers: the outcome varied randomly across several reruns, so MitoZ produced different results from the same dataset with the same parameters. Maybe SOAPdenovo-Trans is not designed to be completely thread-safe, so problems like data races and thread contention occur more often as the thread count increases. Giving it 8 or 16 threads may be fine, but the chance of a collision rises to a ridiculous level when 80 threads are specified.
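One way to check whether reruns really differ is to compare a digest of the assembled sequences rather than the raw files, since record order, header names and strand can change without the assembly itself changing. This is a minimal sketch under those assumptions; the function names are hypothetical and the paths to each run's FASTA output must be supplied by the user.

```python
import hashlib

# Complement table for plain nucleotides; other IUPAC codes pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def read_fasta_sequences(path):
    """Return the sequences in a FASTA file as uppercase strings, ignoring headers."""
    sequences, chunks = [], []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if chunks:
                    sequences.append("".join(chunks))
                chunks = []
            elif line:
                chunks.append(line.upper())
    if chunks:
        sequences.append("".join(chunks))
    return sequences


def canonical(seq):
    """Pick a strand-independent form: the smaller of seq and its reverse complement."""
    revcomp = seq.translate(_COMPLEMENT)[::-1]
    return min(seq, revcomp)


def assembly_digest(path):
    """SHA-256 over the sorted canonical sequences of one assembly."""
    canon = sorted(canonical(s) for s in read_fasta_sequences(path))
    return hashlib.sha256("\n".join(canon).encode()).hexdigest()


def runs_agree(paths):
    """True if every assembly in paths contains the same set of sequences."""
    return len({assembly_digest(p) for p in paths}) == 1
```

If `runs_agree` returns `False` for several reruns with identical input and parameters, the assembly step is not deterministic at that thread count.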

linzhi2013 commented 4 years ago

Hi Prunoideae,

Sorry for my late reply, I have been busy recently.

It's true that the dataset used affects the results of MitoZ. To my knowledge, input data of different sizes can give different results. Generally, a larger dataset leads to a more complete mitogenome, but in rare cases less data gives better results.

> Also, I have a question about the run mode 3: How am I supposed to make use of multi-Kmer mode if I'm assembling a new species but with poor result in quick mode? If I don't have such a sequence of the missing gene, does it mean MitoZ can't do anything to improve the result?

The missing genes correspond to PCGs. Therefore, if no PCGs are missing, multi-kmer mode may bring no improvement. However, you can always declare some PCGs as "missing" (even if they are already in your quick-mode results); with that provided, MitoZ can run the multi-kmer mode.
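As a rough illustration of working out which PCGs count as missing, the sketch below compares the genes annotated in a quick-mode run against the 13 PCGs of a typical metazoan mitogenome. The gene name spellings and the `missing_pcgs` helper are assumptions made for this example, and the list of annotated genes has to be taken from the run summary by hand.

```python
# The 13 protein-coding genes (PCGs) of a typical metazoan mitogenome.
# The spellings are an assumption; adjust them to match the annotation output.
EXPECTED_PCGS = (
    "ATP6", "ATP8", "COX1", "COX2", "COX3", "CYTB",
    "ND1", "ND2", "ND3", "ND4", "ND4L", "ND5", "ND6",
)


def missing_pcgs(annotated_genes):
    """Return the expected PCGs that do not appear in annotated_genes."""
    found = {name.strip().upper() for name in annotated_genes}
    return [gene for gene in EXPECTED_PCGS if gene not in found]
```

An empty result means no PCG is missing, which is the case where some genes would have to be declared as missing anyway in order to run the multi-kmer mode.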

If users want to try other k-mer sizes, they can invoke the mitoAssemble command directly to assemble the nucl+mito scaffolds, and then use the findmitoscaf command to search for the mitogenome.

Cheers, Guanliang

Prunoideae commented 4 years ago

Dear Meng, thank you for your reply. I believe I downloaded the complete dataset and extracted it correctly from SRA, so the size or integrity of the data should not be the problem, since I used the same archive throughout. A further test with a low thread count (8) produced a correct result, so I guess the assembler just can't handle that many threads.

linzhi2013 / MitoZ

Strange result when trying to reproduce article data #60