real reads versus simulated reads

fengyuchengdu commented 4 years ago

Hi Torsten,

Many bacterial genomes have no SRA data, which prevents us from performing several read-based analyses. I think you mentioned somewhere that "best reads are contigs". Do you reckon that simulated reads generated from the assembly (perfect reads: no errors, perfectly even coverage) should be used in Snippy even when we do have the real reads?

I've run some tests to see whether the two give significantly different results (the simulated reads were generated from the Shovill-assembled genome, with the same read length and coverage as the real reads), and the answer is "yes". Snippy detected more variants, including SNPs, with the simulated reads than with the real reads. So I am wondering which result is closer to the truth.

Thanks

Yu

tseemann commented 4 years ago

Shredding draft genomes can be a problem, yes. Did you do it yourself, or use snippy --ctgs?

fengyuchengdu commented 4 years ago

I tried both snippy --ctgs and readSimulator.py, a wgsim wrapper (https://github.com/wanyuac/readSimulator), to generate perfect reads. The readSimulator.py command was:

readSimulator.py --input input.fasta --simulator wgsim --simulator_path /home/zong/anaconda3/envs/py36/bin/wgsim --depth 100 --outdir simulated_reads --readlen 150 --opts '-e 0 -d 350 -r 0 -R 0 -X 0 -h -S 0'
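For readers unfamiliar with the idea, "perfect reads" here means error-free, fixed-length substrings of the assembly at a chosen depth. The sketch below only illustrates that concept; it is not how wgsim, readSimulator.py, or snippy --ctgs generate reads internally, and the function name shred_contig and its parameters are made up for this example.

```python
def shred_contig(seq, read_len=150, depth=100):
    """Yield (start, read) pairs of error-free reads tiled evenly over seq.

    Illustrative only: single-end reads, forward strand, one contig.
    Contigs shorter than read_len yield nothing.
    """
    if len(seq) < read_len:
        return
    # Number of reads needed so that total read bases ~= depth * contig length.
    n_reads = max(1, round(len(seq) * depth / read_len))
    last_start = len(seq) - read_len
    for i in range(n_reads):
        # Spread start positions evenly between 0 and the last valid start.
        start = round(i * last_start / (n_reads - 1)) if n_reads > 1 else 0
        yield start, seq[start:start + read_len]


if __name__ == "__main__":
    contig = "ACGT" * 100  # toy 400 bp contig, not real data
    for start, read in shred_contig(contig, read_len=150, depth=3):
        print(start, len(read))
```

Even with evenly spaced start positions, bases near the ends of a contig are covered by fewer reads than bases in the middle, so coverage of a fragmented draft assembly is never perfectly even.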

Number of variants called: snippy --ctgs <= real reads <= wgsim-shredded reads
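Counts alone do not show whether the extra calls are new sites or the same sites plus more, so it may help to compare the call sets directly. A minimal sketch, assuming the variants from each run are available as VCF text (Snippy writes a snps.vcf in its output directory); the function names and the toy records are invented for illustration and are not results from this issue.

```python
def load_variants(vcf_lines):
    """Return a set of (chrom, pos, ref, alt) tuples from VCF lines."""
    variants = set()
    for line in vcf_lines:
        # Skip blank lines and header lines ("##..." and "#CHROM...").
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # Split multi-allelic records into one entry per ALT allele.
        for allele in alt.split(","):
            variants.add((chrom, int(pos), ref, allele))
    return variants


def compare_calls(a, b):
    """Partition two variant sets into shared and set-specific calls."""
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


if __name__ == "__main__":
    # Toy records, not real Snippy output.
    real = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "contig1\t100\t.\tA\tG\t.\t.\t.",
        "contig1\t250\t.\tC\tT\t.\t.\t.",
    ]
    simulated = real + ["contig1\t900\t.\tG\tA\t.\t.\t."]
    result = compare_calls(load_variants(real), load_variants(simulated))
    for name, calls in result.items():
        print(name, sorted(calls))
```

With real files, the lines could come from open("real/snps.vcf") and open("simulated/snps.vcf"), where both directory names are placeholders. Calls that appear only in the simulated-read run are the ones worth inspecting against the real-read alignments.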

tseemann / snippy

real reads versus simulated reads #398