Closed nick-pestell closed 2 years ago
@nick-pestell I don't think I've added any support for circular contigs/genomes. I'd have to have a lot more information about the inputs and command line you're using. Thanks!
Thanks @nh13.
The command we're running is:
dwgsim -e 0 -E 0 -i -d 330 -N 144997 -1 150 -2 150 -r 0 -R 0 -X 0 -y 0.01 -H -z 1 <in.ref.fa> <out.prefix>
in.ref.fa is a bacterial genome sequence of about 4 Mbp into which we have introduced 16,000 random SNPs using simuG.
If we assume a linear genome, which may not be correct for your application, then there are far fewer possible start positions for inserts that overlap the first base than for inserts that overlap the tenth base (only one for the former, ten for the latter, and so on). I think that explains the reduced depths at the start and end of the contigs.
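To make the edge effect concrete, here is a minimal sketch that counts how many insert start positions cover a given base on a linear genome. It is not dwgsim code: the function name `covering_starts` is hypothetical, and it assumes every insert has the same fixed length (330 here, borrowed from the `-d 330` in the command above), whereas real insert sizes vary.

```python
def covering_starts(pos, genome_len, insert_len):
    """Count the insert start positions (1-based) whose insert covers `pos`.

    Assumes a linear genome and a fixed insert length, so an insert can
    start anywhere from 1 to genome_len - insert_len + 1.
    """
    first = max(1, pos - insert_len + 1)          # earliest start that still reaches pos
    last = min(pos, genome_len - insert_len + 1)  # latest start that fits on the genome
    return max(0, last - first + 1)


if __name__ == "__main__":
    genome_len, insert_len = 4_000_000, 330
    for pos in (1, 10, 330, 2_000_000, genome_len):
        print(pos, covering_starts(pos, genome_len, insert_len))
```

Under these assumptions the count rises linearly from 1 at the first base to the full insert length one insert-length in, then falls off symmetrically at the far end, which matches the depth profile described in the question.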
Ok, thanks @nh13, I think that makes sense to me. Closing.
It would be great to have support for circular genomes (e.g., bacterial chromosomes, plasmids, organelle genomes).
One approach might be to create multiple copies of the input FASTA, each time taking 10 kb from the start and adding it to the end of the sequence, then generating some proportion of the reads from each of the copies. Just a thought. There might be better ways of doing it.
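A rough sketch of the rotation step, reading the suggestion as moving the first 10 kb to the end of the sequence. The helper names `rotate` and `rotated_copies` are made up for illustration, and FASTA parsing and the per-copy dwgsim runs are left out.

```python
def rotate(seq, shift):
    """Move the first `shift` bases of `seq` to its end (a circular rotation)."""
    shift %= len(seq)
    return seq[shift:] + seq[:shift]


def rotated_copies(seq, shift, n_copies):
    """Return n_copies versions of seq, each rotated `shift` bases further than the last."""
    return [rotate(seq, i * shift) for i in range(n_copies)]


if __name__ == "__main__":
    # Toy sequence; for a real genome, shift would be something like 10_000.
    for copy in rotated_copies("ACGTACGTAA", 3, 3):
        print(copy)
```

Each copy places the low-depth ends at a different point on the circle, so pooling reads from several copies would smooth out the dip rather than remove it exactly. Read coordinates would also need shifting back to the original origin.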
Concatenate two copies, then discard a read pair if it is wholly contained in the second copy. Voila!
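Here is a minimal sketch of that idea, under the assumptions of 0-based coordinates and a fixed insert length. The names `keep_pair`, `to_circular` and `circular_depth` are hypothetical helpers, not part of dwgsim. The filtering would be applied to dwgsim's output after simulating against the doubled reference.

```python
def keep_pair(start, genome_len):
    """Keep a pair unless its insert lies wholly within the second copy."""
    return start < genome_len


def to_circular(pos, genome_len):
    """Map a coordinate on the doubled reference back onto the original genome."""
    return pos % genome_len


def circular_depth(genome_len, insert_len):
    """Depth per base if every allowed insert start is used exactly once.

    Enumerates all insert starts on the doubled (2 * genome_len) linear
    reference, discards those wholly inside the second copy, and wraps
    the remaining coordinates back onto the original genome.
    """
    depth = [0] * genome_len
    for start in range(2 * genome_len - insert_len + 1):
        if not keep_pair(start, genome_len):
            continue
        for offset in range(insert_len):
            depth[to_circular(start + offset, genome_len)] += 1
    return depth


if __name__ == "__main__":
    print(circular_depth(20, 5))
```

With these assumptions every base ends up with the same depth, because each of the `genome_len` retained start positions contributes one insert that wraps around the origin where needed. Roughly half of the simulated pairs are discarded, so about twice as many would need to be generated to reach the target coverage.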
How can I make the beginning and end of the sequence have the same depth as the middle?
@AlsoATraveler you asked this as well in #81. Please read the above explanation. Locking the conversation as I believe the above answered it.
Read depth appears to be as low as 0 at the start and end of the sequence and increases moving away from the ends.
This does not seem to be representative of our real-world data. Perhaps this is because the genome is actually circular, so that reads wrap around the start/end?
Is it possible to simulate this behaviour with dwgsim?