Open ivargr opened 5 years ago
Hi @ivargr I did not do anything particularly clever when simulating. Mitty discards reads with 'N's in the reference sequence. We used hg19 which has 'N's at the start and 'N's at the Centromeres so those masked regions are discarded. The original file was from ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37/human_g1k_v37_decoy.fasta.gz
(It's available as a public file on the SBG platform).
However, if you want to play with this aspect (there are so many things to play with!):
filter-variants
before generate-reads
. To both programs you can pass a .bed
file that restricts where the reads are sampled from, achieving a similar effect. Thanks for the quick reply! That makes sense and answers my question.
Sorry, for re-opening this, but I have one more question. Would you be able to share the mitty command that you used to simulate reads in the "Fast and accurate genomic analyses using genome graphs" paper? I'm not really underestanding how errors are introduces into the reads when simulating. I assume mitty introduces "natural" sequencing-errors, like base pair substitutions? Can the error rate be specified when creating the read model or when running generate-reads
?
Edit: I found out I can use the corrupt-reads command to introduce errors into the reads, but I'm still unsure how I can specify the error rate (e.g. if I want one substitution per 100 base pair on average), and I'm curious on what error rate was used in the simulations in the paper.
Hi @ivargr for the paper, for the "normal" simulations we used the hiseq-2500-v1-pcr-free
built in model. Please read the Readme document for Mitty for a description of how read models are derived, what the built in read model characteristics are and how you can create your own read models. Thanks!
In your "Fast and accurate genomic analyses using genome graphs" paper you use Mitty to simulate reads, but only from regions without masked reference sequence (repetitive regions). I'm just wondering, is this possible to do directly with Mitty, e.g. by providing it with a hard-masked reference fasta file, or are there any ways to run Mitty telling it to not simulate reads from masked regions? Asking because I'm trying to reproduce the read simulations from your paper, and I don't find any details about how Mitty was used for the experiments shown in the paper.
In advance, thanks!