Excluding regions with masked reference sequence when simulating reads

sbg / Mitty

Seven Bridges Genomics aligner/caller debugging and analysis tools

Apache License 2.0

13 stars 2 forks source link

Excluding regions with masked reference sequence when simulating reads #10

Open ivargr opened 5 years ago

ivargr commented 5 years ago

In your "Fast and accurate genomic analyses using genome graphs" paper you use Mitty to simulate reads, but only from regions without masked reference sequence (repetitive regions). I'm just wondering, is this possible to do directly with Mitty, e.g. by providing it with a hard-masked reference fasta file, or are there any ways to run Mitty telling it to not simulate reads from masked regions? Asking because I'm trying to reproduce the read simulations from your paper, and I don't find any details about how Mitty was used for the experiments shown in the paper.

In advance, thanks!

ghost commented 5 years ago

Hi @ivargr I did not do anything particularly clever when simulating. Mitty discards reads with 'N's in the reference sequence. We used hg19 which has 'N's at the start and 'N's at the Centromeres so those masked regions are discarded. The original file was from ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37/human_g1k_v37_decoy.fasta.gz (It's available as a public file on the SBG platform).

However, if you want to play with this aspect (there are so many things to play with!):

Since Mitty discards reads from regions with 'N's in them, you could hard mask the FASTA with 'N's according to your own criteria
You should be running filter-variants before generate-reads. To both programs you can pass a .bed file that restricts where the reads are sampled from, achieving a similar effect.

ivargr commented 5 years ago

Thanks for the quick reply! That makes sense and answers my question.

ivargr commented 5 years ago

Sorry, for re-opening this, but I have one more question. Would you be able to share the mitty command that you used to simulate reads in the "Fast and accurate genomic analyses using genome graphs" paper? I'm not really underestanding how errors are introduces into the reads when simulating. I assume mitty introduces "natural" sequencing-errors, like base pair substitutions? Can the error rate be specified when creating the read model or when running generate-reads?

Edit: I found out I can use the corrupt-reads command to introduce errors into the reads, but I'm still unsure how I can specify the error rate (e.g. if I want one substitution per 100 base pair on average), and I'm curious on what error rate was used in the simulations in the paper.

ghost commented 5 years ago

Hi @ivargr for the paper, for the "normal" simulations we used the hiseq-2500-v1-pcr-free built in model. Please read the Readme document for Mitty for a description of how read models are derived, what the built in read model characteristics are and how you can create your own read models. Thanks!