Open aays opened 3 years ago
Some ART tasks before we can do the above:
Other tasks
We're going to use two parts of the ART program: art_illumina and ART_profiler_illumina. ART_profiler_illumina will take in the fastq files generated by sequencing and generate an error profile that we will use for art_illumina read generation.
Installation on Linux:
sudo apt-get install art-nextgen-simulation-tools
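Once it's installed, here is a rough sketch of how an art_illumina call could be put together from Python using the arguments discussed in this thread. The function name, the default gap and standard deviation values, and the file names are placeholders I made up for illustration; only the flags themselves come from art_illumina.

```python
import shlex
from typing import List


def build_art_command(
    ref_fasta: str,
    out_prefix: str,
    read_len: int = 250,        # 250bp reads, matching the MSv1 prototyping plan
    fold_coverage: float = 20,  # starting coverage; the plan is to go up to 400
    mean_gap: int = 100,        # hypothetical average gap between paired reads
    frag_sdev: float = 10,      # hypothetical fragment-size standard deviation
    seq_sys: str = "MSv1",      # built-in error profile
) -> List[str]:
    """Assemble (but do not run) an art_illumina paired-end command line."""
    # Mean fragment length: two reads plus the average gap between them.
    mean_frag_len = 2 * read_len + mean_gap
    return [
        "art_illumina",
        "-ss", seq_sys,            # built-in sequencing system profile
        "-i", ref_fasta,           # input FASTA to draw reads from
        "-p",                      # paired-end reads
        "-l", str(read_len),       # read length
        "-f", str(fold_coverage),  # fold coverage
        "-m", str(mean_frag_len),  # mean fragment length
        "-s", str(frag_sdev),      # fragment-size standard deviation
        "-M",                      # M-style CIGAR instead of =/X
        "-sam",                    # write SAM output
        "-na",                     # skip .aln output
        "-o", out_prefix,          # output file prefix
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected or pasted into a shell.
    print(shlex.join(build_art_command("parents.fa", "sim_reads")))
```

Building the argument list separately from running it makes it easy to loop over coverage values later and pass each list to subprocess.run.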
Relevant art_illumina arguments:
-f, --fcov vs -c, --rcount: we're going to use -f to indicate the amount of sequencing coverage for each base, starting with 20 and eventually going up to 400
-ir, -ir2, -dr, -dr2: TODO: look into figures for sequencing insertion and deletion rate of NovaSeq
-M, --cigarM: use this argument because we are following this instead of =/X format
-p, --paired vs -mp, --matepair: we're using these terms interchangeably in readcomb but paired is the correct one and matepair is a different technology
-sam, --samout: script outputs sam files if you input sam files
-l, --len and -m, --mflen: -l indicates the length of single reads and -m is 2 times length plus average size of gap between paired reads
-ss, --seqSys vs -1, --qprof1 and -2, --qprof2: -ss uses a builtin sequencing error profile; we're going to use MSv1 for our 250bp reads prototyping and then -1/-2 once we generate our own profile (note that -s, --sdev is the standard deviation of the fragment size, not a profile argument)
-na, --noALN: art_illumina doesn't output .aln files and only fastq and sam files
Two first pass tests:
9.15 cM/Mb = 0.01 crossovers / ~109289.6 bases = 9.15e-8 crossovers/base (1 centimorgan roughly corresponds to the number of bases per 0.01 crossovers)
We got the correct number, just using incorrect calculations
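For anyone who wants to double check the conversion, here is a small sketch of the arithmetic. The function names are mine; the only inputs are the 9.15 cM/Mb figure and the definition of a centimorgan as 0.01 expected crossovers.

```python
CROSSOVERS_PER_CM = 0.01   # 1 centimorgan = 0.01 expected crossovers
BASES_PER_MB = 1_000_000   # 1 megabase


def crossovers_per_base(cm_per_mb: float) -> float:
    """Convert a rate in cM/Mb to expected crossovers per base."""
    return cm_per_mb * CROSSOVERS_PER_CM / BASES_PER_MB


def bases_per_centimorgan(cm_per_mb: float) -> float:
    """Number of bases spanned by one centimorgan at the given rate."""
    return BASES_PER_MB / cm_per_mb


rate = crossovers_per_base(9.15)
span = bases_per_centimorgan(9.15)
print(f"{rate:.3g} crossovers/base")
print(f"{span:.1f} bases per cM")
```

Both routes agree: dividing 0.01 crossovers by the bases per centimorgan gives the same per-base rate as scaling 9.15 cM/Mb directly.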
We need to do a 'true' power analysis for readcomb - we know it detects phase changes, but how often do we expect it to actually catch them in real data?
We've looked into ART (download page) as an existing accurate read simulator that also incorporates sequencing errors, but need to figure out a way to create recombinant haplotypes to draw read pairs from.
Here's what I'm thinking:
A vcf2fasta script to create a FASTA with a sample chromosome segment for our two parents of interest. This will just be a full FASTA (that can be used as input to ART) which has the SNPs for each parent already incorporated, e.g. [...] and then write that combined sequence to file.
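To make the "combined sequence" idea concrete, here is a minimal sketch of one way it could work. It assumes the two parental sequences are the same length (true if only SNPs were incorporated) and that a recombinant is parent 1 up to a random breakpoint followed by parent 2 after it. Both function names and the single-crossover model are my assumptions, not something settled in this thread.

```python
import random
from typing import Optional, Tuple


def make_recombinant(
    parent1: str, parent2: str, rng: Optional[random.Random] = None
) -> Tuple[str, int]:
    """Join parent1 and parent2 at one random breakpoint.

    Returns the recombinant sequence and the breakpoint position, so the
    known phase change can be compared against readcomb's calls later.
    """
    if len(parent1) != len(parent2):
        raise ValueError("parent sequences must be the same length (SNPs only)")
    if len(parent1) < 2:
        raise ValueError("sequences must be at least 2 bases long")
    rng = rng or random.Random()
    # Breakpoint is chosen so both parents contribute at least one base.
    breakpoint_pos = rng.randint(1, len(parent1) - 1)
    return parent1[:breakpoint_pos] + parent2[breakpoint_pos:], breakpoint_pos


def write_fasta(path: str, name: str, seq: str, width: int = 60) -> None:
    """Write a single-record FASTA file with wrapped sequence lines."""
    with open(path, "w") as handle:
        handle.write(f">{name}\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")
```

Returning the breakpoint alongside the sequence matters here, since the simulated truth is what any later comparison would be scored against.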
Once this has been figured out, the next step will be finding a way to scale this up so that we basically do this for n (where n = ~ 400) sequences at a time and see if we can use the recombination events to calculate a recombination rate, after which we see how close to 0.004 we got.
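As a sketch of the rate calculation step: given a count of detected recombination events for each simulated sequence, the estimate is just events over opportunities. The thread doesn't say whether the 0.004 target is per sequence or per base, so this hypothetical helper returns both; the function name and the example segment length are made up.

```python
from typing import Sequence, Tuple


def estimate_recombination_rate(
    events_per_sequence: Sequence[int], segment_length: int
) -> Tuple[float, float]:
    """Return (events per sequence, events per base) across all sequences."""
    n = len(events_per_sequence)
    if n == 0:
        raise ValueError("need at least one sequence")
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    total_events = sum(events_per_sequence)
    per_sequence = total_events / n
    per_base = total_events / (n * segment_length)
    return per_sequence, per_base


if __name__ == "__main__":
    # Made-up example: 400 sequences of 100 kb, two of which carry one event.
    counts = [0] * 398 + [1, 1]
    per_seq, per_base = estimate_recombination_rate(counts, 100_000)
    print(per_seq, per_base)
```

With only about 400 sequences and a low rate, the estimate rests on a handful of events, so it will be noisy from run to run.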
A true 'power analysis' would involve us keeping track of the exact number of false negatives (i.e. known phase changes that we miss when we run readcomb, likely due to lack of SNP resolution), but I haven't thought far enough about how exactly to do that.
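One possible starting point, sketched under assumptions: if each simulated read pair has an ID, we know from the simulation which IDs truly span a phase change, and readcomb's output can be reduced to the set of IDs it flagged, then false negatives and power fall out of simple set arithmetic. The function name and the ID-based bookkeeping are hypothetical; readcomb's real output would need to be mapped into this form.

```python
from typing import Dict, Iterable


def power_summary(true_ids: Iterable[str], called_ids: Iterable[str]) -> Dict[str, float]:
    """Compare known phase-change read pairs against the ones that were called."""
    truth = set(true_ids)
    called = set(called_ids)
    true_positives = truth & called    # known phase changes that were caught
    false_negatives = truth - called   # known phase changes that were missed
    false_positives = called - truth   # calls with no simulated phase change
    # Power (sensitivity) is the fraction of known phase changes detected.
    power = len(true_positives) / len(truth) if truth else float("nan")
    return {
        "true_positives": len(true_positives),
        "false_negatives": len(false_negatives),
        "false_positives": len(false_positives),
        "power": power,
    }


if __name__ == "__main__":
    # Made-up IDs purely for illustration.
    print(power_summary({"pair1", "pair2", "pair3", "pair4"}, {"pair1", "pair2", "pairX"}))
```

Repeating this across coverage levels and SNP densities would show where detection drops off, which is the part a single rate estimate can't tell us.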