Power analysis - Githubissues

aays commented 3 years ago

We need to do a 'true' power analysis for readcomb - we know it detects phase changes, but how often do we expect it to actually catch them in real data?

We've looked into ART (download page) as an existing accurate read simulator that also incorporates sequencing errors, but need to figure out a way to create recombinant haplotypes to draw read pairs from.

Here's what I'm thinking:

Use Rob's vcf2fasta script to create a FASTA with a sample chromosome segment for our two parents of interest. This will just be a full FASTA (that can be used as input to ART) which has the SNPs for each parent already incorporated - e.g.

>CC2935_chr1:1-100000
AGCGTACGTCGTACGTCAGTAGCAGCTATGC...
>CC2936_chr1:1-100000
AGCGTACGTCGTACGTCAGTAGCAGCTATGC...

Write code that creates recombinant haplotypes based on some known per bp probability of recombination c, which we'll define as just a switch in haplotype phase. I'm sure there's some intelligent way to draw from a distribution to do this, but the brute force way would be just to start from the first base and draw for an event that has a 0.004 (out of 1) chance of happening. If the event happens, we mark that position - let's say position 700. Then we create a recombinant sequence as follows:

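import random
from Bio import SeqIO

c = 0.004  # per-bp probability of a phase change
# to_dict takes an iterator of records, so parse the FASTA first
records = SeqIO.to_dict(SeqIO.parse('parents_chr1_segment.fa', 'fasta'))

# keys are the full FASTA IDs, not just the strain names
parent1 = records['CC2935_chr1:1-100000'].seq
parent2 = records['CC2936_chr1:1-100000'].seq

# walk along the segment and stop at the first phase change
phase_change_pos = None
for i in range(len(parent1)):
    if random.random() < c:
        phase_change_pos = i # 700, in this example
        break

if phase_change_pos is None:
    raise RuntimeError('no phase change drawn - rerun')

# Seq objects slice and concatenate like strings
recombinant_seq = parent1[:phase_change_pos] + parent2[phase_change_pos:]

# the dirtiest method would be this
with open('art_input.fa', 'w') as f:
    f.write('>art_input,pos={}\n'.format(phase_change_pos))
    f.write(str(recombinant_seq) + '\n')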

and then write that combined sequence to file.

Use ART to draw reads from that combined sequence.
Run readcomb on the reads and see if we can detect the recombination event.
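On the "intelligent way to draw from a distribution": the distance between successive per-base events with probability c is geometrically distributed, so the gaps can be drawn directly instead of testing every base. The following is only a sketch under that assumption; `draw_breakpoints` and `make_recombinant` are hypothetical helper names, and plain strings stand in for the parental sequences.

```python
import math
import random


def draw_breakpoints(seq_len, c, rng=random):
    """Return sorted 0-based positions where the haplotype phase switches.

    Gaps between events are drawn from a geometric distribution by
    inverse-transform sampling, which matches a Bernoulli(c) trial at
    every base but needs only one random draw per event.
    """
    if not 0.0 < c < 1.0:
        raise ValueError("c must be strictly between 0 and 1")
    log_q = math.log(1.0 - c)
    breakpoints = []
    pos = 0
    while True:
        u = 1.0 - rng.random()  # in (0, 1], so log(u) is defined
        pos += int(math.log(u) / log_q) + 1  # geometric gap, always >= 1
        if pos >= seq_len:
            return breakpoints
        breakpoints.append(pos)


def make_recombinant(parent1, parent2, breakpoints):
    """Alternate between two equal-length parents at each breakpoint."""
    parents = (parent1, parent2)
    pieces = []
    start = 0
    current = 0  # index of the parent currently being copied
    for bp in list(breakpoints) + [len(parent1)]:
        pieces.append(parents[current][start:bp])
        start = bp
        current = 1 - current
    return "".join(pieces)


# Toy usage: homopolymer "parents" make the phase switches easy to see.
rng = random.Random(0)
breakpoints = draw_breakpoints(100_000, 0.004, rng)
recombinant = make_recombinant("A" * 100_000, "C" * 100_000, breakpoints)
print(len(breakpoints), len(recombinant))
```

Unlike the single-event example above, this allows several phase changes per haplotype, which is what a per-bp rate implies over a 100 kb segment.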

Once this has been figured out, the next step will be finding a way to scale this up so that we basically do this for n (where n = ~ 400) sequences at a time and see if we can use the recombination events to calculate a recombination rate, after which we see how close to 0.004 we got.
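A rough sketch of that final comparison: pool the event counts across the n sequences and divide by the total number of bases. Here the counts are simulated directly, purely to show the arithmetic; in the real analysis they would come from readcomb's detections. `count_events` and `estimate_rate` are hypothetical names, and the standard error uses a Poisson approximation.

```python
import math
import random


def count_events(seq_len, c, rng):
    """Count phase changes in one simulated haplotype (geometric gaps)."""
    log_q = math.log(1.0 - c)
    count, pos = 0, 0
    while True:
        pos += int(math.log(1.0 - rng.random()) / log_q) + 1
        if pos >= seq_len:
            return count
        count += 1


def estimate_rate(event_counts, seq_len):
    """Pooled per-bp rate estimate and its approximate standard error."""
    total_events = sum(event_counts)
    total_bases = len(event_counts) * seq_len
    c_hat = total_events / total_bases
    std_err = math.sqrt(total_events) / total_bases  # Poisson approximation
    return c_hat, std_err


rng = random.Random(42)
counts = [count_events(100_000, 0.004, rng) for _ in range(400)]
c_hat, std_err = estimate_rate(counts, 100_000)
print("estimated c = {:.5f} +/- {:.5f}".format(c_hat, std_err))
```

With perfect detection the estimate should sit very close to 0.004; any shortfall once readcomb is in the loop reflects events it failed to catch.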

A true 'power analysis' would involve us keeping track of the exact number of false negatives (i.e. known phase changes that we miss when we run readcomb, likely due to lack of SNP resolution), but I haven't thought far enough about how exactly to do that.
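One possible way to do the bookkeeping, assuming the known and detected phase changes can both be reduced to integer positions (how readcomb actually reports them is not specified here): match each known position to at most one detection within some tolerance and count what is left over. `score_detection` and the `tolerance` parameter are hypothetical.

```python
def score_detection(known, detected, tolerance=0):
    """Greedily match known phase-change positions to detected ones.

    A known position counts as found if an unused detected position
    lies within `tolerance` bp of it. Unmatched known positions are
    false negatives; unmatched detections are false positives.
    """
    remaining = sorted(detected)
    true_pos = 0
    for k in sorted(known):
        hit = next((d for d in remaining if abs(d - k) <= tolerance), None)
        if hit is not None:
            remaining.remove(hit)
            true_pos += 1
    false_neg = len(known) - true_pos
    power = true_pos / len(known) if known else float("nan")
    return {
        "true_pos": true_pos,
        "false_neg": false_neg,
        "false_pos": len(remaining),
        "power": power,
    }


# Toy positions, made up for illustration only.
result = score_detection([700, 5200, 9100], [650, 9150, 12000], tolerance=100)
print(result)
```

In this toy case two of the three known events are recovered, one is missed, and one detection has no known counterpart. The tolerance matters because a phase change can only be localized to the interval between two flanking SNPs.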

aays commented 3 years ago

Some ART tasks before we can do the above:

[x] Figure out how to create 2 x 250 reads
[ ] What insert sizes are common in 2 x 250 read datasets? Need to look at existing 2 x 250 bp data
[x] Figure out how they simulate sequencing errors and update with 2 x 250 (currently only supports up to 50 bp reads... woof)

aays commented 3 years ago

Other tasks

figure out 'fold of read coverage to be simulated for each amplicon'
figure out insertion and deletion rates to provide to ART (look into NovaSeq indel rates) - check original ART paper as well
install ART on server as well (use apt-get or equivalent)

jiyvliu commented 3 years ago

We're going to use two parts of the ART program: art_illumina and ART_profiler_illumina. ART_profiler_illumina will take in the fastq files generated by sequencing and generate an error profile that we will use for art_illumina read generation.

Installation on Linux:

sudo apt-get install art-nextgen-simulation-tools

Relevant art_illumina arguments:

-f, --fcov vs -c, --rcount: we're going to use -f to indicate the amount of sequencing coverage for each base starting with 20 and eventually going up to 400
-ir, -ir2, -dr, -dr2: TODO: look into figures for sequencing insertion and deletion rate of NovaSeq
-M, --cigarM: use this argument because we are following the M (match/mismatch) CIGAR format instead of the =/X format
-p, --paired vs -mp, --matepair: we're using these terms interchangeably in readcomb but paired is the correct one and matepair is a different technology
-sam, --samout: art_illumina outputs a SAM alignment file alongside the fastq files
-l, --len and -m, --mflen: -l indicates the length of single reads and -m is the mean fragment length, i.e. 2 times the read length plus the average size of the gap between paired reads
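-ss, --seqSys vs -1, --qprof1 and -2, --qprof2: -ss uses a builtin sequencing error profile, we're going to use MSv1 for our 250bp reads prototyping and then -1/-2 once we generate our own profiles (-s, --sdev is the standard deviation of the fragment size, which goes alongside -m for paired reads)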
-na, --noALN: with this flag art_illumina skips the .aln output and only writes fastq and sam files
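Putting these flags together, a command for the first prototype run might be assembled as below. This is a sketch, not a tested invocation: the file names, the mean fragment length of 600 and the fragment standard deviation of 50 are placeholder assumptions (the insert size question is still open), and `build_art_command` is a hypothetical helper.

```python
import shlex


def build_art_command(fasta, out_prefix, coverage, read_len=250,
                      mean_frag=600, sd_frag=50, seq_sys="MSv1"):
    """Assemble an art_illumina paired-end command as an argument list."""
    return [
        "art_illumina",
        "-ss", seq_sys,         # builtin error profile (MSv1 for 250 bp reads)
        "-i", fasta,            # input FASTA, e.g. the recombinant haplotype
        "-p",                   # paired-end reads
        "-l", str(read_len),    # length of each read
        "-f", str(coverage),    # fold coverage
        "-m", str(mean_frag),   # mean fragment length (assumed value)
        "-s", str(sd_frag),     # fragment length standard deviation (assumed value)
        "-M",                   # use M in CIGAR strings instead of =/X
        "-sam",                 # also write a SAM file
        "-na",                  # skip the .aln output
        "-o", out_prefix,       # prefix for output files
    ]


cmd = build_art_command("art_input.fa", "sim_reads", coverage=20)
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would launch the simulation, assuming art_illumina is on the PATH.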

aays commented 3 years ago

Two first pass tests:

with coverage set to 400, compare detected phase changes (no need to classify just yet) against known phase changes
with coverage set to 1, compare detected phase changes against known phase changes
each of these should be done at least 3 times just to see variation between runs - especially the second!
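Some back-of-envelope arithmetic on why the coverage 1 runs should vary so much more: fold coverage fixes the number of read pairs, and from that the expected number of fragments spanning any single phase change. The sketch below assumes a 100 kb segment, 2 x 250 reads and a mean fragment length of 600 bp (the fragment length is a placeholder), and ignores edge effects.

```python
def expected_read_pairs(coverage, seq_len, read_len=250):
    """Read pairs needed so that total read bases = coverage * seq_len."""
    return coverage * seq_len / (2 * read_len)


def expected_spanning_fragments(coverage, mean_frag, read_len=250):
    """Expected fragments whose span covers one given position."""
    # pairs * mean_frag / seq_len, with seq_len cancelling out
    return coverage * mean_frag / (2 * read_len)


for cov in (1, 400):
    pairs = expected_read_pairs(cov, 100_000)
    spanning = expected_spanning_fragments(cov, 600)
    print("coverage {}: {:.0f} read pairs, {:.1f} fragments per phase change".format(
        cov, pairs, spanning))
```

Under these assumptions, coverage 1 gives about 200 read pairs and only around one fragment overlapping each phase change, so many events will be missed by chance alone, while coverage 400 gives several hundred fragments per event.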

jiyvliu commented 3 years ago

9.15 cM/Mb = 0.01 crossovers / ~109289.6 bases = 9.15e-8 crossovers / base (1 centimorgan is the genetic distance over which we expect 0.01 crossovers, so at 9.15 cM/Mb it spans roughly 109289.6 bases)

We got the correct number, but only by using incorrect calculations
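For reference, the conversion written out step by step. The function names are made up for illustration; the only inputs are the definition of a centimorgan (0.01 expected crossovers) and 1 Mb = 1e6 bases.

```python
import math


def cm_per_mb_to_per_bp(rate_cm_per_mb):
    """Convert a recombination rate in cM/Mb to crossovers per base."""
    crossovers_per_mb = rate_cm_per_mb * 0.01  # 1 cM = 0.01 crossovers
    return crossovers_per_mb / 1_000_000       # 1 Mb = 1e6 bases


def bases_per_cm(rate_cm_per_mb):
    """Physical distance in bases that corresponds to 1 cM at this rate."""
    return 1_000_000 / rate_cm_per_mb


rate = 9.15
print("{:.3g} crossovers per base".format(cm_per_mb_to_per_bp(rate)))
print("{:.1f} bases per cM".format(bases_per_cm(rate)))
```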

ness-lab / recombinant-reads

Power analysis #18