ness-lab / recombinant-reads

1 stars 0 forks source link

Power analysis #18

Open aays opened 3 years ago

aays commented 3 years ago

We need to do a 'true' power analysis for readcomb - we know it detects phase changes, but how often do we expect it to actually catch them in real data?

We've looked into ART (download page) as an existing accurate read simulator that also incorporates sequencing errors, but need to figure out a way to create recombinant haplotypes to draw read pairs from.

Here's what I'm thinking:

  1. Use Rob's vcf2fasta script to create a FASTA with a sample chromosome segment for our two parents of interest. This will just be a full FASTA (that can be used as input to ART) which has the SNPs for each parent already incorporated - e.g.
>CC2935_chr1:1-100000
AGCGTACGTCGTACGTCAGTAGCAGCTATGC...
>CC2936_chr1:1-100000
AGCGTACGTCGTACGTCAGTAGCAGCTATGC...
  1. Write code that creates recombinant haplotypes based on some known per bp probability of recombination c, which we'll define as just a switch in haplotype phase. I'm sure there's some intelligent way to draw from a distribution to do this, but the brute force way would be just to start from the first base and draw for an event that has a 0.004 (out of 1) chance of happening. If the event happens, we mark that position - let's say position 700. Then we create a recombinant sequence as follows:
import random
from Bio import SeqIO

c = 0.004
records = SeqIO.to_dict('parents_chr1_segment.fa', 'fasta') 

for i in range(1e6):
    if random.random() < c:
        phase_change_pos = i # 700, in this example

# I forget how exactly to_dict works but something like this - might have to coerce to string
recombinant_seq = records['CC2935'][0:phase_change_pos] + records['CC2936'][phase_change_pos:]

# the dirtiest method would be this
with open('art_input.fa', 'w') as f:
    f.write('>art_input,pos={}'.format(phase_change_pos) + '\n')
    f.write(str(recombinant_seq)) # might need another \n at the end?

and then write that combined sequence to file.

  1. Use ART to draw reads from that combined sequence.
  2. Run readcomb on the reads and see if we can detect the recombination event.

Once this has been figured out, the next step will be finding a way to scale this up so that we basically do this for n (where n = ~ 400) sequences at a time and see if we can use the recombination events to calculate a recombination rate, after which we see how close to 0.004 we got.

A true 'power analysis' would involve us keeping track of the exact number of false negatives (e.g. known phase changes) than we miss when we run readcomb (likely due to lack of SNP resolution) but I haven't thought far enough about how exactly to do that.

aays commented 3 years ago

Some ART tasks before we can do the above:

aays commented 3 years ago

Other tasks

jiyvliu commented 3 years ago

We're going to use two parts of the ART program: art_illumina and ART_profiler_illumina. ART_profiler_illumina will take in the fastq files generated by sequencing and generate an error profile that we will use for art_illumina read generation.

Installation on Linux:

Relevant art_illumina arguements:

aays commented 3 years ago

Two first pass tests:

jiyvliu commented 3 years ago

9.15cM/Mb = 0.01 crossovers/ ~109289.6 bases = 9.15e-8 crossovers/ base (1 centimorgan roughly corresponds to number of bases per 0.01 crossovers)

We got the correct number but just using incorrect calulations