mevers / NCI_ChIP-seq

Systematic optimization of parameters for ChIP-Seq peak calling algorithms using simulated short-read sequencing data
0 stars 1 forks source link

Paired-end reads #3

Open mevers opened 7 years ago

mevers commented 7 years ago

We simulate paired-end reads in the following way:

  1. Genomic start positions of fragments are generated using ChIPsim::sampleReads, where fragment start positions are sampled based on probabilities calculated from ChIPsim::bindDens2readDens. This is part of the standard ChIPsim workflow as described in the ChIPsim vignette, with the only difference that we refer to these positions as the fragment start positions, rather than the read start positions. Start positions are given for the positive and negative strand separately, and correspond to the 5' start of the fragment.

  2. For every fragment start position, we sample a fragment length from the fragment length distribution that was used to calculate the read probability density ChIPsim::bindDens2readDens(dens, fragLength, ...). We now have for every fragment on each strand a 5' start position and a fragment length.

  3. We then extract a left and right read sequence of length readLength for every fragment on every strand. We need to take care of the orientation and strand-origin of the left and right read pair.

    # For fragments from the positive strand
    #                                read2
    #                             <xxxxxx|
    # rev = 3' oooooooooooooooooooooooooooooooooooooooooooooo 5'
    # fragment    |----------------------|
    # fragstart   x
    # fwd = 5' oooooooooooooooooooooooooooooooooooooooooooooo 3'
    #             |xxxxxx>
    #             read1
    # For fragments from the negative strand
    #                                read1
    #                             <xxxxxx|
    # rev = 3' oooooooooooooooooooooooooooooooooooooooooooooo 5'
    # fragment    |----------------------|
    # fragstart                          x
    # fwd = 5' oooooooooooooooooooooooooooooooooooooooooooooo 3'
    #             |xxxxxx>
    #             read2
  4. We assume a uniform quality distribution (defined in function randomQualityPhred33 of scripts/simulate_reads.R), and write left and right reads to two gzipped fastq files using file1 <- gzfile(...) and cat(..., file = file1). This is done in real-time as we extract read sequences, to avoid storing the full set of reads in memory first and then writing it to two files.