ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.
Other
38 stars 12 forks source link

Potential Bug in compute_fraglen.py - Incorrect fraglength distribution #36

Closed mrmilhaven closed 2 years ago

mrmilhaven commented 2 years ago

Describe the bug Fragment Length Distribution is incorrect, even when using NEAT's gen_reads.py

To Reproduce

  1. Create NEAT dataset with fragment length distribution as mean = 300 and SD = 10 (test.fa is 1Mb of random nucleotides, no Ns): python ~/NEAT/gen_reads.py -R 100 -r test.fa -c 10 --pe 300 10 -o NEAT_test

2.Map reads (I used BWA): bwa mem -M test.fa NEAT_test_read1.fq.gz NEAT_test_read2.fq.gz > NEAT_test.bam

  1. Compute fragment distribution as sanity check: python ~/NEAT/utilities/compute_fraglen.py -i NEAT_test.bam -o NEAT_test

  2. open pickle file: import pandas pandas.read_pickle("NEAT_test.p") [[100], [1.0]]

Expected behavior I would expect the same distribution that I gave gen_reads.py (mean of 300 with SD of 10) to come back out.

Potential Reason I think the bug has to do with the count_frags method: First 9 fields in BAM from BAM file: NEAT_test-test-1 99 test 9039 60 100M = 9254 315

First 9 fields of BAM from script(adding print line within count_frags for loop, approx. line 96): NEAT_test-test-1 99 0 9038 60 100M 0 9253 100

I believe that converting from an iterable pysam type to a string (line 95) causes some of the fields to change. This also implies that , since mate mapping(field 8) is always equal to the reference (field 3), every read will look like it was mate mapped, per your code (line 108).

I think this can be fixed by using pysams inherent parameters, such as item.mate_is_unmapped or item.template_length instead of splitting the line by tabs.

joshfactorial commented 2 years ago

I'm examining this code now. I see what you mean about the template length being different in pysam. I will work on this.

joshfactorial commented 2 years ago

I've made 2 changes that should improve this. One is I updated the code, as you suggested, to use pysam's template_length attribute. Second is I made the minimum read count to consider a read an input variable. It was previously hardcoded as 100. I set the default to 1 and you can adjust as needed. Add the flag '-m 100' to mimic previous behavior with this program.