Closed mrmilhaven closed 2 years ago
I'm examining this code now. I see what you mean about the template length being different in pysam. I will work on this.
I've made 2 changes that should improve this. One is I updated the code, as you suggested, to use pysam's template_length attribute. Second is I made the minimum read count to consider a read an input variable. It was previously hardcoded as 100. I set the default to 1 and you can adjust as needed. Add the flag '-m 100' to mimic previous behavior with this program.
Describe the bug Fragment Length Distribution is incorrect, even when using NEAT's gen_reads.py
To Reproduce
2.Map reads (I used BWA): bwa mem -M test.fa NEAT_test_read1.fq.gz NEAT_test_read2.fq.gz > NEAT_test.bam
Compute fragment distribution as sanity check: python ~/NEAT/utilities/compute_fraglen.py -i NEAT_test.bam -o NEAT_test
open pickle file: import pandas pandas.read_pickle("NEAT_test.p") [[100], [1.0]]
Expected behavior I would expect the same distribution that I gave gen_reads.py (mean of 300 with SD of 10) to come back out.
Potential Reason I think the bug has to do with the count_frags method: First 9 fields in BAM from BAM file: NEAT_test-test-1 99 test 9039 60 100M = 9254 315
First 9 fields of BAM from script(adding print line within count_frags for loop, approx. line 96): NEAT_test-test-1 99 0 9038 60 100M 0 9253 100
I believe that converting from an iterable pysam type to a string (line 95) causes some of the fields to change. This also implies that , since mate mapping(field 8) is always equal to the reference (field 3), every read will look like it was mate mapped, per your code (line 108).
I think this can be fixed by using pysams inherent parameters, such as item.mate_is_unmapped or item.template_length instead of splitting the line by tabs.