Excessive coverage bias

dani-ture commented 3 months ago

Describe the bug

Reads do not seem to be distributed uniformly across the genome. I expected some bias, but there are long regions with coverage = 0 and others with coverage = 150, even though the neat read-simulator was run with the default value of coverage = 10.

To Reproduce

Steps to reproduce the behavior (the early steps are similar to what I described in issue #108):

Download the E. coli NCBI RefSeq assembly from the following link: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000005845.2/
Make a copy of the provided template config file (I called it test_config.yml) and set the following parameters, leaving the rest with the “.” as default:
    reference: GCF_000005845.2_ASM584v2_genomic.fna
    ploidy: 1
    rng_seed: 6386514007882411
Run neat on the command line: neat --log-name test --log-detail HIGH --log-level DEBUG read-simulator -c test_config.yml -o test
I ran a variant calling pipeline with the following steps: checking read quality (with fastqc), mapping the reads in the fastq.gz file to the reference (with bwa-mem2) to get a bam file, sorting and indexing it (with samtools), and calling the variants (with bcftools).
I opened the .sort.bam file in IGV (Integrative Genomics Viewer) to see how the reads mapped to the reference.
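To put numbers on what IGV shows, the per-base depth of the sorted bam file can be summarized. The following is a minimal sketch, assuming the depth was exported as tab-separated text with `samtools depth -a` (contig, position, depth per line). The function name `summarize_depth` and the example records are made up for illustration and are not data from this issue.

```python
import io
import statistics


def summarize_depth(lines):
    """Summarize per-base depth from `samtools depth -a` style lines.

    Each line is expected to be tab-separated: contig, 1-based position, depth.
    """
    depths = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _contig, _pos, depth = line.split("\t")[:3]
        depths.append(int(depth))
    if not depths:
        raise ValueError("no depth records found")
    return {
        "positions": len(depths),
        "mean": statistics.fmean(depths),
        "max": max(depths),
        "zero_fraction": sum(d == 0 for d in depths) / len(depths),
    }


# Tiny made-up example, not data from the issue.
example = io.StringIO(
    "chr\t1\t0\n"
    "chr\t2\t0\n"
    "chr\t3\t150\n"
    "chr\t4\t10\n"
)
summary = summarize_depth(example)
print(summary)
```

With a real file, the same function could be called on an open file handle, for example `summarize_depth(open("depth.tsv"))`, where `depth.tsv` is a hypothetical file name.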

Expected behavior

I expected to see more or less evenly distributed reads.
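As a rough baseline for what "evenly distributed" could look like, here is a small sketch that places reads uniformly at random on a toy genome and measures the resulting depth. It is not how NEAT samples reads. The genome length, the seed and the function name `simulate_uniform_coverage` are assumptions chosen for illustration. Under this idealized model the fraction of uncovered bases is approximately exp(-coverage), so at coverage 10 long zero-depth regions would be very unlikely.

```python
import math
import random


def simulate_uniform_coverage(genome_len, read_len, coverage, seed=0):
    """Place reads uniformly at random and return the per-base depth."""
    rng = random.Random(seed)
    n_reads = round(coverage * genome_len / read_len)
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_len + 1)
    for _ in range(n_reads):
        start = rng.randrange(genome_len - read_len + 1)
        diff[start] += 1
        diff[start + read_len] -= 1
    # A running sum of the difference array gives the depth at each position.
    depths = []
    running = 0
    for delta in diff[:genome_len]:
        running += delta
        depths.append(running)
    return depths


depths = simulate_uniform_coverage(genome_len=200_000, read_len=151, coverage=10)
mean_depth = sum(depths) / len(depths)
zero_fraction = depths.count(0) / len(depths)
print(f"mean depth: {mean_depth:.2f}")
print(f"max depth: {max(depths)}")
print(f"zero-depth fraction: {zero_fraction:.5f}")
print(f"exp(-coverage) for comparison: {math.exp(-10):.5f}")
```

Comparing these numbers with the depth of the simulated bam file gives a sense of how far the observed clusters are from a uniform placement.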

Additional comments

Interestingly, the clusters of reads in high-coverage regions reach a coverage of up to 151 (the default read length).
I tried to run neat again with no_coverage_bias: true, but it seemed to have no influence on the output.
I ran it again changing the coverage to 11 and then to 12 (higher values like 20 took too much time to run) and I got similar outputs, although the clusters appear at different positions in the genome.
I ran it again setting the coverage to 1, and the reads seem to be distributed more uniformly in this case.
I ran it again with a .bed file targeting some regions; targeted regions that had a coverage of 0 in the original run still have a coverage of 0 in the targeted run.
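To compare the empty regions between runs, or against the targets in the .bed file, the zero-coverage stretches can be extracted as intervals. The sketch below assumes a list of per-base depths for one contig. The function name `zero_coverage_intervals`, the contig label and the example depths are hypothetical. Intervals are 0-based and half-open, as in BED.

```python
def zero_coverage_intervals(depths, min_len=1):
    """Return half-open (start, end) intervals, 0-based, where depth is 0."""
    intervals = []
    run_start = None
    for pos, depth in enumerate(depths):
        if depth == 0:
            if run_start is None:
                run_start = pos  # a new zero-depth run begins here
        elif run_start is not None:
            if pos - run_start >= min_len:
                intervals.append((run_start, pos))
            run_start = None
    # Close a run that extends to the end of the contig.
    if run_start is not None and len(depths) - run_start >= min_len:
        intervals.append((run_start, len(depths)))
    return intervals


# Made-up depths, not data from the issue.
example_depths = [0, 0, 0, 12, 150, 0, 9, 0, 0]
gaps = zero_coverage_intervals(example_depths, min_len=2)
for start, end in gaps:
    print(f"contig\t{start}\t{end}")
```

Raising `min_len` filters out short gaps, so only the long empty regions remain.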

Desktop (please complete the following information):

OS: Linux
Browser: Chrome
Version: 4.2.1

Additional context

I attached a video recording of what I see in IGV (sorry that it's a bit laggy). I uploaded two tracks: the upper one uses the default parameters and the lower one uses a coverage of 1.
If you want the pipeline I used (.sh file) or any of the generated files (bam, bai, vcf...) please let me know and I'll send them to you.

https://github.com/ncsa/NEAT/assets/131826966/8be5d905-737e-459e-953f-5b8c9137b78f

joshfactorial commented 3 months ago

I found a couple little bugs in one of the coverage functions, but I will double check to make sure I don't have a read_len variable where I meant to put coverage.
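For illustration only, here is why a read_len/coverage mix-up would be worth checking. This is not NEAT's code. It is a hypothetical sketch of the usual relation between read count and mean depth, with a rough E. coli genome size assumed. If the read length were passed where the coverage belongs, the mean depth would equal the read length, which is the 151 value reported above. It does not by itself explain the clustering.

```python
def reads_needed(genome_len, read_len, coverage):
    """Number of reads so that reads * read_len / genome_len is about coverage."""
    return round(coverage * genome_len / read_len)


genome_len = 4_600_000  # rough E. coli genome size, for illustration only
read_len = 151          # default read length mentioned in the issue
coverage = 10           # default coverage mentioned in the issue

intended = reads_needed(genome_len, read_len, coverage)
# Hypothetical mix-up: read_len passed in place of coverage.
mixed_up = reads_needed(genome_len, read_len, read_len)

print("reads for intended coverage:", intended)
print("reads with the mix-up:", mixed_up)
print("mean depth with the mix-up:", mixed_up * read_len / genome_len)
```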

dani-ture commented 3 months ago

Thank you for the fast answer.

joshfactorial commented 3 months ago

I believe this is fixed. Please reopen this ticket if the issue persists.

Excessive coverage bias #113