lindenb / jvarkit

Java utilities for Bioinformatics
https://jvarkit.readthedocs.io/
Other
478 stars 132 forks source link

biostar154220 removes too many reads when using with a bed file #200

Closed sbrohee closed 2 years ago

sbrohee commented 2 years ago

Subject of the issue

Hello. When using the great tool biostar154220 (allowing to remove reads above a given coverage) with option --region and a bed file, I obtain a really small BAM (~about 100 reads while I start from more than 1 million reads) and aim a max coverage of 20.

I found a workaround by piping the results into samtools (so I am not in such a hurry) but am I missing something?

Thanks for this great tool by the way.

Your environment

Steps to reproduce

Using biostar154220 java -jar dist/sortsamrefname.jar --samoutputformat bam mybam.bam | java -jar dist/biostar154220.jar -n 600 --keep-unmapped --samoutputformat bam --regions mybed.bed | samtools view | wc

...
    96    1634   46131

Using samtools java -jar dist/sortsamrefname.jar --samoutputformat bam mybam.bam | java -jar dist/biostar154220.jar -n 600 --keep-unmapped --samoutputformat bam| samtools view -L mybed.bed | wc

...
 651918 11106749 321609359

Expected behaviour

Both files should contain the same number of reads (approximatively).

Actual behaviour

Too much reads are filtered when using --regions option of biostar154220

lindenb commented 2 years ago

@sbrohee thank you for pointing out this problem. when using the option '--region' it looks like there is a problem in the IntervalFilter class of the htsjdk or I'm using it badly (?). If hope I fixed the problem using another strategy.

https://github.com/lindenb/jvarkit/commit/d29b24f2b39a2a46f7d056f5e7605eff8389b5e9

sbrohee commented 2 years ago

Hi Pierre,

I restarted the command with the last version on the git and it delivered the expected results

-> with samtools : 651918 reads in the output bam -> with jvarkit : 651923 reads in the output bam

Many thanks and thank you for your fast answer.

Sylvain