s4hts / HTStream

A high throughput sequence read toolset using a streaming approach facilitated by Linux pipes
https://s4hts.github.io/HTStream/
Apache License 2.0
49 stars 9 forks source link

Order of input files to hts_SeqScreener changes hits reported when R1/R2 lengths differ #253

Open AstrobioMike opened 1 year ago

AstrobioMike commented 1 year ago

Describe the bug Heya, thanks for maintaining HTStream :)

For the attached example files, the number of hits identified by hts_SeqScreener vary depending on whether the forward reads or reverse reads were provided first.

Overall summary:

how-run num_hits
R1 SE: -U test-R1.fq.gz 7
R2 SE: -U test-R2.fq.gz 2021
R1,R2 PE: -1 test-R1.fq.gz -2 test-R2.fq.gz 3348
R2,R1 PE: -1 test-R2.fq.gz -2 test-R1.fq.gz 2021

For the data I was working with when running into this, attached, R1 length is 29 bases and R2 is 121. So the disparity may have to deal with the length of read 1 and if that's used for calculating the percentage-hit cutoff without consideration for read 2 length/total fragment length?

To Reproduce Example sequence reads that will help reproduce the bug:

ref.fa.gz test-R1.fq.gz test-R2.fq.gz

Commands to reproduce the behavior:

mamba create -y -n hts -c conda-forge -c bioconda -c defaults htstream=1.3.3

conda activate hts

### with files attached to issue

## running single
# R1
hts_SeqScreener -U test-R1.fq.gz -s ref.fa.gz -L hts-R1-SE.log > /dev/null
grep -A 5 "Single_end" hts-R1-SE.log | tail -n 1 | sed 's/^/# /'
#         "hits": 7

# R2
hts_SeqScreener -U test-R2.fq.gz -s ref.fa.gz -L hts-R2-SE.log > /dev/null
grep -A 5 "Single_end" hts-R2-SE.log | tail -n 1 | sed 's/^/# /'
#         "hits": 2021

## running together
# R1 first
hts_SeqScreener -1 test-R1.fq.gz -2 test-R2.fq.gz -s ref.fa.gz -C -L hts-R1-first.log > /dev/null
grep -A 3 "Paired_end" hts-R1-first.log | tail -n 1
    # 3348 hits

# R2 first
hts_SeqScreener -1 test-R2.fq.gz -2 test-R1.fq.gz -s ref.fa.gz -C -L hts-R2-first.log > /dev/null
grep -A 3 "Paired_end" hts-R2-first.log | tail -n 1
    # 2021 hits

Expected behavior I would have expected the results to be the same whether R1 or R2 were provided first when given as paired with the -C option set. But the order provided makes a large difference.

Desktop (please complete the following information):