lch14forever / shotgunmetagenomics-nf

MIT License

Contamination issue #7

Open jsgounot opened 2 years ago

jsgounot commented 2 years ago

Hi Chenhao,

We discovered an issue with the contamination pipeline from this repo. In short, BWA's default seed length (k=19) allows 'random' alignments of bacterial reads against the human genome. The issue can be significant when the insert size is low and the read pairs overlap substantially. I have some samples with more than 50% of raw reads misclassified as human because of this. I can send you an email with slides for more info if you want.
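
As a rough back-of-the-envelope illustration of why such a short seed is a problem (this assumes a naive uniform-random genome model, so the numbers are only indicative; real genomes are repetitive, which makes things worse):

# Back-of-the-envelope estimate of chance 19-mer matches against a human-sized genome.
# Naive uniform-random model; only indicative of the order of magnitude.
genome_len = 3.1e9               # approximate human genome size in bp
positions = 2 * genome_len       # both strands
read_len = 150
seed_len = 19                    # bwa mem default -k

p_seed_hit = positions / 4 ** seed_len       # expected occurrences of one specific 19-mer (~0.02)
seeds_per_read = read_len - seed_len + 1     # overlapping 19-mers in a 150 bp read (132)
expected_hits = seeds_per_read * p_seed_hit  # expected chance seed matches per read (~3)

print(f"one specific 19-mer matches by chance ~ {p_seed_hit:.3f}")
print(f"expected chance seed hits per 150 bp read ~ {expected_hits:.1f}")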

Since some people in the lab keep using your pipeline for their analyses, I think it would be a good idea to fix the issue in your original repo and then update the CSB5 one once done. The problem is that I don't know much about Nextflow; I only use Snakemake.

To fix this, I wrote a very simple Python script that filters reads flagged as human based on (subjective) thresholds of 70% coverage and 70% identity. Based on my analysis, this removes most of the potential false positives.

# -*- coding: utf-8 -*-
# @Author: jsgounot
# @Date:   2022-06-28 17:38:49
# @Last Modified by:   jsgounot
# @Last Modified time: 2022-09-30 20:08:43

import argparse
import pysam
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--min_coverage", dest="min_coverage", default=0.7, type=float, help="a minimum read coverage [0,1] (default: 0.8)", required=False)
parser.add_argument("--min_identity", dest="min_identity", default=0.7, type=float, help="a minimum alignement identity [0,1] (default: 0.9)", required=False)
parser.add_argument("--input", dest="input", default="-", type=str, help="input file (default: stdin)", required=False)
parser.add_argument("--output", dest="output", default="-", type=str, help="output file (default: stdout)", required=False)

args = parser.parse_args()

bamfile = pysam.AlignmentFile(args.input, "rb")
outfile = pysam.AlignmentFile(args.output, "wb", template=bamfile)

fixes = 0
identity_threshold = 1 - args.min_identity

for idx, read in enumerate(bamfile.fetch(until_eof=True)):
    correctly_mapped = False

    if read.infer_query_length() is not None and read.is_proper_pair:
        len_ops, num_ops = read.get_cigar_stats()

        # coverage = (M + I + D) / read_length and error rate = NM / (M + I + D)
        # get_cigar_stats()[0] holds per-operation base counts: indices 0-2 are M, I, D; index 10 is NM
        mid = sum(len_ops[0:3])

        if (mid / read.infer_read_length() >= args.min_coverage) and (len_ops[10] / mid <= identity_threshold):
            correctly_mapped = True

    if not correctly_mapped:
        # Flag the read as unmapped so it is rescued into the decontaminated output
        # (read.is_mapped is only settable in recent pysam versions; read.is_unmapped covers older ones)
        read.is_mapped = False
        read.is_proper_pair = False
        read.is_unmapped = True

        # This one is tricky, but it is needed so the pair carries flag 12
        # (0x4 read unmapped + 0x8 mate unmapped) and is selected by `samtools fastq -f12` downstream
        read.mate_is_unmapped = True
        fixes += 1

    outfile.write(read)

bamfile.close()
outfile.close()

sys.stderr.write('filter done with %i corrections out of %i reads\n' % (fixes, idx + 1))
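
To make the two thresholds concrete, here is a small worked example; the helper function and the numbers are hypothetical and simply mirror the logic of the script above:

# Hypothetical worked example of the two metrics used by the script above
# (True = alignment accepted, read treated as real human contamination).
def is_real_human_hit(m, i, d, nm, read_length, min_coverage=0.7, min_identity=0.7):
    mid = m + i + d                   # aligned bases: M + I + D from the CIGAR
    coverage = mid / read_length      # (M + I + D) / read_length
    identity = 1 - nm / mid           # 1 - NM / (M + I + D)
    return coverage >= min_coverage and identity >= min_identity

# 150 bp read aligned as 100M2I48S with NM=5: coverage = 102/150 = 0.68 < 0.7
print(is_real_human_hit(m=100, i=2, d=0, nm=5, read_length=150))   # False -> read is rescued
# 150 bp read aligned as 145M2I1D3S with NM=5: coverage ~ 0.99, identity ~ 0.97
print(is_real_human_hit(m=145, i=2, d=1, nm=5, read_length=150))   # True  -> read stays flagged as human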

For speed purposes, I did not implement a second pass that checks whether the mate is unmapped or not, but this can be handled with samtools fastq using the -s option (which writes singleton reads to a separate file). Here is my Snakemake shell line:

fastp -i {input.r1} -I {input.r2} --stdout -j {output.js} -h {output.ht} | bwa mem -p -k 19 -t {threads} {input.human_ref} - | python {params.pyscript} | samtools fastq -f12 -F256 -1  {output.o1} -2 {output.o2} -s {output.os} -

Do you think you could quickly update your repo to reflect those changes? Let me know if you're too busy and I will try to see how to do it on my own.

Regards, JS

lch14forever commented 2 years ago

Hi Sebastien,

Very informative comment. I think this can be easily fixed.

The current pipeline uses samtools v1.9, so we might need to figure out a compatible pysam version.

I can take a stab at fixing it, and I might need your help to test.

Chenhao

lch14forever commented 2 years ago

Pushed to my own fork here. Could you help to test?

jsgounot commented 2 years ago

Hey,

I'm having some difficulties running the script on my AWS instance; it looks like the pipeline is not functional with the latest Nextflow versions. I've asked someone who uses the older version on NSCC to test it for me; hopefully this will work. If you want to quickly try it on your side, you can use this sample. The previous version should remove more than 60% of the reads, while the new one should remove less than 1%. I've been using the latest T2T reference genome (GCF_009914755.1), but there should not be much variation if you keep using hg19.

Best, JS

jsgounot commented 2 years ago

Hi Chenhao,

it looks like your pipeline can't be used with the latest version of Nextflow, and we are having issues installing the latest pysam version on NSCC. Would you be able to update your Nextflow script to make it work with recent Nextflow versions?

Regards, JS

lch14forever commented 2 years ago

Hi,

I need to know the exact issue, because I have no problem with my Nextflow version.

jsgounot commented 2 years ago

Hey,

I'm using the latest Nextflow version (22.10.1).

First I had to change the tuple declarations:

ubuntu$rm chenhao_new/pipeline_info/shotgunmetagenomics-nf_trace.txt && ~/Software/nextflow ~/Software/shotgunmetagenomics-nf/main.nf --outdir chenhao_new --reads /mnt/volume1/Projects/decont/data/MBS1230.*.fq.gz --decont_refpath GRCH38/ --decont_index grch38.ref.fasta.gz --profilers ''
N E X T F L O W  ~  version 22.10.1
Launching `/home/ubuntu/Software/shotgunmetagenomics-nf/main.nf` [pensive_mcclintock] DSL2 - revision: 58cb5d678c
Unqualified output value declaration has been deprecated - replace `tuple prefix,..` with `tuple val(prefix),..`

 -- Check script '/home/ubuntu/Software/shotgunmetagenomics-nf/./modules/decont.nf' at line: 17 or see '.nextflow.log' file for more details

But even with this I hit an issue:

ubuntu$ rm chenhao_new/pipeline_info/shotgunmetagenomics-nf_trace.txt && ~/Software/nextflow ~/Software/shotgunmetagenomics-nf/main.nf --outdir chenhao_new --reads /mnt/volume1/Projects/decont/data/MBS1230.*.fq.gz --decont_refpath GRCH38/ --decont_index grch38.ref.fasta.gz --profilers ''
N E X T F L O W  ~  version 22.10.1
Launching `/home/ubuntu/Software/shotgunmetagenomics-nf/main.nf` [special_meucci] DSL2 - revision: 58cb5d678c
[-        ] process > DECONT -
[-        ] process > DECONT [  0%] 0 of 1
[6f/4de2fb] process > DECONT (MBS1230) [100%] 1 of 1, failed: 1
[6f/4de2fb] process > DECONT (MBS1230) [100%] 1 of 1, failed: 1
[6f/4de2fb] process > DECONT (MBS1230) [100%] 1 of 1, failed: 1
Execution cancelled -- Finishing pending tasks before exit
WARN: Input tuple does not match input set cardinality declared by process `DECONT` -- offending value: [MBS1230, /mnt/volume1/Projects/decont/data/MBS1230.1.fq.gz]
WARN: Failed to render execution report -- see the log file for details
WARN: Failed to render execution timeline -- see the log file for details
WARN: Graphviz is required to render the execution DAG in the given format -- See http://www.graphviz.org for more info.
Error executing process > 'DECONT (MBS1230)'

Caused by:
  Process requirement exceeds available CPUs -- req: 12; avail: 2

Command executed:

  fastp -i MBS1230.1.fq.gz -I input.3 --stdout -j MBS1230.json -h MBS1230.html | \
  bwa mem -k 19 -p -t 12 GRCH38/grch38.ref.fasta.gz - | \
  decont_filter.py | \
  samtools fastq -f12 -F256  -1  MBS1230_fastpdecont_1.fastq.gz -2 MBS1230_fastpdecont_2.fastq.gz -s MBS1230_fastpdecont_single.fastq.gz  -

Command exit status:
  -

Command output:
  (empty)

Work dir:
  /mnt/volume1/Projects/decont/work/6f/4de2fbf823a804264966a24ef02fe7

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

I could try to downgrade my Nextflow version to match yours, but I'm not sure that's the best option.

Thanks, JS

jsgounot commented 2 years ago

Never mind, I'm able to run the pipeline now. I will update you once I have results.

jsgounot commented 2 years ago

OK, I confirm it's working, but I still had to modify the tuple declarations for the latest Nextflow version (it's a minor fix, though).

If someone needs to quickly check again, this command counts the unique read headers left in the decontaminated FASTQ:

zcat ERR7671938_fastpdecont_1.fastq.gz | awk 'NR % 4 == 1' | sort -u | wc -l

Values may vary a bit depending on your reference sequence.

Thank you Chenhao. JS

lch14forever commented 2 years ago

Thank you for the feedback Sebastien. I will work out an updated pipeline for better compatibility.

lch14forever commented 1 year ago

Hi Sebastien,

I recently encountered a huge loss of reads using the same approach here compared to KneadData. I wonder what insert sizes you got for your samples. The ones I tested have insert sizes of 200-350 bp.

Chenhao

jsgounot commented 1 year ago

Hey,

I tested multiple samples with different insert sizes. Usually the insert size was ~300 bp, but sometimes lower. With 150 bp paired-end reads, this means a very low or even negative inner distance (e.g. a 300 bp insert leaves 300 - 2 x 150 = 0 bp between the mates; shorter inserts make the mates overlap). The negative inner distance is not the primary source of the error; it just makes it worse. The main issue is that with default BWA parameters, a 19 bp seed match against the human genome is not hard to find.

JS

lch14forever commented 1 year ago

I found this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7478626/pdf/mgen-6-393.pdf

Do you think we should just use bowtie2 for this purpose?

lch14forever commented 1 year ago

I ran a quick experiment:

file                                          format  type  num_seqs   sum_len      min_len  avg_len  max_len
adapterTrimmed.1_kneaddata_paired_1.fastq.gz  FASTQ   DNA   6,989,603  640,413,464  50       91.6     135
test.bowtie2_fastpdecont_1.fastq.gz           FASTQ   DNA   6,591,521  527,584,547  21       80       135
test.bowtie2.m38_fastpdecont_1.fastq.gz       FASTQ   DNA   5,534,518  438,489,518  21       79.2     135
test_fastpdecont_1.fastq.gz                   FASTQ   DNA   414,462    39,252,775   15       94.7     135

These four rows correspond to: KneadData (customized Bowtie 2), fastp+bowtie2+samtools with the KneadData database, fastp+bowtie2+samtools with the GRCm38 database, and fastp+bwa+samtools with GRCm38.

This dataset is a low-biomass sample with lots of host DNA. I previously took the reads left over from KneadData and assembled them. However, many of the resulting contigs mapped to the host... So I think that while BWA is too stringent, Bowtie 2 is too lenient?

jsgounot commented 1 year ago

With Niranjan, we ended up agreeing on using BWA with more stringent filtering applied directly to the BAM file. I don't think one mapper is very different from the other in this case. While they do have different approaches, the main problem I've seen is the difference in alignment encoding, with BWA being more lenient in accepting a pair as properly paired, while Bowtie 2 and minimap2 look more stringent (and correct). Rather than which mapper you're using, it's more about which options you're using and how you post-process your alignments. We sometimes forget that mappers were originally designed for isolate and (human) population genomics purposes, not for decontamination or multi-mapping.

But maybe we are overthinking this a bit. Just increasing the BWA seed length (the -k parameter, default 19) was enough to fix most of the issue:

[image: results after increasing the BWA seed length]

It seems that KneadData also uses Bowtie 2 internally. That being said, keep me updated if you find something new; I'm interested in the subject. Thanks!

JS

jsgounot commented 7 months ago

Hey Chenhao, I found out that the new version is not behaving correctly with some samples, letting some host reads leak through during the decontamination process. Until I find out why, I think it would be safer to revert to the previous version.