jsgounot opened this issue 2 years ago
Hi Sebastien,
Very informative comment. I think this can be easily fixed.
The script can go into the `bin/` folder (which I will create), making it callable by the pipeline. `pysam` has to be added to the container docker file `docker/Dockerfile.decont`. The current pipeline uses samtools v1.9, so we might need to figure out a compatible pysam version.
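For the version check, something like this could work as a quick sanity check (a sketch only; which pysam release actually matches htslib/samtools 1.9 still needs to be confirmed against the pysam changelog):

```python
# Quick sanity check inside the container (a sketch, not a pinned requirement).
# Goal: confirm that the installed pysam release is built against htslib 1.9,
# matching the samtools 1.9 already used by the pipeline.
import pysam
import pysam.version

print("pysam :", pysam.__version__)
# Recent pysam releases expose the bundled htslib version; fall back if this one does not.
print("htslib:", getattr(pysam.version, "__htslib_version__", "check the pysam changelog"))
```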
I can take a stab at fixing it, and I might need your help to test.
Chenhao
Pushed to my own fork here. Could you help to test?
Hey,
I've had some difficulties running the script on my AWS instance; it looks like the pipeline is not functional with the latest nextflow versions. I've asked someone who used the older version on NSCC to test it for me, so hopefully this should work. If you want to quickly try it on your side, you can use this sample. The previous version should remove more than 60% of the reads, while the new one should remove less than 1%. I've been using the latest T2T reference genome (GCF_009914755.1), but there should not be much variation if you keep using hg19.
Best, JS
Hi Chenhao,
it looks like your pipeline can't be used with the latest version of nextflow, and we have issues installing the latest pysam version on NSCC. Would you be able to update your nextflow script to make it work with a recent nextflow version?
Regards, JS
Hi,
I need to know the exact issue, because I have no problem with the nextflow version.
Hey,
I'm using the latest nextflow version (22.10.1).
First I had to change the tuple declarations:
```
ubuntu$ rm chenhao_new/pipeline_info/shotgunmetagenomics-nf_trace.txt && ~/Software/nextflow ~/Software/shotgunmetagenomics-nf/main.nf --outdir chenhao_new --reads /mnt/volume1/Projects/decont/data/MBS1230.*.fq.gz --decont_refpath GRCH38/ --decont_index grch38.ref.fasta.gz --profilers ''
N E X T F L O W ~ version 22.10.1
Launching `/home/ubuntu/Software/shotgunmetagenomics-nf/main.nf` [pensive_mcclintock] DSL2 - revision: 58cb5d678c
Unqualified output value declaration has been deprecated - replace `tuple prefix,..` with `tuple val(prefix),..`
-- Check script '/home/ubuntu/Software/shotgunmetagenomics-nf/./modules/decont.nf' at line: 17 or see '.nextflow.log' file for more details
```
But even with this I hit an issue:
```
ubuntu$ rm chenhao_new/pipeline_info/shotgunmetagenomics-nf_trace.txt && ~/Software/nextflow ~/Software/shotgunmetagenomics-nf/main.nf --outdir chenhao_new --reads /mnt/volume1/Projects/decont/data/MBS1230.*.fq.gz --decont_refpath GRCH38/ --decont_index grch38.ref.fasta.gz --profilers ''
N E X T F L O W ~ version 22.10.1
Launching `/home/ubuntu/Software/shotgunmetagenomics-nf/main.nf` [special_meucci] DSL2 - revision: 58cb5d678c
[- ] process > DECONT -
[- ] process > DECONT [ 0%] 0 of 1
[6f/4de2fb] process > DECONT (MBS1230) [100%] 1 of 1, failed: 1
[6f/4de2fb] process > DECONT (MBS1230) [100%] 1 of 1, failed: 1
[6f/4de2fb] process > DECONT (MBS1230) [100%] 1 of 1, failed: 1
Execution cancelled -- Finishing pending tasks before exit
WARN: Input tuple does not match input set cardinality declared by process `DECONT` -- offending value: [MBS1230, /mnt/volume1/Projects/decont/data/MBS1230.1.fq.gz]
WARN: Failed to render execution report -- see the log file for details
WARN: Failed to render execution timeline -- see the log file for details
WARN: Graphviz is required to render the execution DAG in the given format -- See http://www.graphviz.org for more info.
Error executing process > 'DECONT (MBS1230)'

Caused by:
  Process requirement exceeds available CPUs -- req: 12; avail: 2

Command executed:

  fastp -i MBS1230.1.fq.gz -I input.3 --stdout -j MBS1230.json -h MBS1230.html | \
  bwa mem -k 19 -p -t 12 GRCH38/grch38.ref.fasta.gz - | \
  decont_filter.py | \
  samtools fastq -f12 -F256 -1 MBS1230_fastpdecont_1.fastq.gz -2 MBS1230_fastpdecont_2.fastq.gz -s MBS1230_fastpdecont_single.fastq.gz -

Command exit status:
  -

Command output:
  (empty)

Work dir:
  /mnt/volume1/Projects/decont/work/6f/4de2fbf823a804264966a24ef02fe7

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
```
I could try to downgrade my version to match yours, but I'm not sure that's the best option.
Thanks, JS
Never mind, I'm able to run the pipeline now. I will update you once I have results.
OK, I confirm it's working, but I still had to modify the tuple declarations for the latest nextflow version (it's a minor fix though).
If someone needs to quickly check again, you can use this command:
```
zcat ERR7671938_fastpdecont_1.fastq.gz | awk 'NR % 4 == 1' | sort -u | wc -l
```
Values may vary a bit depending on your reference sequence.
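Or, the same check in Python together with the fraction of reads removed (the raw input file name below is just a placeholder, adjust it to your own paths):

```python
# Count distinct read names in (gzipped) FASTQ files, like the awk one-liner above,
# and report the fraction of reads removed by decontamination.
import gzip

def count_read_names(path):
    names = set()
    with gzip.open(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:                  # FASTQ header lines
                names.add(line.split()[0])  # keep only the read name, drop the comment
    return len(names)

raw = count_read_names("ERR7671938_1.fastq.gz")                # placeholder: raw input reads
kept = count_read_names("ERR7671938_fastpdecont_1.fastq.gz")   # decontaminated output
print(f"kept {kept}/{raw} reads ({100 * (1 - kept / raw):.1f}% removed)")
```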
Thank you Chenhao. JS
Thank you for the feedback, Sebastien. I will work out an updated pipeline for better compatibility.
Hi Sebastien,
I recently encountered a huge loss of reads using the same approach here compared to kneaddata. I wonder what insert size you got for your samples. The ones I tested have insert sizes of 200-350 bp.
Chenhao
Hey,
I tested multiple samples with different insert sizes. Usually the insert size was ~300 bp, but sometimes lower. With 150 bp paired-end reads, this means a very low or even negative inner distance. The negative inner distance (meaning the two mates overlap) is not the primary source of the error, it just makes it worse. The main issue is that with default BWA parameters, a 19 bp seed is not hard to find against a human genome.
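To give an order of magnitude, a very rough, idealised calculation (it assumes a random genome and exact seed matches, so it ignores repeats and BWA's re-seeding; real numbers are worse, not better):

```python
# Expected number of exact matches of a random k-mer in a genome of size G,
# counting both strands: ~ 2 * G / 4**k. A read of length L offers ~ L - k + 1 seed positions.
GENOME_SIZE = 3.1e9   # human genome, approximate
READ_LENGTH = 150

for k in (19, 25, 31):
    per_seed = 2 * GENOME_SIZE / 4 ** k
    per_read = per_seed * (READ_LENGTH - k + 1)
    print(f"k={k}: ~{per_seed:.2g} random hits per seed, ~{per_read:.2g} per {READ_LENGTH} bp read")
```

With the default k=19 you expect a few spurious seed hits per read just by chance; by k=31 the expectation is essentially zero.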
JS
I found this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7478626/pdf/mgen-6-393.pdf
Do you think we should just use bowtie2 for this purpose?
I ran a quick experiment:
```
file                                          format  type  num_seqs   sum_len      min_len  avg_len  max_len
adapterTrimmed.1_kneaddata_paired_1.fastq.gz  FASTQ   DNA   6,989,603  640,413,464  50       91.6     135
test.bowtie2_fastpdecont_1.fastq.gz           FASTQ   DNA   6,591,521  527,584,547  21       80       135
test.bowtie2.m38_fastpdecont_1.fastq.gz       FASTQ   DNA   5,534,518  438,489,518  21       79.2     135
test_fastpdecont_1.fastq.gz                   FASTQ   DNA   414,462    39,252,775   15       94.7     135
```
These 4 rows correspond, respectively, to: Kneaddata (customized bowtie2); fastp + bowtie2 + samtools with the kneaddata database; fastp + bowtie2 + samtools with GRCm38; and fastp + bwa + samtools with GRCm38.
This dataset is a low-biomass sample with lots of host DNA. I previously took the reads left by kneaddata and assembled them; however, many of the resulting contigs mapped to the host... So I think that while BWA is too stringent, bowtie2 is too lenient?
With Niranjan, we ended up agreeing on using BWA with more stringent filtering applied directly on the BAM file. I don't think one mapper is very different from the other in this case. While they do have different approaches, the main problem I've seen is the difference in alignment encoding, with BWA being more lenient in accepting a pair as properly paired, while Bowtie2 and minimap2 look more stringent (and correct). Rather than which mapper you're using, it's more about which options you use and how you post-process your alignments. We sometimes forget that mappers were originally designed for isolate and (human) population genomics purposes, not for decontamination or multi-mapping.
But maybe we are overthinking this a bit. Just increasing the BWA seed length (the -k parameter, 19 by default) was enough to fix most of the issue.
It seems that kneaddata is also using bowtie2. That being said, keep me updated if you find something new; I'm interested in the subject. Thanks!
JS
Hey Chenhao, I found out that the new version is not behaving correctly with some samples, letting some host reads through the decontamination process. Until I find out why, I think it would be safer to revert to the previous version.
Hi Chenhao,
we discovered an issue with the decontamination pipeline from this repo. In short, the BWA default seed (k=19) allows 'random' alignments of bacterial reads against the human genome. This issue can be significant when the insert size is low and the read pairs overlap a lot. I have some samples with more than 50% of raw reads misclassified as human because of this. I can send you an email with slides for more info if you want.
Since some people in the lab keep using your pipeline for their analyses, I think it would be a good idea to fix the issue in your original repo and update the CSB5 one once done. The issue is that I don't know much about nextflow; I only use snakemake.
To fix this, I made a very simple Python script that filters out reads considered human based on (subjective) thresholds of 70% coverage and 70% identity. Based on my analysis, this removes most of the potential false positives.
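The idea, roughly (this is a minimal sketch, not the exact script; the thresholds are the ones above, but the flag handling is simplified and mate flags are not touched, which is the second-step check mentioned below):

```python
#!/usr/bin/env python3
# Sketch of the coverage/identity filter: read SAM from stdin (the output of bwa mem),
# keep only confident human hits as mapped, and flag everything else as unmapped so
# the downstream samtools fastq step can extract it as non-host reads.
import pysam

MIN_COVERAGE = 0.7   # fraction of the read covered by the alignment
MIN_IDENTITY = 0.7   # approximate identity over the aligned part (from the NM tag)

def looks_human(read):
    if read.is_unmapped:
        return False
    aligned = read.query_alignment_length
    length = read.query_length or read.infer_read_length() or 0
    if not aligned or not length:
        return False
    coverage = aligned / length
    edits = read.get_tag("NM") if read.has_tag("NM") else 0
    identity = 1 - edits / aligned
    return coverage >= MIN_COVERAGE and identity >= MIN_IDENTITY

with pysam.AlignmentFile("-", "r") as sam_in, \
     pysam.AlignmentFile("-", "w", template=sam_in) as sam_out:
    for read in sam_in:
        if not looks_human(read):
            read.is_unmapped = True   # low-confidence human hit: treat as non-host
        sam_out.write(read)
```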
For speed purposes, I did not implement a second-step check of whether the mate is unmapped or not, but this can be handled with samtools fastq using the -s option. Here is my snakemake shell line:

Do you think you could quickly update your repo to reflect those changes? Let me know if you're too busy and I will try to see how I can do it on my own.
Regards, JS