nf-core / pathogensurveillance

Surveillance of pathogens using population genomics and sequencing
https://nf-co.re/pathogensurveillance
MIT License

Mixed dataset issue-BBsketch misclassified one sample as a pseudomonad rather than nematode and spades ran out of memory #69

Closed masudermann closed 6 months ago

masudermann commented 6 months ago

Description of the bug

Input test dataset was 'mixed.csv'

The pipeline ran as expected (though it errored out when it couldn't handle the Fusarium sample, which had both PE and SE raw reads downloaded; this was reported previously).

At the SPAdes step, it tried to assemble the downloaded reads (accession ERR3842626) and reported that it ran out of memory. We don't expect an assembly for this sample.

Upon closer inspection, I realized that this sample was misclassified as a bacterium during the sendsketch initial classification. (Likely some reads were contaminated).

What is concerning is that the two classifications below were the only ones given:

Pseudomonas sp. Irchel 3H3
Pseudomonas sp. Irchel s3h17

How are the SendSketch results filtered and then reported?

I worry, especially for eukaryotic samples, that if we filter these SendSketch results to include only the top hits, the relevant hit won't be included. I've noticed that when a sample has some contamination (or is misclassified because of limited database resources), the proper assignment (sometimes only to genus level) doesn't appear until later in the files.
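To make the concern concrete, here is a minimal sketch of how a strict top-N cut can drop the relevant hit, compared with keeping the best hit per genus. Everything in it is an assumption for illustration: the hit list is a simplified stand-in rather than real SendSketch output, the scores and the third hit are made up, and `top_n` and `best_hit_per_genus` are hypothetical helpers, not functions from the pipeline.

```python
from typing import Dict, List

Hit = Dict[str, object]

# Hypothetical hit list: the two Pseudomonas names come from this report,
# but the scores and the third (nematode) hit are made up for illustration.
hits: List[Hit] = [
    {"taxon": "Pseudomonas sp. Irchel 3H3", "score": 98.1},
    {"taxon": "Pseudomonas sp. Irchel s3h17", "score": 97.9},
    {"taxon": "Meloidogyne sp.", "score": 92.4},
]


def top_n(hits: List[Hit], n: int) -> List[Hit]:
    """Keep only the n highest-scoring hits."""
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:n]


def best_hit_per_genus(hits: List[Hit]) -> List[Hit]:
    """Keep the highest-scoring hit for each genus (first word of the name)."""
    best: Dict[str, Hit] = {}
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        genus = str(hit["taxon"]).split()[0]
        # Hits are visited best-first, so the first one seen per genus wins.
        best.setdefault(genus, hit)
    return list(best.values())


# A top-2 cut keeps only the two Pseudomonas strains.
print([h["taxon"] for h in top_n(hits, 2)])
# Grouping by genus keeps one Pseudomonas hit and the later nematode hit.
print([h["taxon"] for h in best_hit_per_genus(hits)])
```

With a contaminated sample, near-identical strains of the contaminant can fill every top slot, so a per-genus (or per-family) summary may be a safer basis for choosing references than a fixed number of top hits.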

Command used and terminal output

# testdataset file:
sample_id,sra,color_by,organism_group,kingdom,report
PHW726_fox_matthiolae,SRR10432276,organism_group,fungi,fungi,mixed;fungi
RKN_Menterolobii,ERR3842626,organism_group,nematode,metazoa,mixed;nematode
BDM_Pbelbahrii,ERR1578941,organism_group,oomycete,sar,mixed;oomycete
HoneyBee_Adorsata,SRR1564144,organism_group,insect,metazoa,mixed;insect
Rsol_Rsolanacearum,SRR19621834,organism_group,bacteria,bacteria,mixed;bacteria
PpalZOC03_Ppalmivora,SRR10483368,organism_group,oomycete,sar,mixed;oomycete

# Command
nextflow run main.nf --input /home/marthasudermann/pathogensurveillance/test/data/metadata/mixed.csv --outdir test_mixed_camilo --bakta_db /home/marthasudermann/Software/bakta_db_02_2024/db/ -profile docker -resume
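Since the samplesheet above already declares a kingdom for each sample, one possible safeguard is to compare that value with the initial classification before attempting a bacterial assembly. The following is only a sketch under stated assumptions: `flag_kingdom_mismatches` and the `classified` mapping are hypothetical and not part of the pipeline, and only a subset of the samplesheet rows is reproduced.

```python
import csv
import io

# Subset of the samplesheet from this report (header plus three rows).
SAMPLESHEET = """sample_id,sra,color_by,organism_group,kingdom,report
PHW726_fox_matthiolae,SRR10432276,organism_group,fungi,fungi,mixed;fungi
RKN_Menterolobii,ERR3842626,organism_group,nematode,metazoa,mixed;nematode
Rsol_Rsolanacearum,SRR19621834,organism_group,bacteria,bacteria,mixed;bacteria
"""


def flag_kingdom_mismatches(samplesheet_text, classified):
    """Return (sample_id, declared, observed) for samples whose classified
    kingdom disagrees with the kingdom declared in the samplesheet."""
    mismatches = []
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        observed = classified.get(row["sample_id"])
        # Samples without a classification are skipped rather than flagged.
        if observed is not None and observed != row["kingdom"]:
            mismatches.append((row["sample_id"], row["kingdom"], observed))
    return mismatches


# Hypothetical classification results keyed by sample_id (assumed structure).
classified = {
    "RKN_Menterolobii": "bacteria",    # the misclassification described here
    "Rsol_Rsolanacearum": "bacteria",  # agrees with the samplesheet
}

print(flag_kingdom_mismatches(SAMPLESHEET, classified))
```

A flagged sample could then be skipped at the assembly step or reported as a warning instead of being passed on to SPAdes.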

Relevant files

executor > local (39)
[6d/2b8e9d] process > PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (mixed.csv) [100%] 1 of 1, cached: 1 ✔
[c2/9657d6] process > PATHOGENSURVEILLANCE:SRATOOLS_FASTERQDUMP (HoneyBee_Adorsata) [100%] 6 of 6, cached: 6 ✔
[- ] process > PATHOGENSURVEILLANCE:DOWNLOAD_ASSEMBLIES -
[- ] process > PATHOGENSURVEILLANCE:SEQKIT_SLIDING -
[bb/7f2ba1] process > PATHOGENSURVEILLANCE:FASTQC (Rsol_Rsolanacearum) [100%] 6 of 6, cached: 6 ✔
[69/fc7c81] process > PATHOGENSURVEILLANCE:COARSE_SAMPLE_TAXONOMY:BBMAP_SENDSKETCH (PpalZOC03_Ppalmivora) [100%] 6 of 6, cached: 6 ✔
[0d/1b93c6] process > PATHOGENSURVEILLANCE:COARSE_SAMPLE_TAXONOMY:INITIAL_CLASSIFICATION (BDM_Pbelbahrii) [100%] 6 of 6, cached: 6 ✔
[4d/9abcaa] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:FIND_ASSEMBLIES (Megachilidae) [100%] 20 of 20, cached: 20 ✔
[b9/ce65c6] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:PICK_ASSEMBLIES (BDM_Pbelbahrii) [100%] 6 of 6, cached: 6 ✔
[25/baf51b] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:DOWNLOAD_ASSEMBLIES (GCA_022627115_1) [100%] 91 of 91, cached: 90 ✔
[c2/f70553] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:MAKE_GFF_WITH_FASTA (GCA_022627115_1) [100%] 91 of 91, cached: 90 ✔
[26/b96e84] process > PATHOGENSURVEILLANCE:DOWNLOAD_REFERENCES:SOURMASH_SKETCH_GENOME (GCA_022627115_1) [100%] 91 of 91, cached: 90 ✔
[4a/ca36aa] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SUBSET_READS (Rsol_Rsolanacearum) [100%] 6 of 6, cached: 6 ✔
[fa/232d14] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:KHMER_TRIMLOWABUND (BDM_Pbelbahrii) [100%] 6 of 6, cached: 6 ✔
[cb/09ece6] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_SKETCH_READS (PpalZOC03_Ppalmivora) [100%] 6 of 6, cached: 6 ✔
[- ] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_SKETCH_GENOME -
[46/21755a] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:SOURMASH_COMPARE (all) [100%] 1 of 1 ✔
[ec/3bf96f] process > PATHOGENSURVEILLANCE:ASSIGN_REFERENCES:ASSIGN_GROUP_REFERENCES (all) [100%] 1 of 1 ✔
[3a/1ed958] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:REFERENCE_INDEX:PICARD_CREATESEQUENCEDICTIONARY (GCA_000365545_1) [100%] 6 of 6, cached: 4 ✔
[72/5f5cc9] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:REFERENCE_INDEX:SAMTOOLS_FAIDX (GCF_001855495_2_genomic.fna) [100%] 6 of 6, cached: 4 ✔
[ce/d7321d] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:REFERENCE_INDEX:BWA_INDEX (GCF_014066325_1) [100%] 6 of 6, cached: 3 ✔
[d8/b2abac] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:CALCULATE_DEPTH (GCF_900187635_1_RKN_Menterolobii) [100%] 6 of 6, cached: 1 ✔
[da/deef4b] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:SUBSET_READS (GCF_900187635_1_RKN_Menterolobii) [100%] 6 of 6, cached: 1 ✔
[f1/2c75f8] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:BWA_MEM (GCF_900187635_1_RKN_Menterolobii) [100%] 3 of 3, cached: 1
[25/0cf853] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:PICARD_ADDORREPLACEREADGROUPS (GCA_900096695_1_PHW726_fox_matthiolae) [100%] 3 of 3, cached: 1
[82/e4d428] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:PICARD_SORTSAM_1 (GCA_900096695_1_PHW726_fox_matthiolae) [100%] 3 of 3, cached: 1
[41/1a851c] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:PICARD_MARKDUPLICATES (GCA_900096695_1_PHW726_fox_matthiolae) [100%] 3 of 3, cached: 1
[e1/d747f7] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:PICARD_SORTSAM_2 (GCA_900096695_1_PHW726_fox_matthiolae) [100%] 3 of 3, cached: 1
[10/785b60] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:ALIGN_READS:SAMTOOLS_INDEX (GCA_900096695_1_PHW726_fox_matthiolae) [100%] 3 of 3, cached: 1
[- ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:MAKE_REGION_FILE -
[- ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:GRAPHTYPER_GENOTYPE -
[- ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:GRAPHTYPER_VCFCONCATENATE -
[- ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:TABIX_TABIX -
[- ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:BGZIP_MAKE_GZIP -
[- ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:GATK4_VARIANTFILTRATION -
[- ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:CALL_VARIANTS:VCFLIB_VCFFILTER -
[- ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:VCF_TO_TAB -
[- ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:VCF_TO_SNPALN -
[- ] process > PATHOGENSURVEILLANCE:VARIANT_ANALYSIS:IQTREE2_SNP -
[49/d4b1e1] process > PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:SUBSET_READS (RKN_Menterolobii) [100%] 2 of 2, cached: 2 ✔
[e2/3f9b29] process > PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:FASTP (Rsol_Rsolanacearum) [100%] 2 of 2, cached: 2 ✔
[58/33c597] process > PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:SPADES (RKN_Menterolobii) [100%] 3 of 3, cached: 1, failed: 2, retries:...
[5d/2d4049] process > PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:FILTER_ASSEMBLY (Rsol_Rsolanacearum) [100%] 1 of 1, cached: 1 ✔
[- ] process > PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:QUAST [ 0%] 0 of 1
[d5/62c050] process > PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:BAKTA_BAKTA (Rsol_Rsolanacearum) [100%] 1 of 1, cached: 1 ✔
[- ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:PIRATE [ 0%] 0 of 1
[- ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:REFORMAT_PIRATE_RESULTS -
[- ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:CALCULATE_POCP -
[- ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:ALIGN_FEATURE_SEQUENCES -
[- ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:RENAME_CORE_GENE_HEADERS -
[- ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:SUBSET_CORE_GENES -
[- ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:MAFFT_SMALL -
[- ] process > PATHOGENSURVEILLANCE:CORE_GENOME_PHYLOGENY:IQTREE2_CORE -
[- ] process > PATHOGENSURVEILLANCE:CUSTOM_DUMPSOFTWAREVERSIONS -
[- ] process > PATHOGENSURVEILLANCE:MULTIQC -
[e9/2dab17] process > PATHOGENSURVEILLANCE:RECORD_MESSAGES (All) [ 50%] 1 of 2, cached: 1
[- ] process > PATHOGENSURVEILLANCE:PREPARE_REPORT_INPUT -
[- ] process > PATHOGENSURVEILLANCE:MAIN_REPORT -
ERROR ~ Error executing process > 'PATHOGENSURVEILLANCE:GENOME_ASSEMBLY:SPADES (RKN_Menterolobii)'

System information

No response

masudermann commented 6 months ago

Upon closer inspection, it seems the nematode sample may have been heavily contaminated to begin with.

The Peronospora input data was actually RNA-seq data, not genomic DNA, which could help explain the bizarre classification.

We are rerunning this dataset with a few more samples. As long as the BBSketch results are not filtered stringently, I think these were data-specific rather than pipeline-specific concerns.