meglab-metagenomics / amrplusplus_v2

MEGARes and AmrPlusPlus - A comprehensive database of antimicrobial resistance genes and user-friendly pipeline for analysis of high-throughput sequencing data
http://megares.meglab.org/
MIT License
25 stars 14 forks source link

Clarification on awk step showing $3="RequiresSNPConfirmation" #12

Open MatteoSchiavinato opened 3 years ago

MatteoSchiavinato commented 3 years ago

Hi,

I found myself without any perfect hits on RGI and I was digging within the nextflow code to find the reason. I realized that at line 501 and 630 of the *withRGI_Kraken.nf code there is an awk command that shows $3="RequiresSNPConfirmation".

My understanding is that this command searches for a pattern in the target sequence name, and the pattern is RequiresSNPConfirmation.

Shouldn't it be $3 ~ /RequiresSNPConfirmation/ in that case?

Computational times drop from 3 hours to 5 minutes.

VoronDM commented 3 years ago

I find in nextflow.config:

    /* Location of SNP confirmation script */
    snp_confirmation = "bin/snp_confirmation.py"

But there are no file snp_confirmation.py in /bin and nowhere

I see such output when using testing data:

[ab/0354ae] process > SNPconfirmation (S3_test)                       [100%] 3 of 3, failed: 3 ✔
[-        ] process > Confirmed_AMR_hits                              -
[f6/105481] process > Confirmed_ResistomeResults (null)               [100%] 1 of 1, failed: 1 ✔
[36/3101e4] process > ExtractDedupSNP (S3_test)                       [100%] 3 of 3 ✔
[84/90e800] process > RunDedupRGI (S3_test)                           [100%] 3 of 3 ✔
[91/322d1f] process > DedupSNPconfirmation (S3_test)                  [100%] 3 of 3, failed: 3 ✔
[-        ] process > ConfirmDedupAMRHits                             -
[81/aec48e] process > DedupSNPConfirmed_ResistomeResults (null)       [100%] 1 of 1, failed: 1 ✔
MatteoSchiavinato commented 3 years ago

@VoronDM I think yours is a different issue and should be opened as such. However, from what I could gather, the SNP confirmation is performed without external scripts so that may be a leftover from a previous version.

jannemm commented 3 years ago

Hi,

Yeah, I found the same thing. As the $3="RequiresSNPConfirmation" doesn't filter anything (rather, it assigns "RequiresSNPConfirmation" to $3), all reads from the sample are submitted to RGI leading to quite large runtimes (for me, up to 12 hrs using 40 threads on a 25GB fastq file).

Replacing the awk statement with the following one-liner solves the problem for me: awk '{ if (\$3 ~ "RequiresSNPConfirmation") print ">" \$1 "\\n" \$10 }' ${sam} > ${sample_id}.snp.fasta

I omitted the match for line beginning with @..., as there is no need to filter on header lines - it is only in the reads section of the sam-file that "RequiresSNPConfirmation" is present in field 3 on the line.