nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License
182 stars 115 forks source link

Barrnap filtration #714

Closed vaulot closed 6 months ago

vaulot commented 6 months ago

Description of the bug

I am analysing PacBio eukaryotic metabarcodes that cover the 18S, ITS and 28S.

When filtering based on barrnap I find very few ASVs passing the filter because of the low eval assigned to euks. However this is due to the fact that barrnap detects the three genes and that the last line e-value seems to be retained

Here is the output of barrnap for euks

##gff-version 3
0009676eb56185361df109c5aab42d3a    barrnap:0.9 rRNA    1   1218    1.2e-303    +   .   Name=18S_rRNA;product=18S ribosomal RNA (partial);note=aligned only 65 percent of the 18S ribosomal RNA
0009676eb56185361df109c5aab42d3a    barrnap:0.9 rRNA    1413    4701    0   +   .   Name=28S_rRNA;product=28S ribosomal RNA
0009676eb56185361df109c5aab42d3a    barrnap:0.9 rRNA    1419    1570    1.5e-27 +   .   Name=5_8S_rRNA;product=5.8S ribosomal RNA

Now the summary looks like

ASV_ID  arc_eval    bac_eval    euk_eval    mito_eval   eval_method
0009676eb56185361df109c5aab42d3a    4.8e-84 1.7e-124    1.5e-27 1e-17   barrnap:0.9

So that this ASV is filtered out as bacteria since the e-val in this table is higher. However it is the second line of the barrnap output that should be used in this case (the minimum e-value) and the ASVs will be correctly assigned as euk.

Cheers.

Command used and terminal output

No response

Relevant files

No response

System information

No response

d4straub commented 6 months ago

Thanks for the report. As far as I remember, this was implemented having one target gene in mind, not several. With multiple genes the barrnap filtering seems to be not working as intended. This should be either made clear in the documentation or (optimally) fixed for multiple gene regions. Would you be up to that?

vaulot commented 6 months ago

Hi Daniel

I think a simple fix would be to take the lowest e-value. It would be useful for PacBio sequences since more and more cover the whole operon. If you point me to where the change need to be made, I can try to fix it.

d4straub commented 6 months ago

Barrnap gffs are parsed in https://github.com/nf-core/ampliseq/blob/d05e42ea645696bec011fde03ca4e9c26df6cf45/modules/local/barrnapsummary.nf that uses https://github.com/nf-core/ampliseq/blob/d05e42ea645696bec011fde03ca4e9c26df6cf45/bin/summarize_barrnap.py. Maybe that script needs a fix.

d4straub commented 6 months ago

Was fixed in https://github.com/nf-core/ampliseq/pull/722!