Question about ariba summary

ssjunnebo commented 6 years ago

Query from RT:

I don’t understand the summary that ariba is producing, where what look like good hits get discarded. If I look at e.g. ariba_BBVQ01_ariba_eff_out/report.tsv

the hits for cluster 38 (espF) look fine. When I then use the sequence for the cluster 38 from

zless ariba_BBVQ01_ariba_eff_out/assemblies.fa.gz

and run a blastx, this readily recovers the espF protein

https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Get&RID=U4BJE4FS01N

However, the short summary shows it as “no”, and the longer summary does not list it as assembled (not even as a fragment). Could you please let me know why this is not counted as a hit?
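For anyone repeating the check described above, here is a minimal shell sketch for pulling the assembled sequences of a single cluster out of assemblies.fa.gz before running blastx. The function name `extract_fasta_records` is made up for this example, and the sketch assumes the FASTA headers contain the cluster name followed by a dot (for example `cluster_38.`), which may not match every ARIBA version.

```shell
# Sketch only: print every FASTA record whose header line contains the
# given text. The function name and the header naming scheme are assumptions.
extract_fasta_records() {
    pattern="$1"
    fasta_gz="$2"
    # Read via stdin so the file does not need a .gz suffix.
    gzip -dc < "$fasta_gz" | awk -v pat="$pattern" '
        /^>/ { keep = (index($0, pat) > 0) }   # decide per record at each header
        keep { print }                         # print header and sequence lines
    '
}
```

A possible call would be `extract_fasta_records "cluster_38." ariba_BBVQ01_ariba_eff_out/assemblies.fa.gz > cluster_38.fa`. Including the trailing dot in the pattern avoids also matching names that merely start with the same digits.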

sgonzalezbodi commented 4 years ago

Dear all,

We have observed similar results in our analysis with the CARD and VFDB databases. The values in the single-sample report look correct. However, these hits do not appear in the summary csv. Does anyone have any idea how to explain this?

Regards

martinghunt commented 4 years ago

I can try to debug. Could you please provide example data, the command(s) run to reproduce the output, and version of ariba you're using?

sgonzalezbodi commented 4 years ago

Dear Martin,

First of all, many thanks for your kind help. I will try to describe the analysis that we are carrying out.

We work at the microbiology department of the Hospital Universitario 12 de Octubre in Madrid (Spain). Our research focuses on the analysis of pathogens responsible for nosocomial infections. We are currently studying different strains of Staphylococcus aureus and Pseudomonas aeruginosa involved in hospital and clinical care. For this purpose, we ran ariba with the VFDB core, CARD and plasmidfinder databases, using the scripts below at each step:

ARIBA version: 2.14.5

1. Get reference data

cat ../ariba_databases.txt | xargs -I % echo "mkdir -p %;ariba getref % %/out.%" > _08_getref.sh

2. Prepare reference data for ARIBA

cat ../ariba_databases.txt | xargs -I % echo "mkdir -p %;ariba prepareref -f ../getref/%/out.%.fa -m ../getref/%/out.%.tsv %/out.%.prepareref" > _09_prepareref.sh

3. Run the ARIBA local assembly pipeline

for database in $(cat ../ariba_databases.txt)
do
    echo "$database"
    cat samples_id.txt | xargs -I % echo "ariba run ../prepareref/"$database"/out."$database".prepareref ../../02-preprocessig/%/%_R1.clean_qc_pair.fastq.gz ../../02-preprocessig/%/%_R2.clean_qc_pair.fastq.gz %/%."$database".out.run">> _09_ariba_run.sh
done

4. Summarise the results from one or more runs of ARIBA

for database in $(cat ariba_databases.txt)
do
    mkdir -p $database
    tmp=$(cat samples_id.txt | xargs -I % echo ../run/%/%.$database.out.run/report.tsv | tr "\n" " ")
    echo ariba summary $database/out.summary $tmp >>11_ariba_summary.sh
done

The issue we have observed is that, in the summary files, important targets in both pathogens are missing or reported as absent even though they have more than 90% identity and more than 90% coverage relative to the reference. However, these targets are present in the single report of each sample. For this reason, we have developed a script to create a summary that includes all the targets that meet these requirements.

A specific example from Pseudomonas is the oprD gene: as you can see in the sample report, it has 95% identity and 100% coverage. Likewise, the exoY factor has 99% identity and 100% coverage.
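To make the threshold check described above concrete, here is a minimal shell sketch that lists the clusters in one report.tsv passing identity and coverage cut-offs. It is neither the attached files.py nor ariba's own summary logic. The function name `filter_report` is hypothetical, and the sketch assumes the report header contains columns named `cluster`, `ref_name`, `ref_len`, `ref_base_assembled` and `pc_ident`, with coverage taken as `ref_base_assembled / ref_len`.

```shell
# Sketch only: list clusters in one ARIBA report.tsv that pass identity and
# coverage thresholds. Columns are looked up by header name (assumed names);
# the 90/90 defaults mirror the thresholds quoted above, not ariba's own rules.
filter_report() {
    report="$1"
    min_id="${2:-90}"
    min_cov="${3:-90}"
    awk -F '\t' -v min_id="$min_id" -v min_cov="$min_cov" '
        NR == 1 {
            # Map each header name to its column index.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("cluster" in col) || !("ref_name" in col) || !("ref_len" in col) ||
                !("ref_base_assembled" in col) || !("pc_ident" in col)) {
                print "missing expected column in header" > "/dev/stderr"
                exit 1
            }
            next
        }
        {
            len = $(col["ref_len"]) + 0
            if (len <= 0) next
            cov = 100 * $(col["ref_base_assembled"]) / len
            ident = $(col["pc_ident"]) + 0
            key = $(col["cluster"]) "\t" $(col["ref_name"])
            # A cluster can span several rows (one per variant); report it once.
            if (ident >= min_id && cov >= min_cov && !(key in seen)) {
                seen[key] = 1
                printf "%s\t%.2f\t%.2f\n", key, ident, cov
            }
        }
    ' "$report"
}
```

For example, `filter_report single_report_ariba.tsv 90 90` would print one tab-separated line per passing cluster (cluster, reference name, percent identity, percent coverage), which can then be compared against the columns that appear in the summary csv.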

Please, find attached the following files:

single_report_ariba.tsv: Example of a CARD report from one sample. You can check the presence of the oprD gene and its parameters.

single_report_vfdb.tsv: Example of a VFDB report from one sample. You can check the presence of the exoY factor and its parameters.

out.summary_CARD_ariba.csv / out.summary_VFDB_ariba.csv: Summary files produced by the ariba summary task for both databases.

ariba_summary_CARD_all_summary _presence.csv and ariba_summary_VFDB_all_summary _presence.csv: the CARD and VFDB summary files created with our script.

files.py: script to create our own summary files in both databases.

Could you please let me know which parameters ariba uses to select or discard each gene in the summary file?

Kind regards

Sara


sanger-pathogens / ariba

Question about ariba summary #246