ncbi / amr

AMRFinderPlus - Identify AMR genes and point mutations, and virulence and stress resistance genes in assembled bacterial nucleotide and protein sequence.
https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder/
Other
265 stars 37 forks source link

Batch / Parallel running with .fasta (.ffn) ; .faa ; and .gff from Prokka output #104

Closed rickyalfaray closed 1 year ago

rickyalfaray commented 1 year ago

Hello,

Thank you for developing this valuable tool!

I have trouble running .fasta (.fnn) ; .faa ; and .gff file from Prokka output even though I used --annotation_format prokka.

So, I have 3 of these files from prokka annotation, and I put them into separate folders. When I run: amrfinder -n /data/Ricky/Global_ARG_Study/AMR_finder_plus/nucleotide/1_IND07.fasta -p /data/Ricky/Global_ARG_Study/AMR_finder_plus/protein/1_IND07.faa -g /data/Ricky/Global_ARG_Study/AMR_finder_plus/gff/1_IND07.gff --annotation_format prokka --threads 20 > /data/Ricky/Global_ARG_Study/AMR_finder_plus/results.tab

The results error is:

Running: amrfinder -n /data/Ricky/Global_ARG_Study/AMR_finder_plus/nucleotide/1_IND07.fasta --annotation_format prokka -p /data/Ricky/Global_ARG_Study/AMR_finder_plus/protein/1_IND07.faa -g /data/Ricky/Global_ARG_Study/AMR_finder_plus/gff/1_IND07.gff --threads 20
Software directory: '/data/Ricky/Software/Miniconda/envs/AMR/bin/'
Software version: 3.11.2
Database directory: '/data/Ricky/Software/Miniconda/envs/AMR/share/amrfinderplus/data/2022-10-11.2'
Database version: 2022-10-11.2
AMRFinder combined translated and protein search
  - include -O ORGANISM, --organism ORGANISM option to add mutation searches and suppress common proteins

GFF file mismatch.
*** ERROR ***
gff_check.cpp: GFF contig id "gnl|X|1_IND07_1" is not in the DNA FASTA file

HOSTNAME: expsrv
SHELL: /bin/bash
PWD: /data/Ricky/Global_ARG_Study
PATH: /home/student/Ricky/perl5/perlbrew/bin:/data/Ricky/Software/Miniconda/envs/AMR/bin:/data/Ricky/Software/Miniconda/condabin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/dell/srvadmin/bin
Progam name:  gff_check
Command line: /data/Ricky/Software/Miniconda/envs/AMR/bin/gff_check /data/Ricky/Global_ARG_Study/AMR_finder_plus/gff/1_IND07.gff -gfftype prokka -prot /data/Ricky/Global_ARG_Study/AMR_finder_plus/protein/1_IND07.faa -dna /data/Ricky/Global_ARG_Study/AMR_finder_plus/nucleotide/1_IND07.fasta -log /tmp/amrfinder.evtCyr/log

It is working well for: amrfinder --annotation_format prokka -n /data/Ricky/Global_ARG_Study/AMR_finder_plus/nucleotide/1_IND07.fasta --threads 20 > /data/Ricky/Global_ARG_Study/AMR_finder_plus/results.tab

I understand it is because the contig id "gnl|X|1_IND07_1" is not the same as the contig id in the .fasta (.ffn) file, but I don't know how to fix it, especially if we have many WGS strains to run. Could you please help me with what I should do? I have attached the files for reference: google drive

In addition, I also want to run for 1000 WGS strains in parallel. Because I'm unfamiliar with the command line, can you please help me set up the command?

Thank you very much!

Warm regards. Ricky

vbrover commented 1 year ago

The nucleotide FASTA files you are using are not genomes where CDSs (coding segments of proteins) are sought, but the CDSs of proteins (so that the sequence identifiers in the protein files and nucleotide files are the same). You should have the original nucleotide sequences which have been used as input for prokka. For the 1_IND07 set, for example, they should have sequence identifiers like gnl|X|1_IND07_1. Can you use them in the -n parameter of amrfinder?

I also want to run for 1000 WGS strains in parallel.

AMRFinderPlus can only parallelize BLAST and HMMer. To run many instances of amrfinder in parallel we use the UGE grid engine.

vbrover commented 1 year ago

The goal of the "combined" AMRFinderPlus which takes 3 files as input (protein, nucleotide and GFF) is to get additional genes from nucleotide sequences which are missed in the protein file. But the nucleotide files you are using are CDSs, so that they do not add any new information given the protein sequences (except tRNAs, but they are not AMR). Therefore, if you are limited to the files you have posted, then you can run amrfinder -p <protein file> ignoring the nucleotide and GFF files to get the same result.

evolarjun commented 1 year ago

Hi @rickyalfaray,

Slava is right above about AMRFinderPlus taking the nucleotide assembly sequence, protein, and annotation GFF. You just need to use the assembly FASTA instead of the CDS output from PROKKA.

But also I wanted to mention there are many ways to run AMRFinderPlus in parallel. We use a UGE cluster at NCBI, but I've done it on many other clusters, using GNU parallel single-threaded on a machine with lots of cores, or just serially using a shell for loop.

I find I get the most done in parallel (as long as you have plenty of RAM and temporary disk space (which hasn't usually been a limitation) by running single-threaded --threads 1. Each job usually takes somewhere around 8-10 minutes, but I can then make maximal use of every available core since AMRFinderPlus is generally CPU-bound.

Again there are many ways to do this, so you might need to consult someone who can help you with shell scripting. As a head start here's the simplest thing I can think of, though it will take a while to get through all 1,000 assemblies.

Assuming all your files are named starting with a number that increments similarly to the way the files in the google drive you pointed at do you should be able to run things serially (one-after-another) like this (though remember the -n file should be the assembly, not the CDS file) There are a lot of little details I could be getting wrong so I can't really test this (e.g., paths, the exact format of the filenames, the number of assemblies, etc.) so no guarantees it works as written.

for i in `seq 1 1000`
do
    base=`basename gff/${i}_* .gff
    amrfinder -n /data/Ricky/Global_ARG_Study/AMR_finder_plus/$base.fasta \
        -p /data/Ricky/Global_ARG_Study/AMR_finder_plus/protein/$base.faa \
        -g /data/Ricky/Global_ARG_Study/AMR_finder_plus/gff/$base.gff \
        --annotation_format prokka --threads 20 --name $base
done | awk 'NR == 1 || !/Protein identifier/ { print }' > combined.amrfinder

Hope that helps, Arjun

rickyalfaray commented 1 year ago

Thank you very much @vbrover @evolarjun

I have already tried it again. Now it works!

I'm using the fasta file that @vbrover suggested (apparently, the .fna file from Prokka output is the assembly one. The one with .ffn is the CDS one. So I use the .fna one), then running with gff and faa files serially using a shell for loop as @evolarjun suggested with slight modification.

It was helpful. Thank you!

One suggestion, I hope you can also include H. pylori resistant gene and virulence factor in the future. Thanks ^^

Sincerely, Ricky

evolarjun commented 1 year ago

@rickyalfaray Thanks for the update and I'm glad it's working for you

I'll bring up H. pylori with our curators. We haven't looked in detail at H. pylori, but we should have pretty good coverage of known acquired resistance genes, especially beta-lactamases. However we have not curated point mutations for H. pylori. If there's something prominent you think we're missing let us know.

I don't know how much mutational mechanisms play into H. pylori resistance, but if you know of papers describing mutational resistance mechanisms (e.g., point mutations) you recommend, I can pass them on to the curation team. I can't make any guarantees as to if/when we'll get to it, but I can bring it up.

Arjun

rickyalfaray commented 1 year ago

@evolarjun @vbrover I have already finished running thousands of WGS of H. pylori. However, is it possible to get 0 results? No resistance gene was detected. When I run by adding the --plus flag, I got that every strain had copB gene, but this does not belong to the AMR gene.

Is it possible because H. pylori are not in the database? I tried to run also using other tools and databases such as RGI-CARD, ResFam, ResFinder, ARG-ANNOT, etc., (with most of them also not containing H. pylori db), however, by simply using identity 90 and coverage 90, all of them gave the results even though only a little (e.g., each strain should have an MFS efflux system). Could you please give me your opinion?

@evolarjun Thank you, it would be great! The mutation of the antibiotic-targeted genes is believed to be the main reason for the H. pylori resistant mechanism instead of present/absent of resistant gene. Yes, I have one recommendation paper information. Here is the paper: Helicobacter pylori infection and antibiotic resistance — from biology to clinical implications I appreciate your kindness, and I am looking forward to it.

Happy new year!

Ricky