Closed rickyalfaray closed 1 year ago
The nucleotide FASTA files you are using are not genomes where CDSs (coding segments of proteins) are sought, but the CDSs of proteins (so that the sequence identifiers in the protein files and nucleotide files are the same). You should have the original nucleotide sequences which have been used as input for prokka
. For the 1_IND07
set, for example, they should have sequence identifiers like gnl|X|1_IND07_1
. Can you use them in the -n
parameter of amrfinder
?
I also want to run for 1000 WGS strains in parallel.
AMRFinderPlus can only parallelize BLAST and HMMer.
To run many instances of amrfinder
in parallel we use the UGE grid engine.
The goal of the "combined" AMRFinderPlus which takes 3 files as input (protein, nucleotide and GFF) is to get additional genes from nucleotide sequences which are missed in the protein file. But the nucleotide files you are using are CDSs, so that they do not add any new information given the protein sequences (except tRNAs, but they are not AMR). Therefore, if you are limited to the files you have posted, then you can run amrfinder -p <protein file>
ignoring the nucleotide and GFF files to get the same result.
Hi @rickyalfaray,
Slava is right above about AMRFinderPlus taking the nucleotide assembly sequence, protein, and annotation GFF. You just need to use the assembly FASTA instead of the CDS output from PROKKA.
But also I wanted to mention there are many ways to run AMRFinderPlus in parallel. We use a UGE cluster at NCBI, but I've done it on many other clusters, using GNU parallel single-threaded on a machine with lots of cores, or just serially using a shell for loop.
I find I get the most done in parallel (as long as you have plenty of RAM and temporary disk space (which hasn't usually been a limitation) by running single-threaded --threads 1
. Each job usually takes somewhere around 8-10 minutes, but I can then make maximal use of every available core since AMRFinderPlus is generally CPU-bound.
Again there are many ways to do this, so you might need to consult someone who can help you with shell scripting. As a head start here's the simplest thing I can think of, though it will take a while to get through all 1,000 assemblies.
Assuming all your files are named starting with a number that increments similarly to the way the files in the google drive you pointed at do you should be able to run things serially (one-after-another) like this (though remember the -n file should be the assembly, not the CDS file) There are a lot of little details I could be getting wrong so I can't really test this (e.g., paths, the exact format of the filenames, the number of assemblies, etc.) so no guarantees it works as written.
for i in `seq 1 1000`
do
base=`basename gff/${i}_* .gff
amrfinder -n /data/Ricky/Global_ARG_Study/AMR_finder_plus/$base.fasta \
-p /data/Ricky/Global_ARG_Study/AMR_finder_plus/protein/$base.faa \
-g /data/Ricky/Global_ARG_Study/AMR_finder_plus/gff/$base.gff \
--annotation_format prokka --threads 20 --name $base
done | awk 'NR == 1 || !/Protein identifier/ { print }' > combined.amrfinder
Hope that helps, Arjun
Thank you very much @vbrover @evolarjun
I have already tried it again. Now it works!
I'm using the fasta file that @vbrover suggested (apparently, the .fna file from Prokka output is the assembly one. The one with .ffn is the CDS one. So I use the .fna one), then running with gff and faa files serially using a shell for loop as @evolarjun suggested with slight modification.
It was helpful. Thank you!
One suggestion, I hope you can also include H. pylori resistant gene and virulence factor in the future. Thanks ^^
Sincerely, Ricky
@rickyalfaray Thanks for the update and I'm glad it's working for you
I'll bring up H. pylori with our curators. We haven't looked in detail at H. pylori, but we should have pretty good coverage of known acquired resistance genes, especially beta-lactamases. However we have not curated point mutations for H. pylori. If there's something prominent you think we're missing let us know.
I don't know how much mutational mechanisms play into H. pylori resistance, but if you know of papers describing mutational resistance mechanisms (e.g., point mutations) you recommend, I can pass them on to the curation team. I can't make any guarantees as to if/when we'll get to it, but I can bring it up.
Arjun
@evolarjun @vbrover
I have already finished running thousands of WGS of H. pylori.
However, is it possible to get 0 results?
No resistance gene was detected.
When I run by adding the --plus
flag, I got that every strain had copB gene, but this does not belong to the AMR gene.
Is it possible because H. pylori are not in the database? I tried to run also using other tools and databases such as RGI-CARD, ResFam, ResFinder, ARG-ANNOT, etc., (with most of them also not containing H. pylori db), however, by simply using identity 90 and coverage 90, all of them gave the results even though only a little (e.g., each strain should have an MFS efflux system). Could you please give me your opinion?
@evolarjun Thank you, it would be great! The mutation of the antibiotic-targeted genes is believed to be the main reason for the H. pylori resistant mechanism instead of present/absent of resistant gene. Yes, I have one recommendation paper information. Here is the paper: Helicobacter pylori infection and antibiotic resistance — from biology to clinical implications I appreciate your kindness, and I am looking forward to it.
Happy new year!
Ricky
Hello,
Thank you for developing this valuable tool!
I have trouble running .fasta (.fnn) ; .faa ; and .gff file from Prokka output even though I used
--annotation_format prokka
.So, I have 3 of these files from prokka annotation, and I put them into separate folders. When I run:
amrfinder -n /data/Ricky/Global_ARG_Study/AMR_finder_plus/nucleotide/1_IND07.fasta -p /data/Ricky/Global_ARG_Study/AMR_finder_plus/protein/1_IND07.faa -g /data/Ricky/Global_ARG_Study/AMR_finder_plus/gff/1_IND07.gff --annotation_format prokka --threads 20 > /data/Ricky/Global_ARG_Study/AMR_finder_plus/results.tab
The results error is:
It is working well for:
amrfinder --annotation_format prokka -n /data/Ricky/Global_ARG_Study/AMR_finder_plus/nucleotide/1_IND07.fasta --threads 20 > /data/Ricky/Global_ARG_Study/AMR_finder_plus/results.tab
I understand it is because the contig id "gnl|X|1_IND07_1" is not the same as the contig id in the .fasta (.ffn) file, but I don't know how to fix it, especially if we have many WGS strains to run. Could you please help me with what I should do? I have attached the files for reference: google drive
In addition, I also want to run for 1000 WGS strains in parallel. Because I'm unfamiliar with the command line, can you please help me set up the command?
Thank you very much!
Warm regards. Ricky