ncbi / amr

AMRFinderPlus - Identify AMR genes and point mutations, and virulence and stress resistance genes in assembled bacterial nucleotide and protein sequence.
https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder/

BLASTN causing crash/core dump with ~1% of samples (tested on 3.11.2 and 3.11.11) #118

Closed dutchscientist closed 1 year ago

dutchscientist commented 1 year ago

I am running >20k Salmonella genomes through AMRFinder using the "--plus" switch and "-O Salmonella". In about 1% of the samples it crashes once BLASTN starts for the point mutation search; if I run the same sequences without the -O switch, it is fine. I first thought it might be related to long contig names, but after running the genomes through Prokka to rename the contigs, the failures persist.

Below is the output when crashing:

ERROR '/home/username/mambaforge/envs/genotyping/bin/blastn' -query 'Salm000001fna/Salm000048.fna' -db /tmp/amrfinder.4Qo99F/db/AMR_DNA-Salmonella -evalue 1e-20 -dust no -max_target_seqs 10000 -num_threads 2 -mt_mode 1 -outfmt '6 qseqid sseqid qstart qend qlen sstart send slen qseq sseq' -out /tmp/amrfinder.4Qo99F/blastn > /tmp/amrfinder.4Qo99F/log 2> /tmp/amrfinder.4Qo99F/blastn-err status = 35584 Segmentation fault (core dumped)
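A note on the "status = 35584" value in the error above: it looks like a raw wait status of the kind returned by the C system() call, where the exit code sits in the high byte. That reading is an assumption about how amrfinder reports the status, but under it the arithmetic points at SIGSEGV, which agrees with the "Segmentation fault" text. A minimal sketch of the decoding:

```shell
# Decode the raw status from the error message.
# Assumption: it is a wait status as returned by system(),
# so the exit code of the child shell is stored in bits 8-15.
status=35584
exit_code=$(( status >> 8 ))     # 35584 / 256 = 139
signal=$(( exit_code - 128 ))    # shells report "killed by signal N" as 128 + N

echo "exit code: ${exit_code}"
echo "signal:    ${signal} ($(kill -l "${signal}"))"
```

If the same blastn command is run by hand, `echo $?` straight afterwards should show the exit code directly, without the shift.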

Anything that can be done for this? Thanks :)

vbrover commented 1 year ago

If you run

'/home/username/mambaforge/envs/genotyping/bin/blastn' -query 'Salm000001fna/Salm000048.fna' -db /tmp/amrfinder.4Qo99F/db/AMR_DNA-Salmonella -evalue 1e-20 -dust no -max_target_seqs 10000 -num_threads 2 -mt_mode 1 -outfmt '6 qseqid sseqid qstart qend qlen sstart send slen qseq sseq' -out xxx

do you get the same crash?

What are the contents of the files below?

/tmp/amrfinder.4Qo99F/blastn 
/tmp/amrfinder.4Qo99F/log 
/tmp/amrfinder.4Qo99F/blastn-err

And what are the versions of amrfinder and blastn?

vbrover commented 1 year ago

What is the result of these commands?

ls -laF /home/username/mambaforge/envs/genotyping/bin/blastn
ls -laF Salm000001fna/Salm000048.fna
ls -laF /tmp/amrfinder.4Qo99F/db/AMR_DNA-Salmonella
ls -laF /tmp/amrfinder.4Qo99F/blastn 
ls -laF /tmp/amrfinder.4Qo99F/log 
ls -laF /tmp/amrfinder.4Qo99F/blastn-err

dutchscientist commented 1 year ago

Tried it with amrfinder 3.11.11 (Python 3.7) and 3.11.2 (Python 3.10). The BLAST version is 2.13.0+ in both cases, running on Ubuntu 22.04 LTS in two Mamba environments. The database version used is 2023-04-17.1.

With the suggested command line, I still get "Segmentation fault (core dumped)".

blastn: "" (empty, 0 bytes)
log: "" (empty, 0 bytes)
blastn-err: "Segmentation fault (core dumped)"

As said, the weird thing is that it only happens in a minority of genomes.

dutchscientist commented 1 year ago

-rwxrwxr-x 4 vetschool vetschool 276776 Jul 19 2022 /home/vetschool/mambaforge/envs/genomics/bin/blastn*
-rw-rw-r-- 1 vetschool vetschool 5093967 Apr 27 19:06 Salm000001fna/Salm000048.fna
-rw-rw-r-- 1 vetschool vetschool 1612 Apr 21 00:31 /tmp/amrfinder.4Qo99F/db/AMR_DNA-Salmonella
-rw-rw-r-- 1 vetschool vetschool 0 Apr 29 18:03 /tmp/amrfinder.4Qo99F/blastn
-rw-rw-r-- 1 vetschool vetschool 0 Apr 29 18:03 /tmp/amrfinder.4Qo99F/log
-rw-rw-r-- 1 vetschool vetschool 33 Apr 29 18:03 /tmp/amrfinder.4Qo99F/blastn-err

(username = vetschool, genotyping is the env for Python 3.10 which only allows amrfinder 3.11.2, genomics is the env for Python 3.7 which allows amrfinder 3.11.11)

vbrover commented 1 year ago

Since the bug is reproducible, could you post Salm000001fna/Salm000048.fna?

Can you try BLASTN ver. 2.14.0+?

dutchscientist commented 1 year ago

BLAST 2.14 does not seem to be available via Conda/Mamba yet?

The Salm000048.fna file is available at https://drive.google.com/file/d/11JmHcvVhjvgJz1JFxrokD7PIvyjOw3Rv/view?usp=sharing

dutchscientist commented 1 year ago

Salm000048.zip

vbrover commented 1 year ago

I have tried blastn ver. 2.13.0+ and 2.14.0+, and both worked on Salm000048.fna with exit code 0 and an empty output file.

Let's check that the blast database is available. What is the result of this command?

ls -laF /tmp/amrfinder.4Qo99F/db/AMR_DNA-Salmonella*

Is there enough disk space?

vbrover commented 1 year ago

AMR_DNA-Salmonella*

dutchscientist commented 1 year ago

-rw-rw-r-- 1 vetschool vetschool 1612 Apr 26 17:20 /tmp/amrfinder.Bvch0H/db/AMR_DNA-Salmonella
-rw-rw-r-- 1 vetschool vetschool 20480 Apr 26 17:21 /tmp/amrfinder.Bvch0H/db/AMR_DNA-Salmonella.ndb
-rw-rw-r-- 1 vetschool vetschool 117 Apr 26 17:21 /tmp/amrfinder.Bvch0H/db/AMR_DNA-Salmonella.nhr
-rw-rw-r-- 1 vetschool vetschool 160 Apr 26 17:21 /tmp/amrfinder.Bvch0H/db/AMR_DNA-Salmonella.nin
-rw-rw-r-- 1 vetschool vetschool 572 Apr 26 17:21 /tmp/amrfinder.Bvch0H/db/AMR_DNA-Salmonella.njs
-rw-rw-r-- 1 vetschool vetschool 20 Apr 26 17:21 /tmp/amrfinder.Bvch0H/db/AMR_DNA-Salmonella.not
-rw-rw-r-- 1 vetschool vetschool 386 Apr 26 17:21 /tmp/amrfinder.Bvch0H/db/AMR_DNA-Salmonella.nsq
-rw-rw-r-- 1 vetschool vetschool 16384 Apr 26 17:21 /tmp/amrfinder.Bvch0H/db/AMR_DNA-Salmonella.ntf
-rw-rw-r-- 1 vetschool vetschool 8 Apr 26 17:21 /tmp/amrfinder.Bvch0H/db/AMR_DNA-Salmonella.nto
-rw-rw-r-- 1 vetschool vetschool 406 Apr 26 17:20 /tmp/amrfinder.Bvch0H/db/AMR_DNA-Salmonella.tab

I rebooted the computer and now Salm000048.fna did work. I took the next one (Salm000070.fna), which crashes again, hence the different temporary directory name (Bvch0H).

The output of df -h is:

Filesystem         Size  Used Avail Use% Mounted on
tmpfs              4.8G  1.4M  4.8G   1% /run
/dev/sda1          473G  162G  287G  37% /
tmpfs               24G     0   24G   0% /dev/shm
tmpfs              5.0M  4.0K  5.0M   1% /run/lock
virtualbox_shared  7.3T  1.5T  5.9T  20% /media/sf_virtualbox_shared
tmpfs              4.8G  124K  4.8G   1% /run/user/1001

Plenty of space, >250 GB.

(VirtualBox Ubuntu 22.04 LTS virtual machine running on Windows)

dutchscientist commented 1 year ago

And now the next one works after a few tries. This is very irritating!

Thanks very much for your assistance, by the way!

vbrover commented 1 year ago

Your blastn has size 276776 whereas on my computer:

$ ls -laF blast/ncbi-blast-2.13.0+/bin/blastn
-rwxr-xr-x 1 brovervv pathogen 28839896 Feb  2  2022 blast/ncbi-blast-2.13.0+/bin/blastn*

vbrover commented 1 year ago

Are you working on a Windows computer emulating Ubuntu?

dutchscientist commented 1 year ago

I am working on a Windows 10 computer with VirtualBox 7.08, running a virtual Ubuntu 22.04 Linux machine with Conda (Mamba) environments. So it is Linux, not an emulation.

I had a look at https://anaconda.org/bioconda/blast/files, and the BLAST file size looks OK there: blastn is about 270 kB.

vbrover commented 1 year ago

I will pass this issue to those who understand blast better on Monday.

dutchscientist commented 1 year ago

And it works absolutely fine if I leave out -O Salmonella; the only thing I am missing then is the point mutation resistances. But I can do those with PointFinder.

Thanks for your help!

evolarjun commented 1 year ago

I thought it might be due to the blast in bioconda, which applies several minor patches to the blast source, but I wasn't able to reproduce the issue with two versions of blast from bioconda. Still a mystery to me.

vbrover commented 1 year ago

You can download NCBI BLAST from https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ and use the amrfinder option --blast_bin BLAST_DIR.

dutchscientist commented 1 year ago

Thanks! I did notice that each time I re-ran the failed ones, a few would suddenly work (say 5% of the samples), and then I moved to another virtual computer and all the remaining samples worked fine. I am about to do another big batch, so I will try this and report back.
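Since re-running a failed sample sometimes succeeds, a batch script could retry each genome a few times before giving up. The helper below is a hypothetical sketch (the name `retry` and the attempt count are illustrative, not part of amrfinder), and it only works around the crash rather than explaining it.

```shell
# retry MAX CMD [ARGS...]
# Run CMD until it exits with status 0 or MAX attempts have been used.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
    max=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        "$@" && return 0
        echo "attempt $attempt/$max failed (exit $?): $*" >&2
        attempt=$(( attempt + 1 ))
    done
    return 1
}

# Example call (file names are placeholders):
#   retry 3 amrfinder -n Salm000070.fna -O Salmonella --plus -o Salm000070.tsv
```

Logging which samples needed more than one attempt keeps the information that would otherwise be lost by retrying silently.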

dutchscientist commented 1 year ago

I have reformatted the headers and files with SeqFu (https://github.com/telatin/seqfu2):

fu-multirelabel -r genomename -n genome-00001.fna --no-comments > genome00001.fna

(I previously used Prokka-generated FASTA files.)

I still use BLAST+ 2.13.0, but I no longer have dropouts, except for one genome that ran fine when run again. All the previous "problem makers" like Salm000048.fna ran absolutely fine.

The only difference I can see between the Prokka- and SeqFu-generated files is that with Prokka it's 60 bases per line (like GenBank downloads), whereas with SeqFu everything is on a single line, with no line breaks within a contig.

Anyway, I just ran 40k of the 50k Salmonella genomes without a hiccup (still running), so the problem seems to have been resolved. Happy to close it, thanks for the assistance!
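For anyone who wants to test the line-length observation without SeqFu, wrapped FASTA can be unwrapped with plain awk. This is a minimal sketch, not the SeqFu implementation: it only joins sequence lines and leaves the headers untouched. Whether line wrapping is really what triggers the crash is not established here.

```shell
# unwrap_fasta: read FASTA on stdin and write it to stdout with each
# record as one header line followed by one sequence line.
unwrap_fasta() {
    awk '
        /^>/ { if (seq != "") print seq; print; seq = ""; next }
             { seq = seq $0 }
        END  { if (seq != "") print seq }
    '
}

# Example call (file names are placeholders):
#   unwrap_fasta < Salm000048.fna > Salm000048.unwrapped.fna
```

Running the same genome through amrfinder in both wrapped and unwrapped form would show whether the line breaks alone make a difference.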

evolarjun commented 1 year ago

I'm glad you got it working!

Thanks for the clue about line length, and thanks for your patience. We'll take a look and see if we can figure anything out. At least we have a potential fix if we hear of other people having the issue and a clue as to what could be breaking.

Thanks again for reporting and giving us all the details.