pirovc / ganon

ganon2 classifies genomic sequences against large sets of references efficiently, with integrated download and update of databases (refseq/genbank), taxonomic profiling (ncbi/gtdb), binning and hierarchical classification, customized reporting and more
https://pirovc.github.io/ganon/
MIT License

UniVec_Core build failed - skipping entries without valid taxonomic nodes #277

Closed ksavhughes closed 7 months ago

ksavhughes commented 7 months ago

Hi, I'm trying to build a UniVec_Core database, but I keep running into an "Unable to match taxonomy to targets" error. I also tried using taxID 28384 for all the UniVec_Core sequences, just to see if that would work (same issue).

I downloaded the FASTA file and then made an input file before running the build command below:

grep -o '^>[^ ]*' Univec_Core.fasta | sed 's/^>//' | awk '{print "Univec_Core.fasta\t"$1"\t81077"}' > Univec_Core_ganon_input_file.tsv

ganon build-custom -t 20 -n $db/refs/"${cat}"/"${cat}"_ganon_input_file.tsv -d $db/"${cat}"_k19 --level species -k 19 -w 31 -s 4 -v hibf -p 0.001
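Before a long build, a quick sanity check of the generated input file can rule out formatting problems. The following is only a sketch under stated assumptions: it runs the same grep/sed/awk pipeline on a made-up toy FASTA (the `TOY_*` headers and temporary paths are hypothetical), then checks that every row has exactly three tab-separated columns and that the file named in the first column exists. It cannot detect a taxonomy mismatch; it only checks the file layout.

```shell
#!/usr/bin/env bash
set -euo pipefail

work=$(mktemp -d)
# Toy FASTA with made-up headers, standing in for the real download.
printf '%s\n' '>TOY_A.1 vector one' 'ACGT' '>TOY_B.1 vector two' 'GGCC' > "$work/toy.fasta"

# Same pipeline shape as above: header ID -> "file <TAB> target <TAB> taxid".
grep -o '^>[^ ]*' "$work/toy.fasta" | sed 's/^>//' \
  | awk -v f="$work/toy.fasta" '{print f "\t" $1 "\t81077"}' > "$work/input.tsv"

# Count rows that do not have exactly three tab-separated fields.
bad_cols=$(awk -F'\t' 'NF != 3' "$work/input.tsv" | wc -l | tr -d ' ')

# Count rows whose first column does not point to an existing file.
missing=0
while IFS=$'\t' read -r file target taxid; do
  [ -f "$file" ] || missing=$((missing + 1))
done < "$work/input.tsv"

rows=$(wc -l < "$work/input.tsv" | tr -d ' ')
echo "rows=$rows bad_cols=$bad_cols missing_files=$missing"
rm -rf "$work"
```

On a real input file, any non-zero count would be worth fixing before calling ganon build-custom.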

Error file:

Downloading and parsing ncbi taxonomy
Parsing --input-file /ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/refs/Univec_Core/Univec_Core_ganon_input_file.tsv
Validating taxonomy
ERROR: Unable to match taxonomy to targets
Total elapsed time: 19.32 seconds.

ksavhughes commented 7 months ago

I've also been having some issues building the RefSeq mitochondrion, plasmid, and plastid databases.

mitochondrion:

1 valid file(s) [--input-extension fna.gz, --input-recursive] found in /ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/refs/mitochondrion
Total valid files: 1

Downloading and parsing ncbi taxonomy

Parsing sequences from --input (1 files)

Retrieving sequence information from NCBI e-utils

Validating taxonomy

Downloading and parsing auxiliary files for genome size estimation

Estimating genome sizes

Building index (raptor)
raptor prepare
============= Timings =============
Wall clock time [s]: 7044.54
Peak memory usage [GiB]: 90.9
Compute minimiser [s]: 5924.81
Write minimiser files [s]: 206.09
Write header files [s]: 1.41

raptor layout

raptor build
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
The following command failed to run:
/home/karsav1511/.conda/envs/ganon/bin/raptor build --output '/ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/mitochondrion_k19.hibf' --threads 80 --input '/ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/mitochondrion_k19_files/build/raptor_layout.binning.out'

Error code: -6

Plasmid and plastid:

10 valid file(s) [--input-extension fna.gz, --input-recursive] found in /ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/refs/plasmid
Total valid files: 10

Downloading and parsing ncbi taxonomy

Parsing --input (10 files)

Downloading assembly_summary files

Parsing assembly_summary files

Validating taxonomy

ERROR: Unable to match taxonomy to targets
Total elapsed time: 123.78 seconds.


I made an input tsv file for the plasmid and plastid files to fix the taxonomy problem, but now it comes back with the same error as the mitochondrion build (and I use --restart in all the build commands now).

Downloading and parsing ncbi taxonomy

Parsing --input-file /ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/refs/plastid/plastid_ganon_input_file.tsv

Validating taxonomy

Downloading and parsing auxiliary files for genome size estimation

Estimating genome sizes

Building index (raptor)
raptor prepare
terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error'
what(): filesystem error: cannot get file size: No such file or directory [/ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/plastid_k19_files/build/5854.fna]
The following command failed to run:
/home/karsav1511/.conda/envs/ganon/bin/raptor prepare --input '/ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/plastid_k19_files/build/hibf.txt' --output '/ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/plastid_k19_files/build/' --kmer 19 --window 31 --threads 20

Error code: -6

pirovc commented 7 months ago

Regarding the UniVec database, the examples in the documentation are indeed outdated. Use the following commands to build it:

echo -e "UniVec_Core.fasta\tUniVec_Core\t81077" > UniVec_Core_ganon_input_file.tsv
ganon build-custom --input-file UniVec_Core_ganon_input_file.tsv --db-prefix UniVec_Core --level leaves

--level species won't work because the taxid 81077 does not have species in the lineage.
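To illustrate why a rank-based level can fail, here is a minimal sketch that walks a lineage upward and reports whether any node carries the requested rank. It is not how ganon validates taxonomy internally. The table is a toy: the parent chain shown for 81077 (via 28384 to the root, all "no rank") is an assumption for illustration, and the 9000xx entries are made up for contrast.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy taxonomy table: taxid <TAB> parent taxid <TAB> rank.
# Illustrative only; the 9000xx entries are hypothetical.
nodes=$(mktemp)
printf '%s\t%s\t%s\n' \
  1      1      "no rank" \
  28384  1      "no rank" \
  81077  28384  "no rank" \
  900000 1      "species" \
  900001 900000 "no rank" > "$nodes"

# has_rank TAXID RANK: follow parent links up to the root and report
# whether any node in the lineage carries the requested rank.
has_rank() {
  awk -F'\t' -v id="$1" -v want="$2" '
    { parent[$1] = $2; rank[$1] = $3 }
    END {
      while (1) {
        if (!(id in parent)) { print "unknown"; exit }
        if (rank[id] == want) { print "yes"; exit }
        if (parent[id] == id) { print "no"; exit }   # reached the root
        id = parent[id]
      }
    }' "$nodes"
}

r_univec=$(has_rank 81077 species)
r_toy=$(has_rank 900001 species)
echo "81077 has species in lineage: $r_univec"
echo "900001 has species in lineage: $r_toy"
rm -f "$nodes"
```

In this toy table no node above 81077 has the rank species, so there is nothing for --level species to group on, whereas --level leaves uses the assigned node directly.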

pirovc commented 7 months ago

The same goes for the plasmid, plastid and mitochondrion. Use the following after downloading the files:

mkdir sequences

zcat plasmid.* plastid.* mitochondrion.* | awk '$0 ~ ">" {accver=(substr($1,2)); print accver}{print $0 > "sequences/"accver".fna"}' | ganon-get-seq-info.sh -e -i - | awk '{print "sequences/"$1".fna\t"$1"\t"$3}' > ppm.tsv

ganon build-custom --input-file ppm.tsv --db-prefix ppm --level species --threads 20

rm -rf sequences

You could also build them separately; just change the zcat command to select the files you need.
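The splitting step in the zcat/awk pipeline above can be tried in isolation. This sketch runs a lightly adapted version of the awk program on a made-up two-record FASTA (the `TOY_*` accessions and temporary paths are hypothetical). As an assumption for portability, it stores the output path in a variable and closes each file before opening the next, which the original one-liner does not do. The taxid lookup via ganon-get-seq-info.sh is left out because it needs network access.

```shell
#!/usr/bin/env bash
set -euo pipefail

work=$(mktemp -d)
mkdir "$work/sequences"

# Two made-up records standing in for the decompressed RefSeq files.
printf '%s\n' \
  '>TOY_000001.1 hypothetical plasmid record' \
  'ACGTACGTAC' \
  'GGTTAACC' \
  '>TOY_000002.1 hypothetical plastid record' \
  'TTGACA' > "$work/toy.fna"

# On each header line, derive the accession.version from the first
# field (dropping ">"), switch the output file, and print the accession.
# Every line, header included, is then written to the current file.
awk -v dir="$work/sequences" '
  /^>/ {
    if (out != "") close(out)          # avoid keeping many files open
    accver = substr($1, 2)
    out = dir "/" accver ".fna"
    print accver
  }
  { print $0 > out }
' "$work/toy.fna" > "$work/accessions.txt"

n_files=$(ls "$work/sequences" | wc -l | tr -d ' ')
first_acc=$(head -n 1 "$work/accessions.txt")
second_len=$(grep -vc '^>' "$work/sequences/TOY_000002.1.fna")
echo "files: $n_files, first accession: $first_acc"
rm -rf "$work"
```

Each sequence ends up in its own file named after its accession, which is the one-file-per-target layout the ppm.tsv input file describes.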

~The general issue is that --input-target sequence does not yet work with --filter-type hibf, meaning that the input has to be split into one file per sequence to be used in the build process, not provided in bulk like the plasmid sequences. For now I will update the documentation with the new commands, but I hope to implement parsing by sequence with hibf soon.~

Fixed in v2.1.0 #285, this works now:

# Download sequence files
wget -A genomic.fna.gz -m -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plasmid/"
wget -A genomic.fna.gz -m -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plastid/"
wget -A genomic.fna.gz -m -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/mitochondrion/"

ganon build-custom --input plasmid.* plastid.* mitochondrion.* --db-prefix ppm --level species --threads 8 --input-target sequence

pirovc commented 7 months ago

Documentation updated in v2.0.1 #281