pirovc / ganon

ganon2 classifies genomic sequences against large sets of references efficiently, with integrated download and update of databases (refseq/genbank), taxonomic profiling (ncbi/gtdb), binning and hierarchical classification, customized reporting and more
https://pirovc.github.io/ganon/
MIT License

Custom DB #240

Closed Sanrrone closed 1 year ago

Sanrrone commented 1 year ago

Dear all, Along with greeting you: I have hundreds of sequences with no accession ID, since they are de novo assembled, but we know the corresponding taxIDs. How can I build my custom DB without accession IDs? Is it possible to give ganon the names.dmp and nodes.dmp directly? I cannot find the correct parameter for that.

thanks in advance!

pirovc commented 1 year ago

You can do it with ganon build-custom. You'll have to create a tab-separated file (--input-file) with the fields: filepath <tab> sequence header <tab> taxid. Here you can find some examples.

Yes, it's possible to give nodes.dmp and names.dmp directly with --taxonomy-files nodes.dmp names.dmp

Let me know if you have any trouble with it.
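As a minimal sketch of the --input-file format described above: one way to generate the three columns (filepath <tab> sequence header <tab> taxid) from a de novo assembly. The filename asm1.fna, its contents, and the taxid are made up for illustration.

```shell
# Toy assembly with two contigs (illustrative data only)
printf '>contig_1 len=5000\nACGT\n>contig_2\nACGT\n' > asm1.fna
taxid=3000001   # the taxid known for this assembly

# One line per sequence header: filepath <tab> header <tab> taxid.
# sed strips the ">" and anything after the first space in the header.
grep '^>' asm1.fna | sed 's/^>//; s/ .*//' | \
    awk -v f="asm1.fna" -v t="$taxid" '{print f"\t"$1"\t"t}' > ganondbinput.tsv

# ganondbinput.tsv now contains:
# asm1.fna	contig_1	3000001
# asm1.fna	contig_2	3000001
```

With several assemblies, looping this over each fasta/taxid pair and concatenating the results gives the full --input-file.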

Sanrrone commented 1 year ago

The command works. Unfortunately, I'm getting this error:

- - - - - - - - - -
   _  _  _  _  _   
  (_|(_|| |(_)| |  
   _|   v. 1.5.0
- - - - - - - - - -
Parsing ncbi taxonomy
 - done in 0.23s.

Parsing --input-file ganondbinput.tsv
 - 10432883 unique entries
 - done in 21.33s.

Validating taxonomy
 - done in 3.27s.

Downloading and parsing auxiliary files for genome size estimation
 - done in 1.25s.

Estimating genome sizes
 - done in 0.31s.

Building index (ganon-build)
The following command failed to run:
/scratch/project_2007362/software/mambaforge/bin/ganon-build --input-file 'HumGut_files/build/target_info.tsv' --output-file 'HumGut.ibf' --kmer-size 19 --window-size 31 --hash-functions 4 --mode avg --max-fp 0.01  --tmp-output-folder 'HumGut_files/build/' --threads 16  

Error code: -9

What could it mean?
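An exit status of -9 generally means the child process was killed with SIGKILL; on a cluster that is most often the kernel's out-of-memory killer or a scheduler memory limit. A minimal sketch of how that status surfaces, assuming a POSIX shell (the -9 form is how wrappers such as Python's subprocess module report the same event):

```shell
# A process killed by signal 9 (SIGKILL) exits with status 128 + 9 = 137
# as seen from the shell; subprocess wrappers report it as returncode -9.
sleep 30 &
kill -9 $!
wait $! 2>/dev/null
echo "exit status: $?"   # 137 = killed by SIGKILL
```

Checking the scheduler's accounting (e.g. Slurm's sacct for this job ID) or the kernel log for OOM messages usually confirms which limit was hit.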

pirovc commented 1 year ago

Could you please post some lines of your ganondbinput.tsv here so I can debug it?

Sanrrone commented 1 year ago

Sure,

$ tail ganondbinput.tsv 
HumGut.fna  HumGut_1541_123 GUT_GENOME286991_123    3001541
HumGut.fna  HumGut_1541_124 GUT_GENOME286991_124    3001541
HumGut.fna  HumGut_1541_125 GUT_GENOME286991_125    3001541
HumGut.fna  HumGut_1541_126 GUT_GENOME286991_126    3001541
HumGut.fna  HumGut_1541_127 GUT_GENOME286991_127    3001541
HumGut.fna  HumGut_1541_128 GUT_GENOME286991_128    3001541
HumGut.fna  HumGut_1541_129 GUT_GENOME286991_129    3001541
HumGut.fna  HumGut_1541_130 GUT_GENOME286991_130    3001541
HumGut.fna  HumGut_1541_131 GUT_GENOME286991_131    3001541
HumGut.fna  HumGut_1541_132 GUT_GENOME286991_132    3001541

Also, I'm running verbose mode now, just in case.

pirovc commented 1 year ago

It would also be useful to have some lines from HumGut_files/build/target_info.tsv and the command used.

Sanrrone commented 1 year ago

$ tail HumGut_files/build/target_info.tsv
HumGut.fna  HumGut_1541_123 GUT_GENOME286991_123
HumGut.fna  HumGut_1541_124 GUT_GENOME286991_124
HumGut.fna  HumGut_1541_125 GUT_GENOME286991_125
HumGut.fna  HumGut_1541_126 GUT_GENOME286991_126
HumGut.fna  HumGut_1541_127 GUT_GENOME286991_127
HumGut.fna  HumGut_1541_128 GUT_GENOME286991_128
HumGut.fna  HumGut_1541_129 GUT_GENOME286991_129
HumGut.fna  HumGut_1541_130 GUT_GENOME286991_130
HumGut.fna  HumGut_1541_131 GUT_GENOME286991_131
HumGut.fna  HumGut_1541_132 GUT_GENOME286991_132

I'm checking other logs and it seems to be a memory limit issue. I'll try a node with more RAM. Thank you for the help!

pirovc commented 1 year ago

Alright. I would suggest giving the full path of the fasta file HumGut.fna in the ganondbinput.tsv

pirovc commented 1 year ago

Another thing that may be causing the issue: if all your sequences are in one big fasta file, you should use --input-target sequence to tell ganon to build by sequence, not by file. Note that you have to use --restart to overwrite the files already written.

Sanrrone commented 1 year ago

Dear pirovc, after some days of tests, building a database was impossible due to the computing time and the amount of RAM needed. I tried 700GB of RAM on a server which allows a maximum of 3 days for that amount of memory. 700GB is enough, but after 3 days the build was cancelled due to the time limit. Is there a way to reduce the amount of time needed? I was trying to index the Human Gut DB (https://arken.nmbu.no/~larssn/humgut/), which is ~60GB of genomic sequences from the human gut. Feel free to close the issue, since the main question was about the command to build a DB, which we know works :+1: PS: I used 32 cores with no success.

pirovc commented 1 year ago

ganon should not take more than a few hours to build such a database, and 700GB sounds way too much. The only thing I can imagine is that your disk is on a network or is very slow? ganon writes lots of temporary files during the build process, and that could be causing the issue.

I am going to try to replicate this database here to see what else could be wrong. Could you please send me the exact ganon build-custom command used? If by any chance you ran the command with --verbose, you could also paste the output here.

Sanrrone commented 1 year ago

Hi, I don't think the disks are slow; they are made for fast I/O. This is the command:

ganon build-custom --input-file ganondbinput.tsv -d HumGut -p 0.01 --taxonomy-files ncbi_nodes.dmp ncbi_names.dmp -t 32 --input-target sequence --mode faster --restart --verbose

and this the output:

- - - - - - - - - -
   _  _  _  _  _   
  (_|(_|| |(_)| |  
   _|   v. 1.5.0
- - - - - - - - - -
Parsing ncbi taxonomy
 - done in 0.68s.

Parsing --input-file ganondbinput.tsv
 - 10432883 unique entries
 - done in 25.09s.

Validating taxonomy
 - done in 3.19s.

Downloading and parsing auxiliary files for genome size estimation
 - done in 2.91s.

Estimating genome sizes
 - done in 0.31s.

Building index (ganon-build)
----------------------------------------------------------------------
--input-file        HumGut_files/build/target_info.tsv
--output-file       HumGut.ibf
--tmp-output-folder HumGut_files/build/
--max-fp            0.01
--filter-size       0
--kmer-size         19
--window-size       31
--hash-functions    4
--mode              faster
--threads           32
--verbose           1
--quiet             0
----------------------------------------------------------------------
slurmstepd: error: *** JOB 16171183 ON r07c57 CANCELLED AT 2023-04-21T21:24:32 DUE TO TIME LIMIT ***

pirovc commented 1 year ago

I see now: building by sequence really exploded (>10M sequences) since you merged all files into one (HumGut.fna). In this case, you can speed up the build by using the individual reference files directly, so ganon can better parallelize the whole process. Below are the commands I used:

wget http://arken.nmbu.no/~larssn/humgut/HumGut.tar.gz
tar xf HumGut.tar.gz 

wget https://arken.nmbu.no/~larssn/humgut/ncbi_nodes.dmp
wget https://arken.nmbu.no/~larssn/humgut/ncbi_names.dmp
wget http://arken.nmbu.no/~larssn/humgut/HumGut.tsv
tail -n+2 HumGut.tsv | awk -F"\t" '{print "fna/"$21"\t"$1"\t"$2}' > ganondbinput.tsv

head ganondbinput.tsv
    fna/GUT_GENOME080427.fna.gz HumGut_1    3000001
    fna/GUT_GENOME228281.fna.gz HumGut_2    3000002
    fna/GUT_GENOME088923.fna.gz HumGut_3    3000003
    fna/GUT_GENOME069448.fna.gz HumGut_4    3000004
    fna/GUT_GENOME076412.fna.gz HumGut_5    3000005
    fna/GUT_GENOME071646.fna.gz HumGut_6    3000006
    fna/GUT_GENOME072460.fna.gz HumGut_7    3000007
    fna/GUT_GENOME087869.fna.gz HumGut_8    3000008
    fna/GUT_GENOME229217.fna.gz HumGut_9    3000009
    fna/GUT_GENOME086324.fna.gz HumGut_10   3000010

ganon build-custom --input-file ganondbinput.tsv --taxonomy-files ncbi_nodes.dmp ncbi_names.dmp --db-prefix HumGut --threads 32 --max-fp 0.01 --verbose

The build took 20 minutes on my machine with 32 threads and needed around 24GB of RAM (with --mode avg; --mode faster may need a bit more). Note that --mode only affects classification speed when using the database, not the build itself.

Please let me know if that works for you. This is a nice example of a custom database; I will add it to the documentation.
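Before kicking off a long build like this, a quick sanity check on the input file can catch problems early. A hypothetical sketch (demo_input.tsv and its contents are made up for illustration): verify that every referenced fasta exists and every taxid is a plain integer.

```shell
# Toy three-column input with one non-numeric taxid and files that don't
# exist on disk, to show what the checks report (illustrative names only).
printf 'fna/a.fna\tHumGut_1\t3000001\nfna/b.fna\tHumGut_2\tNA\n' > demo_input.tsv

# 1) flag rows whose third column is not a plain integer taxid
awk -F'\t' '$3 !~ /^[0-9]+$/ {print "bad taxid, line " NR ": " $1}' demo_input.tsv

# 2) flag referenced files that do not exist on disk
cut -f1 demo_input.tsv | sort -u | while read -r f; do
    [ -e "$f" ] || echo "missing file: $f"
done
```

For the toy input this reports a bad taxid on line 2 and both files as missing; a clean input file produces no output from either check.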

Sanrrone commented 1 year ago

Amazing, it now finishes in 440 seconds. Thank you very much! Problem completely solved.