build-ganon unable to read taxonomy file(s)

jfy133 commented 7 months ago

From nf-core/createtaxdb

(ignore the 'prot' in the nodes/names.dmp files, it is standard taxonomy :) ).

ERROR ~ Error executing process > 'NFCORE_CREATETAXDB:CREATETAXDB:GANON_BUILDCUSTOM (database)'

Caused by:
  Process `NFCORE_CREATETAXDB:CREATETAXDB:GANON_BUILDCUSTOM (database)` terminated with an error exit status (1)

Command executed:

  ganon \
      build-custom \
      --threads 2 \
      --input sarscov2.fasta haemophilus_influenzae.fna.gz \
      --db-prefix database \
      --taxonomy-files prot_names.dmp prot_nodes.dmp \
       \

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_CREATETAXDB:CREATETAXDB:GANON_BUILDCUSTOM":
      ganon: $(echo $(ganon --version 2>1) | sed 's/.*ganon //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  - - - - - - - - - -
     _  _  _  _  _   
    (_|(_|| |(_)| |  
     _|   v. 2.0.0
  - - - - - - - - - -
  Total valid files: 2

  Parsing ncbi taxonomy
  Traceback (most recent call last):
    File "/usr/local/lib/python3.9/site-packages/multitax/ncbitx.py", line 144, in _parse_nodes
      taxid, parent_taxid, rank, _ = line.split('\t|\t', 3)
  ValueError: not enough values to unpack (expected 4, got 1)

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "/usr/local/bin/ganon", line 33, in <module>
      sys.exit(load_entry_point('ganon==2.0.0', 'console_scripts', 'ganon')())
    File "/usr/local/lib/python3.9/site-packages/ganon/ganon.py", line 53, in main_cli
      sys.exit(0 if main() else 1)
    File "/usr/local/lib/python3.9/site-packages/ganon/ganon.py", line 36, in main
      ret = build_custom(cfg)
    File "/usr/local/lib/python3.9/site-packages/ganon/build_update.py", line 266, in build_custom
      tax = load_taxonomy(cfg, build_output_folder)
    File "/usr/local/lib/python3.9/site-packages/ganon/build_update.py", line 532, in load_taxonomy
      tax = NcbiTx(files=cfg.taxonomy_files)
    File "/usr/local/lib/python3.9/site-packages/multitax/ncbitx.py", line 15, in __init__
      super().__init__(**kwargs)
    File "/usr/local/lib/python3.9/site-packages/multitax/multitax.py", line 88, in __init__
      self._nodes, self._ranks, self._names = self._parse(
    File "/usr/local/lib/python3.9/site-packages/multitax/ncbitx.py", line 99, in _parse
      nodes, ranks = self._parse_nodes(fhs_list[0])
    File "/usr/local/lib/python3.9/site-packages/multitax/ncbitx.py", line 146, in _parse_nodes
      taxid, parent_taxid, rank, _ = line.decode().split('\t|\t', 3)
  AttributeError: 'str' object has no attribute 'decode'

Work dir:
  /home/james/git/nf-core/createtaxdb/testing/work/be/3fd5f53d1012b339f46f1f87ff0b25

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details
Execution cancelled -- Finishing pending tasks before exit
-[nf-core/createtaxdb] Pipeline completed with errors-

The relevant files:

sarscov2 = https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/sarscov2.fasta
influenzae = https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/haemophilus_influenzae.fna.gz
nodesdmp = 'https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/taxonomy/prot_nodes.dmp'
namesdmp = 'https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/taxonomy/prot_names.dmp'

pirovc commented 7 months ago

There are two problems. First, the file order in --taxonomy-files should be: nodes, names [, merged], or a single .tar.gz file containing the standard NCBI file names (nodes.dmp names.dmp merged.dmp). In your example, --taxonomy-files prot_nodes.dmp prot_names.dmp will work. Second, prot_names.dmp contains an unexpected empty line, which causes the parsing error.

I will try to document the taxonomy file ordering better and skip empty lines when parsing the taxonomy in the next release.
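
Until the parser skips empty lines itself, both problems can be worked around before calling ganon. A minimal sketch in Python, assuming the .dmp files are plain, uncompressed text; the helper names `strip_blank_lines` and `taxonomy_args` are hypothetical, and only the --taxonomy-files flag and the nodes-then-names order come from the comment above:

```python
def strip_blank_lines(src, dst):
    """Copy src to dst without empty or whitespace-only lines.

    Returns the number of lines that were dropped.
    """
    dropped = 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if line.strip():
                fout.write(line)
            else:
                dropped += 1
    return dropped


def taxonomy_args(nodes, names):
    """Return the --taxonomy-files arguments in the required order: nodes, then names."""
    return ["--taxonomy-files", nodes, names]
```

The cleaned copies would then be passed in place of the originals, e.g. `taxonomy_args("prot_nodes.dmp", "prot_names.dmp")`.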

pirovc commented 7 months ago

I also noticed that this small example will not build properly. There is no assembly accession information in the file names, which is expected by default. The easiest fix in this case is to generate a file linking each input file to its taxonomic target and use it with --input-file, for example:

haemophilus_influenzae.fna.gz   haemophilus_influenzae  727
sarscov2.fasta  sarscov2    2697049
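
Such a file can also be generated rather than typed by hand. A sketch in Python, assuming tab-separated, header-less columns (file, target, taxid) as in the example above; the function name `write_input_file` is hypothetical and the output name `input-file.txt` is only an illustration:

```python
import csv

# Rows taken from the example above: file, target, NCBI taxid.
ROWS = [
    ("haemophilus_influenzae.fna.gz", "haemophilus_influenzae", 727),
    ("sarscov2.fasta", "sarscov2", 2697049),
]


def write_input_file(rows, path):
    """Write (file, target, taxid) rows as a tab-separated file without a header."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        for file_name, target, taxid in rows:
            writer.writerow([file_name, target, taxid])


if __name__ == "__main__":
    write_input_file(ROWS, "input-file.txt")
```

The resulting file would be passed with --input-file instead of listing the FASTA files with --input.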

pirovc commented 7 months ago

A longer explanation, since this already came up in #277:

Before version 2.0.0, you could use one or more files in any format, and with --input-target sequence ganon would extract the sequence accessions from all files, automatically retrieve the taxonomy and build the database. In the newer versions (>=2.0.0), --input-target sequence is not possible due to the way the new filter is built. However, parsing by sequence still works with the old filter, --filter-type ibf, but you don't get the benefits of the new filter.

That's why creating the --input-file is the best solution for now. You could also rename your inputs to follow the NCBI assembly file name format, e.g. GCF_900478275.1.fna.gz instead of haemophilus_influenzae.fna.gz and GCA_011545545.1.fasta instead of sarscov2.fasta.
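
To see which inputs would need renaming or an --input-file entry, the file names can be checked for an assembly accession. A sketch in Python; the regular expression is an assumption modelled on the GCF_900478275.1 and GCA_011545545.1 examples above, not ganon's actual parsing rule, and `assembly_accession` is a hypothetical helper:

```python
import re

# Assumed shape of an NCBI assembly accession: GCA_ or GCF_, nine digits, a version.
ASSEMBLY_ACCESSION = re.compile(r"GC[AF]_\d{9}\.\d+")


def assembly_accession(file_name):
    """Return the assembly accession found in file_name, or None if there is none."""
    match = ASSEMBLY_ACCESSION.search(file_name)
    return match.group(0) if match else None
```

For the two test files in this issue, `assembly_accession("sarscov2.fasta")` finds nothing, which is consistent with the build not working by default.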

I'm working on a solution for this in ganon2 to bring back the old functionality.

jfy133 commented 7 months ago

Thanks very much @pirovc for the detailed explanations! I will attempt to update the nf-core module and dataset accordingly (where the original failures came from).

I'll let you close this issue when you're ready (e.g. if you want to keep it open until you've updated the documentation).

jfy133 commented 6 months ago

@pirovc ~~I'm working on this, regarding the prot_names.dmp, i don't see any empty line myself when I inspect the file, do you mean the empty column?~~ Ok ignore that, all the viewers I'm using are stripping it without telling me :sweat_smile:

jfy133 commented 6 months ago

I also realise I'm not really following what the columns of the --input-file file are meant to contain...

From here: https://pirovc.github.io/ganon/custom_databases/#non-standardcustom-accessions

I understand I can have 3 or 5 columns:

a fasta filename
a sequence accession in the header of the FASTA (a.k.a a target)
the taxonomy ID
the official species name(?)
the strain ID of the species

I don't think I'm following the terminology of 'target' and 'specialization' based on the information on that page...I al

Following your example above of

I also notice that this small example will not build properly. There are no assembly accession information in the file names, which is expected by default. The easiest in this case is to generate a file linking each file to the taxonomic target and use it with --input-file, example:
haemophilus_influenzae.fna.gz haemophilus_influenzae  727
sarscov2.fasta    sarscov2    2697049

I get the following error:

Parsing --input-file input-file.txt
Traceback (most recent call last):
  File "/home/james/bin/miniconda3/envs/ganon/bin/ganon", line 33, in <module>
    sys.exit(load_entry_point('ganon==2.0.0', 'console_scripts', 'ganon')())
  File "/home/james/bin/miniconda3/envs/ganon/lib/python3.10/site-packages/ganon/ganon.py", line 53, in main_cli
    sys.exit(0 if main() else 1)
  File "/home/james/bin/miniconda3/envs/ganon/lib/python3.10/site-packages/ganon/ganon.py", line 36, in main
    ret = build_custom(cfg)
  File "/home/james/bin/miniconda3/envs/ganon/lib/python3.10/site-packages/ganon/build_update.py", line 269, in build_custom
    info = load_input(cfg, input_files)
  File "/home/james/bin/miniconda3/envs/ganon/lib/python3.10/site-packages/ganon/build_update.py", line 485, in load_input
    info = parse_input_file(cfg.input_file, info, cfg.input_target)
  File "/home/james/bin/miniconda3/envs/ganon/lib/python3.10/site-packages/ganon/build_update.py", line 460, in parse_input_file
    info = pd.read_csv(input_file,
  File "/home/james/.local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/home/james/.local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/james/.local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/james/.local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 611, in _read
    return parser.read(nrows)
  File "/home/james/.local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1778, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/home/james/.local/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 230, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 808, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 890, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 952, in pandas._libs.parsers.TextReader._convert_column_data
pandas.errors.ParserError: Too many columns specified: expected 5 and found 3

Which is confusing: it needs 5, but I've given 3, but there are too many columns?

I also get the same error when I specify a similar file, but with the sequence accession ID here:

genome.fasta    MT192765.1      2697049

If I extend to what I presume is a 5 column file:

genome.fasta    MT192765.1      2697049 Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/PC00101P/2020

Then I get

Building index (raptor)
raptor prepare
The following command failed to run:
/home/james/bin/miniconda3/envs/ganon/bin/raptor prepare --input 'test_files/build/hibf.txt' --output 'test_files/build/' --kmer 19 --window 31 --quiet --threads 2

Error code: 1

jfy133 commented 6 months ago

The test files from the issues above

ganon-buildcustom-problems.zip

pirovc commented 6 months ago

I see that it's a bit confusing: --input-file is used to describe many different kinds of input for ganon build-custom. Don't worry about the sequence headers (MT192765.1), they are not used in file mode. --input-target file tells ganon that each input file is one unit (e.g. multi-fasta files are considered one input, sequence headers ignored).

target here is just a name for your file. You could use the filename itself: genome.fasta <tab> genome.fasta <tab> 2697049. The file (named with target) will be another unique level on the taxonomy below the given node, depending on the parameter --level:

--level file (default) -> use the file (named with target) as a tax. node with 2697049 as parent
--level leaves -> files are grouped by leaf nodes (2697049)
--level species -> files are grouped by species (694009)
--level genus -> ...

Specialization (cols. 4 and 5) is only needed if you want to create a specialized taxonomic level with a custom name, with the option to group files under this node. Example:

genome.fasta <tab> genome.fasta <tab> 2697049 <tab> MyTaxID <tab> MyCustomNameForTheNode
genome2.fasta <tab> genome2.fasta <tab> 2697049 <tab> MyTaxID <tab> MyCustomNameForTheNode

--level custom -> use the specialization (MyTaxID) as a tax. node with 2697049 as parent.
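
The grouping described above can be pictured with a toy model. This Python sketch is only an illustration of the explanation in this comment, not ganon's implementation; `group_files` is a hypothetical function, and the species and genus levels are left out because they would need a taxonomy lookup:

```python
from collections import defaultdict

# Which column of a (file, target, node, specialization) row names the
# database node for each level, following the description above.
LEVEL_COLUMN = {"file": 1, "leaves": 2, "custom": 3}


def group_files(rows, level="file"):
    """Group input files by the node they would be assigned to at the given level."""
    if level not in LEVEL_COLUMN:
        raise ValueError(f"level not covered by this sketch: {level}")
    groups = defaultdict(list)
    for row in rows:
        groups[row[LEVEL_COLUMN[level]]].append(row[0])
    return dict(groups)


ROWS = [
    ("genome.fasta", "genome.fasta", "2697049", "MyTaxID"),
    ("genome2.fasta", "genome2.fasta", "2697049", "MyTaxID"),
]
```

With these rows, the file level keeps each file as its own node, while the leaves and custom levels each put both files under a single node (2697049 and MyTaxID respectively).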

With --input-target sequence, the columns are used slightly differently. I will improve the docs on this part; it can get very confusing with so many options.

pirovc commented 6 months ago

Answering the number of columns, they are all optional, with the following behavior:

file [<tab> target <tab> node <tab> specialization <tab> specialization_name]

1 col.: each file will be parsed independently, target will be filename, no taxonomy
2 cols: same as 1 with target named by second col. but not used
3 cols: same as 2 with taxonomy
4 cols: same as 3 with specialization
5 cols: same as 4 with specialization name
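
The optional-column behaviour above can be sketched as a small parser. This is an illustration in Python of the rules as listed, not ganon's own code (which, per the traceback earlier in this thread, reads the file with pandas); `InputRow` and `parse_input_line` are hypothetical names:

```python
from typing import NamedTuple, Optional


class InputRow(NamedTuple):
    file: str
    target: str
    node: Optional[str] = None
    specialization: Optional[str] = None
    specialization_name: Optional[str] = None


def parse_input_line(line):
    """Parse one tab-separated line with 1 to 5 columns into an InputRow."""
    fields = line.rstrip("\n").split("\t")
    if not 1 <= len(fields) <= 5 or not fields[0]:
        raise ValueError(f"expected 1 to 5 tab-separated columns, got: {line!r}")
    file_name = fields[0]
    # With a single column, the target falls back to the file name.
    target = fields[1] if len(fields) > 1 and fields[1] else file_name
    # Remaining columns (node, specialization, specialization name) are optional.
    optional = [value or None for value in fields[2:]]
    return InputRow(file_name, target, *optional)
```

For example, `parse_input_line("sarscov2.fasta\tsarscov2\t2697049")` yields a row with a node but no specialization.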

~~I could not replicate the bug, the files you sent are incomplete with just the symbolic links.~~ I actually managed to replicate it and a fix is underway with better documentation and --input-target sequence for any filter type.

pirovc commented 6 months ago

Most issues are fixed in v2.1.0. The empty line bug will be fixed later in multitax. Let me know if something is still unclear about the --input-file in the updated docs: https://pirovc.github.io/ganon/custom_databases/ However, if the sequence headers in your files follow the NCBI standard, you can now use --input-target sequence and skip the --input-file creation.

jfy133 commented 6 months ago

Great thank you very much @pirovc ! I will be continuing this on Thursday, will review the docs etc.

jfy133 commented 6 months ago

I saw the bioconda recipe hasn't been updated yet so I didn't continue today. I'm travelling for a month from next week, so will have to provide feedback (if any needed!) then :)

pirovc / ganon

build-ganon unable to read taxonomy file(s) #282