nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0

ENH(clades): Validate `clades.tsv`, bad file currently throws uncaught error #1234

Open corneliusroemer opened 1 year ago

corneliusroemer commented 1 year ago

Context

When the `clades.tsv` is malformed (here, the `gene` column could not be found), augur 22.0.1 fails with an uncaught and unhelpful error: `KeyError: 'gene'`

Description

Let's catch this and report a clear error saying that the `clades.tsv` appears to be malformed.

Current traceback:

$ augur clades --tree builds/wuhan/tree.nwk \
    --mutations builds/wuhan/nt_muts.json builds/wuhan/aa_muts.json \
    --clades builds/wuhan/clades_nextstrain.tsv \
    --output-node-data builds/wuhan/clades_nextstrain.tmp
Validating schema of 'builds/wuhan/nt_muts.json'...
Validating schema of 'builds/wuhan/aa_muts.json'...
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/py11/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc
    return self._engine.get_loc(casted_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'gene'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/py11/lib/python3.11/site-packages/augur/__init__.py", line 66, in run
    return args.__command__.run(args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/py11/lib/python3.11/site-packages/augur/clades.py", line 360, in run
    clade_designations = read_in_clade_definitions(args.clades)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/py11/lib/python3.11/site-packages/augur/clades.py", line 68, in read_in_clade_definitions
    clade_inheritance_rows = df[df['gene'] == 'clade']
                                ~~^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/py11/lib/python3.11/site-packages/pandas/core/frame.py", line 3807, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniforge/base/envs/py11/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3804, in get_loc
    raise KeyError(key) from err
KeyError: 'gene'

An error occurred (see above) that has not been properly handled by Augur.
To report this, please open a new issue including the original command and the error above:
    <https://github.com/nextstrain/augur/issues/new/choose>

corneliusroemer commented 1 year ago

Just spent quite some time debugging this yet again. I should really take care of this issue: if it would have saved me time twice, it must be useful for others in the future too.

Even a simple header check would be super useful: at least columns x/y/z must be present, otherwise raise a clear error.
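
A rough sketch of what such a header check could look like, using only the standard library so it stays independent of how `read_in_clade_definitions` loads the file. The required column set (`clade`, `gene`, `site`, `alt`) is an assumption: only `gene` is confirmed by the traceback above. `CladesFileError`, `missing_columns` and `check_clades_header` are hypothetical names, not existing augur functions, and skipping blank and `#` comment lines before the header is also an assumption.

```python
import csv
import io

# Assumption: the columns a clades.tsv must provide. Only 'gene' is
# confirmed by the traceback in this issue; adjust to the real requirement.
REQUIRED_COLUMNS = ("clade", "gene", "site", "alt")


class CladesFileError(ValueError):
    """Hypothetical error type for a malformed clades file."""


def missing_columns(header, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the header."""
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


def check_clades_header(handle, source="clades.tsv"):
    """Read the header line from an open text handle and validate it.

    Blank lines and lines starting with '#' before the header are skipped
    (an assumption about the file format). Returns the header as a list of
    column names, or raises CladesFileError with a readable message.
    """
    for line in handle:
        if line.strip() and not line.startswith("#"):
            # Parse the header line as tab-separated values.
            header = [name.strip() for name in next(csv.reader([line], delimiter="\t"))]
            break
    else:
        raise CladesFileError(
            f"{source}: no header line found, expected tab-separated columns "
            f"{', '.join(REQUIRED_COLUMNS)}"
        )

    missing = missing_columns(header)
    if missing:
        raise CladesFileError(
            f"{source}: missing required column(s) {', '.join(missing)}; "
            f"found {', '.join(header)}. Is the file tab-separated?"
        )
    return header
```

Called before the table is parsed, e.g. `check_clades_header(open(path), source=path)`, this would turn the `KeyError: 'gene'` above into a message that names the file and the missing columns. A comma-separated file would be reported as missing all required columns, which is why the message hints at the delimiter.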