qiime2 / q2-composition

BSD 3-Clause "New" or "Revised" License
5 stars 28 forks source link

IMP: add character escaping in formula/group parameters for ancombc #100

Closed lizgehret closed 3 months ago

lizgehret commented 1 year ago

The formula parser within ancombc currently doesn't treat (+-/*) as regular characters, because the formula needs to handle those characters as operators. This can be inconvenient for users who have metadata columns that contain those characters, since those characters will act as term separators and will unintentionally split up metadata column names.

Example from @gregcaporaso:

I have a metadata column called "health-status". If it provide that in my forumla, it treats the - as a minus sign (and fails, because I don't have a column called "health"). the full command i'm trying to run is:

qiime composition ancombc --i-table table-mf10000.qza --m-metadata-file sample-metadata.tsv --p-formula "health-status" --o-differentials health-status-ancombc.qza

Character escaping would be nice to add, but not sure if this is possible through the current formula parser. I am currently looking into this and will aim to add this to our 2022.11 patch release if there is a way to handle with current machinery.

gregcaporaso commented 1 year ago

To make this easier to experiment with I am testing now with the moving pictures tutorial sample metadata and feature table. @ebolyen suggested that escaping with back-ticks might work here. That seems to get further, but is still erroring.

First, without back-ticks:

$ qiime composition ancombc --i-table table.qza --m-metadata-file sample-metadata.tsv --p-formula 'body-site' --o-differentials body-site-ancombc.qza --verbose
Traceback (most recent call last):
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/q2cli/commands.py", line 352, in __call__
    results = action(**arguments)
  File "<decorator-gen-206>", line 2, in ancombc
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/qiime2/sdk/action.py", line 234, in bound_callable
    outputs = self._callable_executor_(scope, callable_args,
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/qiime2/sdk/action.py", line 381, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/q2_composition/_ancombc.py", line 39, in ancombc
    return _ancombc(
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/q2_composition/_ancombc.py", line 104, in _ancombc
    _column_validation(meta, 'formula', term)
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/q2_composition/_ancombc.py", line 76, in _column_validation
    raise ValueError('Value provided in the `%s` parameter was not found'
ValueError: Value provided in the `formula` parameter was not found in any of the metadata columns. Please make sure to only include values that are present within the metadata columns.

 Value that was not found as a metadata column: "body"

Plugin error from composition:

  Value provided in the `formula` parameter was not found in any of the metadata columns. Please make sure to only include values that are present within the metadata columns.

   Value that was not found as a metadata column: "body"

See above for debug info.

And next with back-ticks:

$ qiime composition ancombc --i-table table.qza --m-metadata-file sample-metadata.tsv --p-formula '`body-site`' --o-differentials body-site-ancombc.qza --verbose
/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/qiime2/core/cache.py:435: UserWarning: Your temporary cache was found to be in an inconsistent state. It has been recreated.
  warnings.warn(
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_ancombc.R --inp_abundances_path ./input.biom.tsv --inp_metadata_path ./input.map.txt --formula `body-site` --p_adj_method holm --prv_cut 0.1 --lib_cut 0 --reference_levels  --neg_lb False --tol 1e-05 --max_iter 100 --conserve False --alpha 0.05 --output_loaf /var/folders/jp/_7s6840d2gdf_gkg1291p7x00000gn/T/q2-DataLoafPackageDirFmt-xdnc8oyk

── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.0
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0
✔ readr   2.1.3      ✔ forcats 0.5.2
── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
R version 4.2.2 (2022-10-31)
'ancombc' is deprecated
Use 'ancombc2' instead
`tax_level` is not speficified
No agglomeration will be performed
Otherwise, please speficy `tax_level` by one of the following:
Species
Error in eval(predvars, data, env) : object 'body-site' not found
Calls: ancombc ... model.matrix.default -> model.frame -> model.frame.default -> eval -> eval
7: eval(predvars, data, env)
6: eval(predvars, data, env)
5: model.frame.default(object, data, xlev = xlev)
4: model.frame(object, data, xlev = xlev)
3: model.matrix.default(formula(paste0("~", formula)), data = meta_data)
2: model.matrix(formula(paste0("~", formula)), data = meta_data)
1: ancombc(data = data, formula = formula, p_adj_method = p_adj_method,
       prv_cut = prv_cut, lib_cut = lib_cut, neg_lb = neg_lb, tol = tol,
       max_iter = max_iter, conserve = conserve, alpha = alpha)
Traceback (most recent call last):
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/q2_composition/_ancombc.py", line 153, in _ancombc
    run_commands([cmd])
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/q2_composition/_ancombc.py", line 30, in run_commands
    subprocess.run(cmd, check=True)
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['run_ancombc.R', '--inp_abundances_path', './input.biom.tsv', '--inp_metadata_path', './input.map.txt', '--formula', '`body-site`', '--p_adj_method', 'holm', '--prv_cut', '0.1', '--lib_cut', '0', '--reference_levels', '', '--neg_lb', 'False', '--tol', '1e-05', '--max_iter', '100', '--conserve', 'False', '--alpha', '0.05', '--output_loaf', '/var/folders/jp/_7s6840d2gdf_gkg1291p7x00000gn/T/q2-DataLoafPackageDirFmt-xdnc8oyk']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/q2cli/commands.py", line 352, in __call__
    results = action(**arguments)
  File "<decorator-gen-206>", line 2, in ancombc
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/qiime2/sdk/action.py", line 234, in bound_callable
    outputs = self._callable_executor_(scope, callable_args,
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/qiime2/sdk/action.py", line 381, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/q2_composition/_ancombc.py", line 39, in ancombc
    return _ancombc(
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/q2_composition/_ancombc.py", line 155, in _ancombc
    raise Exception('An error was encountered while running ANCOM-BC'
Exception: An error was encountered while running ANCOM-BC in R (return code 1), please inspect stdout and stderr to learn more.

Plugin error from composition:

  An error was encountered while running ANCOM-BC in R (return code 1), please inspect stdout and stderr to learn more.

See above for debug info.

I get the same error if I don't include the single quotes on the command line, but instead escape the back ticks:

qiime composition ancombc --i-table table.qza --m-metadata-file sample-metadata.tsv --p-formula \`body-site\` --o-differentials body-site-ancombc.qza --verbose
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_ancombc.R --inp_abundances_path ./input.biom.tsv --inp_metadata_path ./input.map.txt --formula `body-site` --p_adj_method holm --prv_cut 0.1 --lib_cut 0 --reference_levels  --neg_lb False --tol 1e-05 --max_iter 100 --conserve False --alpha 0.05 --output_loaf /var/folders/jp/_7s6840d2gdf_gkg1291p7x00000gn/T/q2-DataLoafPackageDirFmt-wiab843a

── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.0
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0
✔ readr   2.1.3      ✔ forcats 0.5.2
── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
R version 4.2.2 (2022-10-31)
'ancombc' is deprecated
Use 'ancombc2' instead
`tax_level` is not speficified
No agglomeration will be performed
Otherwise, please speficy `tax_level` by one of the following:
Species
Error in eval(predvars, data, env) : object 'body-site' not found
Calls: ancombc ... model.matrix.default -> model.frame -> model.frame.default -> eval -> eval
7: eval(predvars, data, env)
6: eval(predvars, data, env)
5: model.frame.default(object, data, xlev = xlev)
4: model.frame(object, data, xlev = xlev)
3: model.matrix.default(formula(paste0("~", formula)), data = meta_data)
2: model.matrix(formula(paste0("~", formula)), data = meta_data)
1: ancombc(data = data, formula = formula, p_adj_method = p_adj_method,
       prv_cut = prv_cut, lib_cut = lib_cut, neg_lb = neg_lb, tol = tol,
       max_iter = max_iter, conserve = conserve, alpha = alpha)
Traceback (most recent call last):
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/q2_composition/_ancombc.py", line 153, in _ancombc
    run_commands([cmd])
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/q2_composition/_ancombc.py", line 30, in run_commands
    subprocess.run(cmd, check=True)
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['run_ancombc.R', '--inp_abundances_path', './input.biom.tsv', '--inp_metadata_path', './input.map.txt', '--formula', '`body-site`', '--p_adj_method', 'holm', '--prv_cut', '0.1', '--lib_cut', '0', '--reference_levels', '', '--neg_lb', 'False', '--tol', '1e-05', '--max_iter', '100', '--conserve', 'False', '--alpha', '0.05', '--output_loaf', '/var/folders/jp/_7s6840d2gdf_gkg1291p7x00000gn/T/q2-DataLoafPackageDirFmt-wiab843a']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/q2cli/commands.py", line 352, in __call__
    results = action(**arguments)
  File "<decorator-gen-206>", line 2, in ancombc
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/qiime2/sdk/action.py", line 234, in bound_callable
    outputs = self._callable_executor_(scope, callable_args,
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/qiime2/sdk/action.py", line 381, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/q2_composition/_ancombc.py", line 39, in ancombc
    return _ancombc(
  File "/Users/gregcaporaso/miniconda3/envs/qiime2-2022.11/lib/python3.8/site-packages/q2_composition/_ancombc.py", line 155, in _ancombc
    raise Exception('An error was encountered while running ANCOM-BC'
Exception: An error was encountered while running ANCOM-BC in R (return code 1), please inspect stdout and stderr to learn more.

Plugin error from composition:

  An error was encountered while running ANCOM-BC in R (return code 1), please inspect stdout and stderr to learn more.

See above for debug info.

Happy to try something else out - just let me know if I can do anything to help!

(EDIT: Also, I'm not blocked on this as I'm just changing the - to an _ in my metadata file, which is an easy work-around.)

lizgehret commented 1 year ago

Hey @gregcaporaso thanks for this detailed troubleshooting! I did some testing on my end, and I'm seeing the same object 'body-site' not found error that you are when attempting to use back ticks to escape a column containing -. I talked with @ebolyen about this today, and formulaic should allow for special character escaping with back ticks (referenced in the docs here). I'll keep playing around with the formatting and see if I can find something that works!

ebolyen commented 1 year ago

Worth noting that the escaping did seem to work as the error is now: Error in eval(predvars, data, env) : object 'body-site' not found

Which means it passed through our check for metadata, which also means that the column should exist. So something else weird must be going on.

lizgehret commented 1 year ago

Adding notes here for future reference:

The issue is originating from Huang's implementation of ancombc, and occurs because of a default behavior within R. When the phyloseq object is handed into ancombc (or any accepted data input, such as a TreeSummarizedExperiment, etc), the columns containing special characters are replaced with '.' (e.g. so body-site would become body.site). This ends up causing a situation where even though we've successfully character escaped within the formula parameter, ancombc can't find the column name in question (i.e. body-site != body.site). There isn't an easy way to get around this on our end.

A couple of options that @ebolyen and I discussed:

I could submit a PR to Huang's repo, but this would have to be something that he is okay with (and the R user base), since this would require over-riding the default behavior in R of changing special characters to '.'

Another option would be to modify the output on our end, so that we are also changing special characters to '.' - this could be problematic if someone has multiple special characters in their columns though (e.g. 'body-site' and 'body+site' as two separate columns would end up with the same output - 'body.site').

gregcaporaso commented 3 months ago

We're going to consider this a known issue related to interfacing with R, and we're not going to try to fix it.