openpipelines-bio / openpipeline

https://openpipelines.bio
MIT License

filter_with_hvg_process returns KeyError for obs_batch_key #300

Closed PoGibas closed 1 year ago

PoGibas commented 1 year ago

The test script pbmc_1k_protein_v3.sh fails in the "run multisample" part with a KeyError: 'sample_id' error.

I was trying to run the example test code: bash workflows/resources_test_scripts/pbmc_1k_protein_v3.sh. It executed the download, "Convert h5mu to h5ad", and "run single sample" parts, but fails at the filter_with_hvg_process step.

[4c/f82d3a] process > run_wf:concat:concat_process (pbmc_1k_protein_v3_ums)                   [100%] 1 of 1, cached: 1 ✔
[30/dca996] process > run_wf:normalize_total:normalize_total_process (pbmc_1k_protein_v3_ums) [100%] 1 of 1, cached: 1 ✔
[c9/d696b2] process > run_wf:log1p:log1p_process (pbmc_1k_protein_v3_ums)                     [100%] 1 of 1, cached: 1 ✔
[1f/707a0a] process > run_wf:delete_layer:delete_layer_process (pbmc_1k_protein_v3_ums)       [100%] 1 of 1, cached: 1 ✔
[c9/d96f59] process > run_wf:filter_with_hvg:filter_with_hvg_process (pbmc_1k_protein_v3_ums) [  0%] 0 of 1

Below is the output before the failure:

Command output:
  2023-02-21 12:18:42,193 INFO     Processing modality 'rna'
  2023-02-21 12:18:42,195 INFO      Unfiltered data: AnnData object with n_obs × n_vars = 713 × 33538
      obs: 'filter_with_counts', 'scrublet_doublet_score', 'filter_with_scrublet'
      var: 'gene_symbol', 'feature_types', 'genome', 'filter_with_counts'
      uns: 'log1p'
      layers: 'log_normalized'
  2023-02-21 12:18:42,196 INFO      Computing hvg

It seems that the error comes from the obs_batch_key optional parameter that gets passed on to scanpy:

  optional_parameters = {
      "max_disp": "max_disp",
      "obs_batch_key": "batch_key",
      "n_top_genes": "n_top_genes"
  }
  # only add parameter if it's passed
  for par_name, dest_name in optional_parameters.items():
      if par.get(par_name):
          hvg_args[dest_name] = par[par_name]

...
...

  try:
      out = sc.pp.highly_variable_genes(**hvg_args)

What could be a solution for this issue? Could it be related to this scanpy issue (https://github.com/scverse/scanpy/issues/2396)?
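
For context, here is a minimal standalone sketch of how a batch_key pointing to a column that is absent from .obs triggers this KeyError in scanpy. This is my own toy example, not pipeline code; the random AnnData and the 'sample_id' column name are made up for illustration:

  import anndata as ad
  import numpy as np
  import scanpy as sc

  # toy AnnData that deliberately has no 'sample_id' column in .obs
  adata = ad.AnnData(X=np.random.poisson(1.0, size=(100, 50)).astype(np.float32))
  sc.pp.normalize_total(adata)
  sc.pp.log1p(adata)

  try:
      # mirrors the failing call: batch_key references a missing .obs column
      sc.pp.highly_variable_genes(adata, batch_key="sample_id")
  except KeyError as err:
      print("KeyError:", err)  # KeyError: 'sample_id'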

Full output is:

Command exit status:
  1

Command output:
  2023-02-21 12:18:42,193 INFO     Processing modality 'rna'
  2023-02-21 12:18:42,195 INFO      Unfiltered data: AnnData object with n_obs × n_vars = 713 × 33538
      obs: 'filter_with_counts', 'scrublet_doublet_score', 'filter_with_scrublet'
      var: 'gene_symbol', 'feature_types', 'genome', 'filter_with_counts'
      uns: 'log1p'
      layers: 'log_normalized'
  2023-02-21 12:18:42,196 INFO      Computing hvg

Command error:
  WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
  Traceback (most recent call last):
    File "/usr/local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3802, in get_loc
      return self._engine.get_loc(casted_key)
    File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
    File "pandas/_libs/index.pyx", line 165, in pandas._libs.index.IndexEngine.get_loc
    File "pandas/_libs/hashtable_class_helper.pxi", line 5745, in pandas._libs.hashtable.PyObjectHashTable.get_item
    File "pandas/_libs/hashtable_class_helper.pxi", line 5753, in pandas._libs.hashtable.PyObjectHashTable.get_item
  KeyError: 'sample_id'

  The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File ".viash_script.sh", line 114, in <module>
      out = sc.pp.highly_variable_genes(**hvg_args)
    File "/usr/local/lib/python3.8/site-packages/scanpy/preprocessing/_highly_variable_genes.py", line 456, in highly_variable_genes
      batches = adata.obs[batch_key].cat.categories
    File "/usr/local/lib/python3.8/site-packages/pandas/core/frame.py", line 3807, in __getitem__
      indexer = self.columns.get_loc(key)
    File "/usr/local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3804, in get_loc
      raise KeyError(key) from err
  KeyError: 'sample_id'
DriesSchaumont commented 1 year ago

Hi @PoGibas, Thanks for your interest in OpenPipelines and reaching out! Unfortunately, the script that you are trying to use is not really meant to be an example of how to use OpenPipelines per se. In fact, these are scripts that were used to generate test data for the automatic testing of our source code. Some of these scripts were created a long time ago (with older versions of components/workflows), executed once, and the resulting data was stored on our S3 bucket. This should be documented somewhere, and we are currently working on improving the quality (and quantity) of our documentation. We should also take care in the future to document which versions of the components were used in these scripts (and/or dockerize them) so that we can make these scripts reproducible.

That being said, I believe that the issue you encountered is a bug (or at least a logic error) in the pipelines. The filter_with_hvg component expects to retrieve a column in .obs of the MuData input object that holds the sample id. This column is added for the full_pipeline (by the add_id component), but not for subpipelines like rna_multisample. It seems that this column is missing, even though it should have been added by a previous component. I will tag this as a bug and open a PR.

A workaround in the script would be:

# Add sample_id
python <<HEREDOC
import mudata as mu
h5mu_data = mu.read_h5mu("${OUT}_uss.h5mu")
# the column filter_with_hvg looks up lives in .obs, not .var
h5mu_data.obs['sample_id'] = 'pbmc_1k_protein_v3'
h5mu_data.write("${OUT}_uss_with_sample.h5mu")
HEREDOC

# run multisample
NXF_VER=21.10.6 nextflow \
  run . \
  -main-script workflows/multiomics/rna_multisample/main.nf \
  -profile docker \
  --id pbmc_1k_protein_v3_ums \
  --input "${OUT}_uss_with_sample.h5mu" \
  --output "`basename $OUT`_ums.h5mu" \
  --publishDir `dirname $OUT` \
  -resume
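
As a quick sanity check after the first step (a hypothetical snippet, assuming the same ${OUT} path as above), you can verify that the rewritten file carries the column before launching the multisample workflow:

# Check that the sample_id column is present
python <<HEREDOC
import mudata as mu

# the rewritten file should now carry 'sample_id' in its global .obs,
# which is the column filter_with_hvg expects
mdata = mu.read_h5mu("${OUT}_uss_with_sample.h5mu")
print("sample_id" in mdata.obs.columns)   # expected: True
print(mdata.obs["sample_id"].unique())    # expected: ['pbmc_1k_protein_v3']
HEREDOC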

Please do not hesitate to reach out again if you need help setting up OpenPipelines for your use case. In the meantime, have a look at the website to find the workflow that suits your needs.