openpipelines-bio / openpipeline

https://openpipelines.bio
MIT License

metadata_grep_annotation_column is failing in the integration tests with `Fractions are not within bounds` #718

Closed: DriesSchaumont closed this issue 2 months ago

DriesSchaumont commented 2 months ago

On ref (main): 7d48ed707a295e659bcf0d5f13f4c55ebc967d8d
Pipeline: process_samples

executor >  local (5)
[14/c13bab] process > test_wf5:move_layer:process... [100%] 1 of 1 ✔
[b8/f333a4] process > test_wf5:process_samples:ru... [100%] 1 of 1 ✔
[4f/a02532] process > test_wf5:process_samples:ru... [100%] 1 of 1 ✔
[-        ] process > test_wf5:process_samples:ru... -
[57/2cec88] process > test_wf5:process_samples:ru... [100%] 1 of 1, failed: 1 ✘
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
[-        ] process > test_wf5:process_samples:ru... -
After splitting modalities: [5k_human_antiCMV_T_TBNK_select_layer, [output:/home/runner/work/openpipeline/openpipeline/work/4f/a02532cc5ae05cb51deaa5f08fed4e/5k_human_antiCMV_T_TBNK_select_layer.split_modalities_component.output, highly_variable_features_var_output:filter_with_hvg, highly_variable_features_obs_batch_key:sample_id, mitochondrial_gene_regex:^[mM][tT]-, top_n_vars:[50, 100, 200, 500], pca_overwrite:false, id:5k_human_antiCMV_T_TBNK_select_layer, input:/home/runner/work/openpipeline/openpipeline/work/4f/a02532cc5ae05cb51deaa5f08fed4e/5k_human_antiCMV_T_TBNK_select_layer.split_modalities_component.output/5k_human_antiCMV_T_TBNK_select_layer.add_id.output_rna.h5mu, rna_layer:test_layer, rna_min_counts:2, rna_max_counts:1000000, rna_min_genes_per_cell:1, rna_max_genes_per_cell:1000000, rna_min_cells_per_gene:1, rna_min_fraction_mito:0.0, rna_max_fraction_mito:1.0, var_name_mitochondrial_genes:mitochondrial, obs_name_mitochondrial_fraction:fraction_mitochondrial, workflow_output:$id.$key.output.h5mu, var_qc_metrics:filter_with_hvg,mitochondrial, modality:rna]]
After splitting modalities: [5k_human_antiCMV_T_TBNK_select_layer, [output:/home/runner/work/openpipeline/openpipeline/work/4f/a02532cc5ae05cb51deaa5f08fed4e/5k_human_antiCMV_T_TBNK_select_layer.split_modalities_component.output, highly_variable_features_var_output:filter_with_hvg, highly_variable_features_obs_batch_key:sample_id, mitochondrial_gene_regex:^[mM][tT]-, top_n_vars:[50, 100, 200, 500], pca_overwrite:false, id:5k_human_antiCMV_T_TBNK_select_layer, input:/home/runner/work/openpipeline/openpipeline/work/4f/a02532cc5ae05cb51deaa5f08fed4e/5k_human_antiCMV_T_TBNK_select_layer.split_modalities_component.output/5k_human_antiCMV_T_TBNK_select_layer.add_id.output_gdo.h5mu, rna_layer:test_layer, rna_min_counts:2, rna_max_counts:1000000, rna_min_genes_per_cell:1, rna_max_genes_per_cell:1000000, rna_min_cells_per_gene:1, rna_min_fraction_mito:0.0, rna_max_fraction_mito:1.0, var_name_mitochondrial_genes:mitochondrial, obs_name_mitochondrial_fraction:fraction_mitochondrial, workflow_output:$id.$key.output.h5mu, var_qc_metrics:filter_with_hvg,mitochondrial, modality:gdo]]
WARN: Key for module 'grep_annotation_column' is duplicated.

WARN: Key for module 'calculate_qc_metrics' is duplicated.

WARN: Key for module 'publish' is duplicated.

WARN: Key for module 'pca' is duplicated.

WARN: Key for module 'find_neighbors' is duplicated.

WARN: Key for module 'umap' is duplicated.

Error executing process > 'test_wf5:process_samples:run_wf:runEachWf:rna_singlesample:run_wf:qc:run_wf:grep_annotation_column:processWf:grep_annotation_column_process (5k_human_antiCMV_T_TBNK_select_layer)'

Caused by:
  Process `test_wf5:process_samples:run_wf:runEachWf:rna_singlesample:run_wf:qc:run_wf:grep_annotation_column:processWf:grep_annotation_column_process (5k_human_antiCMV_T_TBNK_select_layer)` terminated with an error exit status (1)

Command executed:

  # meta exports
  # export VIASH_META_RESOURCES_DIR="/home/runner/work/openpipeline/openpipeline/target/nextflow/metadata/grep_annotation_column"
  export VIASH_META_RESOURCES_DIR=".viash_meta_resources"
  export VIASH_META_TEMP_DIR="/tmp"
  export VIASH_META_FUNCTIONALITY_NAME="grep_annotation_column"
  # export VIASH_META_EXECUTABLE="$VIASH_META_RESOURCES_DIR/$VIASH_META_FUNCTIONALITY_NAME"
  export VIASH_META_CONFIG="$VIASH_META_RESOURCES_DIR/.config.vsh.yaml"
  export VIASH_META_CPUS=1
  export VIASH_META_MEMORY_B=5368709120
  if [ ! -z ${VIASH_META_MEMORY_B+x} ]; then
    export VIASH_META_MEMORY_KB=$(( ($VIASH_META_MEMORY_B+1023) / 1024 ))
    export VIASH_META_MEMORY_MB=$(( ($VIASH_META_MEMORY_KB+1023) / 1024 ))
    export VIASH_META_MEMORY_GB=$(( ($VIASH_META_MEMORY_MB+1023) / 1024 ))
    export VIASH_META_MEMORY_TB=$(( ($VIASH_META_MEMORY_GB+1023) / 1024 ))
    export VIASH_META_MEMORY_PB=$(( ($VIASH_META_MEMORY_TB+1023) / 1024 ))
  fi

  # meta synonyms
  export VIASH_TEMP="$VIASH_META_TEMP_DIR"
  export TEMP_DIR="$VIASH_META_TEMP_DIR"

  # create output dirs if need be
  function mkdir_parent {
    for file in "$@"; do 
      mkdir -p "$(dirname "$file")"
    done
  }
  mkdir_parent "5k_human_antiCMV_T_TBNK_select_layer.grep_annotation_column.output.h5mu"

  # argument exports
  export VIASH_PAR_INPUT="_viash_par/input_1/5k_human_antiCMV_T_TBNK_select_layer.add_id.output_rna.h5mu"
  export VIASH_PAR_INPUT_LAYER="test_layer"
  export VIASH_PAR_MODALITY="rna"
  export VIASH_PAR_MATRIX="var"
  export VIASH_PAR_OUTPUT="5k_human_antiCMV_T_TBNK_select_layer.grep_annotation_column.output.h5mu"
  export VIASH_PAR_OUTPUT_MATCH_COLUMN="mitochondrial"
  export VIASH_PAR_OUTPUT_FRACTION_COLUMN="fraction_mitochondrial"
  export VIASH_PAR_REGEX_PATTERN="^[mM][tT]-"

  # process script
  set -e
  tempscript=".viash_script.sh"
  cat > "$tempscript" << VIASHMAIN
  import mudata as mu
  from pathlib import Path
  from operator import attrgetter, itemgetter
  from pandas import Series
  import re
  import numpy as np

  ### VIASH START
  # The following code has been auto-generated by Viash.
  par = {
    'input': $( if [ ! -z ${VIASH_PAR_INPUT+x} ]; then echo "r'${VIASH_PAR_INPUT//\'/\'\"\'\"r\'}'"; else echo None; fi ),
    'input_column': $( if [ ! -z ${VIASH_PAR_INPUT_COLUMN+x} ]; then echo "r'${VIASH_PAR_INPUT_COLUMN//\'/\'\"\'\"r\'}'"; else echo None; fi ),
    'input_layer': $( if [ ! -z ${VIASH_PAR_INPUT_LAYER+x} ]; then echo "r'${VIASH_PAR_INPUT_LAYER//\'/\'\"\'\"r\'}'"; else echo None; fi ),
    'modality': $( if [ ! -z ${VIASH_PAR_MODALITY+x} ]; then echo "r'${VIASH_PAR_MODALITY//\'/\'\"\'\"r\'}'"; else echo None; fi ),
    'matrix': $( if [ ! -z ${VIASH_PAR_MATRIX+x} ]; then echo "r'${VIASH_PAR_MATRIX//\'/\'\"\'\"r\'}'"; else echo None; fi ),
    'output': $( if [ ! -z ${VIASH_PAR_OUTPUT+x} ]; then echo "r'${VIASH_PAR_OUTPUT//\'/\'\"\'\"r\'}'"; else echo None; fi ),
    'output_compression': $( if [ ! -z ${VIASH_PAR_OUTPUT_COMPRESSION+x} ]; then echo "r'${VIASH_PAR_OUTPUT_COMPRESSION//\'/\'\"\'\"r\'}'"; else echo None; fi ),
    'output_match_column': $( if [ ! -z ${VIASH_PAR_OUTPUT_MATCH_COLUMN+x} ]; then echo "r'${VIASH_PAR_OUTPUT_MATCH_COLUMN//\'/\'\"\'\"r\'}'"; else echo None; fi ),
    'output_fraction_column': $( if [ ! -z ${VIASH_PAR_OUTPUT_FRACTION_COLUMN+x} ]; then echo "r'${VIASH_PAR_OUTPUT_FRACTION_COLUMN//\'/\'\"\'\"r\'}'"; else echo None; fi ),
    'regex_pattern': $( if [ ! -z ${VIASH_PAR_REGEX_PATTERN+x} ]; then echo "r'${VIASH_PAR_REGEX_PATTERN//\'/\'\"\'\"r\'}'"; else echo None; fi )
  }
  meta = {
    'functionality_name': $( if [ ! -z ${VIASH_META_FUNCTIONALITY_NAME+x} ]; then echo "r'${VIASH_META_FUNCTIONALITY_NAME//\'/\'\"\'\"r\'}'"; else echo None; fi ),
    'resources_dir': $( if [ ! -z ${VIASH_META_RESOURCES_DIR+x} ]; then echo "r'${VIASH_META_RESOURCES_DIR//\'/\'\"\'\"r\'}'"; else echo None; fi ),
    'executable': $( if [ ! -z ${VIASH_META_EXECUTABLE+x} ]; then echo "r'${VIASH_META_EXECUTABLE//\'/\'\"\'\"r\'}'"; else echo None; fi ),
    'config': $( if [ ! -z ${VIASH_META_CONFIG+x} ]; then echo "r'${VIASH_META_CONFIG//\'/\'\"\'\"r\'}'"; else echo None; fi ),
    'temp_dir': $( if [ ! -z ${VIASH_META_TEMP_DIR+x} ]; then echo "r'${VIASH_META_TEMP_DIR//\'/\'\"\'\"r\'}'"; else echo None; fi ),
    'cpus': $( if [ ! -z ${VIASH_META_CPUS+x} ]; then echo "int(r'${VIASH_META_CPUS//\'/\'\"\'\"r\'}')"; else echo None; fi ),
    'memory_b': $( if [ ! -z ${VIASH_META_MEMORY_B+x} ]; then echo "int(r'${VIASH_META_MEMORY_B//\'/\'\"\'\"r\'}')"; else echo None; fi ),
    'memory_kb': $( if [ ! -z ${VIASH_META_MEMORY_KB+x} ]; then echo "int(r'${VIASH_META_MEMORY_KB//\'/\'\"\'\"r\'}')"; else echo None; fi ),
    'memory_mb': $( if [ ! -z ${VIASH_META_MEMORY_MB+x} ]; then echo "int(r'${VIASH_META_MEMORY_MB//\'/\'\"\'\"r\'}')"; else echo None; fi ),
    'memory_gb': $( if [ ! -z ${VIASH_META_MEMORY_GB+x} ]; then echo "int(r'${VIASH_META_MEMORY_GB//\'/\'\"\'\"r\'}')"; else echo None; fi ),
    'memory_tb': $( if [ ! -z ${VIASH_META_MEMORY_TB+x} ]; then echo "int(r'${VIASH_META_MEMORY_TB//\'/\'\"\'\"r\'}')"; else echo None; fi ),
    'memory_pb': $( if [ ! -z ${VIASH_META_MEMORY_PB+x} ]; then echo "int(r'${VIASH_META_MEMORY_PB//\'/\'\"\'\"r\'}')"; else echo None; fi )
  }
  dep = {

  }

  ### VIASH END

  # START TEMPORARY WORKAROUND setup_logger
  # reason: resources aren't available when using Nextflow fusion
  # from setup_logger import setup_logger
  def setup_logger():
      import logging
      from sys import stdout

      logger = logging.getLogger()
      logger.setLevel(logging.INFO)
      console_handler = logging.StreamHandler(stdout)
      logFormatter = logging.Formatter("%(asctime)s %(levelname)-8s %(message)s")
      console_handler.setFormatter(logFormatter)
      logger.addHandler(console_handler)

      return logger
  # END TEMPORARY WORKAROUND setup_logger
  logger = setup_logger()

  def main(par):
      input_file, output_file, mod_name = Path(par["input"]), Path(par["output"]), par['modality']
      logger.info(f"Compiling regular expression '{par['regex_pattern']}'.")
      try:
          compiled_regex = re.compile(par["regex_pattern"])
      except (TypeError, re.error) as e:
          raise ValueError(f"{par['regex_pattern']} is not a valid regular expression pattern.") from e
      else:
          if compiled_regex.groups:
              raise NotImplementedError("Using match groups is not supported by this component.")
      logger.info('Reading input file %s, modality %s.', input_file, mod_name)

      mudata = mu.read_h5mu(input_file)
      modality_data = mudata[mod_name]
      logger.info("Reading input file done.")
      logger.info("Using annotation dataframe '%s'.", par["matrix"])
      annotation_matrix = getattr(modality_data, par['matrix'])
      default_column = {
          "var": attrgetter("var_names"),
          "obs": attrgetter("obs_names")
      }
      if par["input_column"]:
          logger.info("Input column '%s' was specified.", par["input_column"])
          try:
              annotation_column = annotation_matrix[par["input_column"]]
          except KeyError as e:
              raise ValueError(f"Column {par['input_column']} could not be found for modality "
                              f"{par['modality']}. Available columns:"
                              f" {','.join(annotation_matrix.columns.to_list())}") from e
      else:
          logger.info(f"No input column specified, using '.{par['matrix']}_names'")
          annotation_column = default_column[par['matrix']](modality_data).to_series()
      logger.info("Applying regex search.")
      grep_result = annotation_column.str.contains(par["regex_pattern"], regex=True)
      logger.info("Search results: %s", grep_result.value_counts())

      other_axis_attribute = {
          "var": "obs",
          "obs": "var"
      }
      if par['output_fraction_column']:
          logger.info("Enabled writing the fraction of values that matches to the pattern.")
          input_layer = modality_data.X if not par["input_layer"] else modality_data.layers[par["input_layer"]]
          pct_matching = np.ravel(np.sum(input_layer[:, grep_result], axis=1) / np.sum(input_layer, axis=1))
          assert ((pct_matching >= 0) & (pct_matching <= 1)).all(), \\
                  "Fractions are not within bounds, please report this as a bug"
          logger.info("Fraction statistics: \\n%s", Series(pct_matching).describe())
          output_matrix = other_axis_attribute[par['matrix']]
          logger.info("Writing fractions to matrix '%s', column '%s'",
                      output_matrix, par['output_fraction_column'])
          getattr(modality_data, output_matrix)[par['output_fraction_column']] = pct_matching
      logger.info("Adding values that matched the pattern to '%s', column '%s'",
                  par["matrix"], par["output_match_column"])
      getattr(modality_data, par['matrix'])[par["output_match_column"]] = grep_result
      logger.info("Writing out data to '%s' with compression '%s'.",
                  output_file, par["output_compression"])
      mudata.write(output_file, compression=par["output_compression"])

  if __name__ == "__main__":
      main(par)
  VIASHMAIN
  python -B "$tempscript"

Command exit status:
  1

Command output:
  2024-02-28 03:09:34,533 INFO     Compiling regular expression '^[mM][tT]-'.
  2024-02-28 03:09:34,534 INFO     Reading input file _viash_par/input_1/5k_human_antiCMV_T_TBNK_select_layer.add_id.output_rna.h5mu, modality rna.
  2024-02-28 03:09:34,742 INFO     Reading input file done.
  2024-02-28 03:09:34,742 INFO     Using annotation dataframe 'var'.
  2024-02-28 03:09:34,742 INFO     No input column specified, using '.var_names'
  2024-02-28 03:09:34,742 INFO     Applying regex search.
  2024-02-28 03:09:34,744 INFO     Search results: gene_ids
  False    5594
  Name: count, dtype: int64
  2024-02-28 03:09:34,744 INFO     Enabled writing the fraction of values that matches to the pattern.

Command error:
  Unable to find image 'ghcr.io/openpipelines-bio/metadata_grep_annotation_column:integration_build' locally
  integration_build: Pulling from openpipelines-bio/metadata_grep_annotation_column
  e1caac4eb9d2: Already exists
  51d1f07906b7: Already exists
  fe87ad6b112e: Already exists
  4d8ccb72bbad: Already exists
  8100581c78dd: Already exists
  695bd04ff41a: Pulling fs layer
  747b895a4e23: Pulling fs layer
  695bd04ff41a: Download complete
  695bd04ff41a: Pull complete
  747b895a4e23: Verifying Checksum
  747b895a4e23: Download complete
  747b895a4e23: Pull complete
  Digest: sha256:1c2f660a214ea0c439381ca5ab559be94b037bb2d3a2a80545a0b0901825ea75
  Status: Downloaded newer image for ghcr.io/openpipelines-bio/metadata_grep_annotation_column:integration_build
  .viash_script.sh:104: RuntimeWarning: invalid value encountered in divide
    pct_matching = np.ravel(np.sum(input_layer[:, grep_result], axis=1) / np.sum(input_layer, axis=1))
  Traceback (most recent call last):
    File ".viash_script.sh", line 120, in <module>
      main(par)
    File ".viash_script.sh", line 105, in main
      assert ((pct_matching >= 0) & (pct_matching <= 1)).all(), \
  AssertionError: Fractions are not within bounds, please report this as a bug

Work dir:
  /home/runner/work/openpipeline/openpipeline/work/57/2cec8881b6197de4fd9a464f36209a

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
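
A likely reading of the log above: the regex matched none of the 5594 `var_names`, so the numerator is 0 for every cell, and the `RuntimeWarning: invalid value encountered in divide` suggests that at least one cell also has a total of 0 in `test_layer`. In that case 0 / 0 yields NaN, and NaN fails both `>= 0` and `<= 1`, which trips the assertion. The sketch below reproduces that arithmetic on a small dense array. The toy data and the names `naive_fraction` and `guarded_fraction` are hypothetical, the real layer may well be a sparse matrix, and the guard shown is only one possible approach, not the fix adopted by the project.

```python
import numpy as np

# Hypothetical toy data (not taken from the failing test): 3 cells x 4 genes,
# where the second cell has no counts at all in the selected layer.
layer = np.array([
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
    [3.0, 1.0, 0.0, 0.0],
])
# No gene matches the pattern, as in the log ("False    5594").
matches = np.zeros(layer.shape[1], dtype=bool)


def naive_fraction(layer, matches):
    """Same arithmetic as the component: matched counts / total counts per row."""
    # Warnings are silenced here only to keep the demo output clean.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.ravel(np.sum(layer[:, matches], axis=1) / np.sum(layer, axis=1))


def guarded_fraction(layer, matches):
    """Possible guard (an assumption): rows whose total is zero get a
    fraction of 0 instead of NaN."""
    matched = np.ravel(np.sum(layer[:, matches], axis=1)).astype(float)
    totals = np.ravel(np.sum(layer, axis=1)).astype(float)
    out = np.zeros_like(totals)
    # Divide only where the denominator is non-zero; other entries keep 0.
    np.divide(matched, totals, out=out, where=totals != 0)
    return out


naive = naive_fraction(layer, matches)
guarded = guarded_fraction(layer, matches)

# NaN compares False against any bound, so the original check fails.
print(((naive >= 0) & (naive <= 1)).all())      # False
print(((guarded >= 0) & (guarded <= 1)).all())  # True
```

Whether a cell without counts should get a fraction of 0, stay NaN, or be filtered out before this step is a design decision for the component; the sketch only shows why the bounds check fails on such cells.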