outbrain / outrank

A Python library for efficient feature ranking and selection on sparse data sets.
https://dl.acm.org/doi/10.1145/3604915.3610636
BSD 3-Clause "New" or "Revised" License

MI-numba-3mr ranking throws an error on missing relational scores #44

Closed miha-jenko closed 1 year ago

miha-jenko commented 1 year ago

Running with the latest release (0.94.1).

I have encountered two errors. The first occurs when using target ranking with the default combination limit:

outrank \
    --task all \
    --data_path data_path/ \
    --data_source ob-vw \
    --subsampling 10 \
    --label_column label \
    --heuristic MI-numba-3mr \
    --include_noise_baseline_features True \
    --interaction_order 1 \
    --transformers none \
    --target_ranking_only True \
    --output_folder output_path/

Error:

Traceback (most recent call last):
  File "bin/outrank", line 8, in <module>
    sys.exit(main())
  File "lib/python3.8/site-packages/outrank/__main__.py", line 246, in main
    outrank_task_conduct_ranking(args)
  File "lib/python3.8/site-packages/outrank/task_ranking.py", line 233, in outrank_task_conduct_ranking
    mrmrmr_ranking = rank_features_3MR(
  File "lib/python3.8/site-packages/outrank/algorithms/importance_estimator.py", line 175, in rank_features_3MR
    feature_relation = calc_higher_order(feat, False)
  File "lib/python3.8/site-packages/outrank/algorithms/importance_estimator.py", line 162, in calc_higher_order
    values.append(relational_dict[(feat, feature)])
KeyError: ('CONTROL-target', 'feature_X')

The second occurs when using the suggested combination limit and higher subsampling:

outrank \
    --task all \
    --data_path data_path/ \
    --data_source ob-vw \
    --subsampling 300 \
    --label_column label \
    --heuristic MI-numba-3mr \
    --include_noise_baseline_features True \
    --interaction_order 1 \
    --transformers none \
    --target_ranking_only True \
    --combination_number_upper_bound 2048 \
    --output_folder output_path/

Error:

Traceback (most recent call last):
  File "lib/python3.8/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "lib/python3.8/site-packages/multiprocess/pool.py", line 48, in mapstar
    return list(map(*args))
  File "lib/python3.8/site-packages/pathos/helpers/mp_helper.py", line 15, in <lambda>
    func = lambda args: f(*args)
  File "outrank/outrank/core_ranking.py", line 126, in get_grounded_importances_estimate
    return get_importances_estimate_pairwise(combination, args, tmp_df=tmp_df)
  File "outrank/outrank/algorithms/importance_estimator.py", line 102, in get_importances_estimate_pairwise
    vector_first = tmp_df[[feature_one]].values.ravel()
  File "lib/python3.8/site-packages/pandas/core/frame.py", line 3767, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
  File "lib/python3.8/site-packages/pandas/core/indexes/base.py", line 5877, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "lib/python3.8/site-packages/pandas/core/indexes/base.py", line 5938, in _raise_if_missing
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['feature_Y AND_REL feature_Z'], dtype='object')] are in the [columns]"

The errors seem to suggest that the combination limit produces a reduced set of feature combinations, and that this reduced set is not respected when scores are retrieved.
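
To illustrate the suspected failure mode, here is a minimal sketch that does not use outrank's actual code. It assumes that only a capped subset of feature pairs gets scored, while the lookup still iterates over every feature. The names `relation_scores_strict`, `relation_scores_tolerant`, `limit`, and the feature names are hypothetical; only `relational_dict` echoes the name in the traceback.

```python
from itertools import combinations

features = ["feature_A", "feature_B", "feature_C", "feature_D"]

# All unordered pairs, then a hypothetical cap on how many get scored.
all_pairs = list(combinations(features, 2))
limit = 3  # assumption: stands in for a combination limit
scored_pairs = all_pairs[:limit]

# Placeholder scores; a real run would store relational estimates here.
relational_dict = {pair: 0.5 for pair in scored_pairs}


def relation_scores_strict(feat):
    """Assumes every (feat, other) pair was scored; raises KeyError otherwise."""
    return [relational_dict[(feat, other)] for other in features if other != feat]


def relation_scores_tolerant(feat):
    """Looks up both key orders and skips pairs that were never scored."""
    values = []
    for other in features:
        if other == feat:
            continue
        score = relational_dict.get((feat, other))
        if score is None:
            score = relational_dict.get((other, feat))
        if score is not None:
            values.append(score)
    return values


try:
    relation_scores_strict("feature_B")
except KeyError as err:
    print("strict lookup failed on pair:", err)

print(relation_scores_tolerant("feature_B"))
```

Whether skipping unscored pairs is the right behaviour for a ranking heuristic is a separate design question; the sketch only shows why a lookup that ignores the limit fails.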

As a bonus, @SkBlaz suggested we could warn users upfront about which combinations will not be calculated.
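
One possible shape for such a warning, as a rough sketch only: the helper `split_by_limit` and the logger name are hypothetical, and truncating the pair list is an assumption, since this thread does not state how outrank chooses which combinations to keep under the bound.

```python
import logging
from itertools import combinations

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("combination_check")  # hypothetical logger name


def split_by_limit(features, upper_bound):
    """Return (kept, skipped) feature pairs under a simple truncation rule."""
    all_pairs = list(combinations(features, 2))
    kept = all_pairs[:upper_bound]
    skipped = all_pairs[upper_bound:]
    if skipped:
        # Report the count and a short preview instead of the full list.
        logger.warning(
            "%d of %d feature pairs exceed the limit of %d and will not be scored, e.g. %s",
            len(skipped),
            len(all_pairs),
            upper_bound,
            skipped[:3],
        )
    return kept, skipped


kept, skipped = split_by_limit(["a", "b", "c", "d"], upper_bound=4)
```

Emitting the warning before any scoring starts would let a user raise the bound or drop features without waiting for a long run to fail.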

miha-jenko commented 1 year ago

Retesting 0.95 with a larger dataset, will let you know.

miha-jenko commented 1 year ago

This was fixed.