theislab / ehrapy

Electronic Health Record Analysis with Python.
https://ehrapy.readthedocs.io/
Apache License 2.0
235 stars 19 forks source link

Make all imputation methods consistent in regard to encoding requirements #824

Open nicolassidoux opened 1 week ago

nicolassidoux commented 1 week ago

Description of feature

Imputation requires proper encoding of the data, but at the moment, there is no consistent strategy about what to do if a passed AnnData is not encoded. Some methods throw an exception, while some others silently encode the data with an arbitrary method.

It would be probably better to just throw an exception and let the caller decide what to do instead of altering the data.

Function Encoding required Action if not encoded
explicit_impute No None
simple_impute Strategy-dependant Raise ValueError
knn_impute Yes Perform ordinal encoding
miss_forest_impute Yes Perform ordinal encoding
mice_forest_impute Yes Perform ordinal encoding

So I suggest to modify knn_impute, miss_forest_impute and mice_forest_impute.

nicolassidoux commented 1 week ago

Comments so far (note: I haven’t tested anything yet)

  1. Imports from protected

    • I noticed that prepocessing/_imputation.py contains imports from protected modules or functions:

      from ehrapy.anndata.anndata_ext import _get_column_indices  
      • _get_column_indices is primarily used outside its module and could be valuable to the public API. Since the documentation is already in place, I suggest making it public.
      from ehrapy.core._tool_available import _check_module_importable  
      • _check_module_importable is only used (outside its module) in prepocessing/_imputation.py to check for the presence of sklearnex. If conditional execution based on this is truly necessary, I propose making _check_module_importable public across the entire ehrapy module.

      • For consistency, I also recommend making _shell_command_accessible module-public. To align naming conventions, I renamed _tool_available to _utils_available to match _utils_doc (previously _doc_util).

  2. Return Value Behavior

    • Some functions return an AnnData object only if the copy argument is set. I believe this is a mistake. The hints specifices the functions should consistently return the AnnData object passed to them, regardless of whether it’s copied at the start.
  3. Use of Progress

    • I noticed some Progress are being used, but as far as I know, we don’t have any way to track the progress of imputations. To simplify the code, I recommend removing these.
  4. Improvements to _get_non_numerical_column_indices (prepocessing/_imputation.py)

    • I rewrote _is_float_or_nan to eliminate redundancy
    • I added _is_float_or_nan_row to avoid the need for np.vectorize. This change also resolves the type mismatch warnings caused by the use of np.apply_along_axis with the result of np.vectorize.
  5. Documentation

    • I’m generally skeptical about documenting the exceptions a function can raise unless enforced by a compiler (e.g., in Java).

      • If we want to document all possible exceptions, we’d have to include everything that could be raised by nested calls, which is practically impossible to maintain—especially when external libraries are involved.
      • If we don’t want to do this, what’s the point of an incomplete list?
    • In my opinion, documenting exceptions is only relevant in two cases:

      1. The function itself handles all exceptions from its calls and raises a limited, well-defined set of exceptions.
      2. The function raises custom exceptions that users might need to catch specifically.
nicolassidoux commented 4 days ago

I’ve updated the branch following my tests.

In test_imputation.py, I added a _base_check_imputation function designed to ensure that every imputation test meets the following criteria:

Additionally, I wrote dedicated tests to validate the _base_check_imputation function itself. Test-ception, if you will.

Let me know your thoughts before I proceed with creating a PR.

eroell commented 1 day ago

1.

eroell commented 1 day ago

Regarding the behaviour for non-encoded data:

So I suggest to modify knn_impute, miss_forest_impute and mice_forest_impute.

Agreed!

eroell commented 1 day ago

Looking forward very much to your PR! :)

nicolassidoux commented 1 day ago

I am not sure what's the Python convention of importing a private module from a different module within the same package. Do you happen to have a link on that?

For what I understood, a package A has access to a private module in another package B if A is lower in the hierarchy (ie B is somehow a parent of A). In our case A tries to get access to a module of C, but C is not a parent of A, hence the warning.

The behavior of functions not altering the shape of the anndata (e.g. encode does change it, but normalizations do not) should be that if copy=False, it alters the data inplace (that is, for example in X) and just returns None, and only if copy=True returns a new anndata, with a copied, modified array. Does this make sense to you? Or you have something in mind why it should always return the anndata anyways?

Since the hints suggest the function returns an AnnData, so a user could write this: adata = knn_impute(adata, copy=True) In this case adata is now None if knn_impute doesn't return anything. Note that we do this kind of assignments in our tests, to check if ids are equal.