Open nicolassidoux opened 1 week ago
Imports from protected
I noticed that prepocessing/_imputation.py
contains imports from protected modules or functions:
from ehrapy.anndata.anndata_ext import _get_column_indices
_get_column_indices
is primarily used outside its module and could be valuable to the public API. Since the documentation is already in place, I suggest making it public. from ehrapy.core._tool_available import _check_module_importable
_check_module_importable
is only used (outside its module) in prepocessing/_imputation.py
to check for the presence of sklearnex. If conditional execution based on this is truly necessary, I propose making _check_module_importable
public across the entire ehrapy module.
For consistency, I also recommend making _shell_command_accessible
module-public. To align naming conventions, I renamed _tool_available
to _utils_available
to match _utils_doc
(previously _doc_util
).
Return Value Behavior
AnnData
object only if the copy
argument is set. I believe this is a mistake. The hints specifices the functions should consistently return the AnnData
object passed to them, regardless of whether it’s copied at the start. Use of Progress
Progress
are being used, but as far as I know, we don’t have any way to track the progress of imputations. To simplify the code, I recommend removing these. Improvements to _get_non_numerical_column_indices
(prepocessing/_imputation.py
)
_is_float_or_nan
to eliminate redundancy_is_float_or_nan_row
to avoid the need for np.vectorize
. This change also resolves the type mismatch warnings caused by the use of np.apply_along_axis
with the result of np.vectorize
. Documentation
I’m generally skeptical about documenting the exceptions a function can raise unless enforced by a compiler (e.g., in Java).
In my opinion, documenting exceptions is only relevant in two cases:
I’ve updated the branch following my tests.
In test_imputation.py
, I added a _base_check_imputation
function designed to ensure that every imputation test meets the following criteria:
Additionally, I wrote dedicated tests to validate the _base_check_imputation
function itself. Test-ception, if you will.
Let me know your thoughts before I proceed with creating a PR.
1.
encode
does change it, but normalizations do not) should be that if copy=False
, it alters the data inplace (that is, for example in X) and just returns None
, and only if copy=True
returns a new anndata, with a copied, modified array. Does this make sense to you? Or you have something in mind why it should always return the anndata anyways?
3.Regarding the behaviour for non-encoded data:
So I suggest to modify knn_impute, miss_forest_impute and mice_forest_impute.
Agreed!
Looking forward very much to your PR! :)
I am not sure what's the Python convention of importing a private module from a different module within the same package. Do you happen to have a link on that?
For what I understood, a package A has access to a private module in another package B if A is lower in the hierarchy (ie B is somehow a parent of A). In our case A tries to get access to a module of C, but C is not a parent of A, hence the warning.
The behavior of functions not altering the shape of the anndata (e.g. encode does change it, but normalizations do not) should be that if copy=False, it alters the data inplace (that is, for example in X) and just returns None, and only if copy=True returns a new anndata, with a copied, modified array. Does this make sense to you? Or you have something in mind why it should always return the anndata anyways?
Since the hints suggest the function returns an AnnData
, so a user could write this:
adata = knn_impute(adata, copy=True)
In this case adata is now None
if knn_impute doesn't return anything. Note that we do this kind of assignments in our tests, to check if ids are equal.
Description of feature
Imputation requires proper encoding of the data, but at the moment, there is no consistent strategy about what to do if a passed AnnData is not encoded. Some methods throw an exception, while some others silently encode the data with an arbitrary method.
It would be probably better to just throw an exception and let the caller decide what to do instead of altering the data.
explicit_impute
simple_impute
knn_impute
miss_forest_impute
mice_forest_impute
So I suggest to modify
knn_impute
,miss_forest_impute
andmice_forest_impute
.