Closed Lilly-May closed 6 months ago
Nice, thanks for the summary and bringing this together here!
Ideally, we would get rid of the adata.var[EHRAPY_TYPE_KEY] annotation entirely and, if autodetect is set to True in the encoding method, base the identification of features to encode on the annotation from ep.ad.infer_feature_types.
This is not what is currently suggested in #697, right?
Also, I think this could lead to some hard-to-resolve issues if the annotation is not stored but inferred repeatedly. For example, labels are often encoded as True/False, 0/1, or yes/no. In the 0/1 case, type inference would likely infer continuous, yet 0/1 is sometimes meant to be categorical and sometimes continuous.
-> Users would sometimes want to switch the annotated type, which wouldn't be doable with on-the-fly type inference.
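To make the 0/1 case concrete, here is a minimal sketch in plain pandas. It is not ehrapy's implementation: the `infer_type` rule, the type labels, and the `feature_types` dict are hypothetical stand-ins for the inference rules and for an annotation stored alongside the data.

```python
import pandas as pd


def infer_type(series: pd.Series) -> str:
    """Toy rule (assumption): numeric dtype means 'continuous', else 'categorical'."""
    return "continuous" if pd.api.types.is_numeric_dtype(series) else "categorical"


df = pd.DataFrame(
    {
        "age": [34, 51, 67, 45],
        "diabetes": [0, 1, 1, 0],  # binary label encoded as 0/1
        "smoker": ["yes", "no", "no", "yes"],
    }
)

# Infer once and keep the result (stand-in for a stored annotation).
feature_types = {col: infer_type(df[col]) for col in df.columns}

# The user reviews the stored annotation and corrects the ambiguous 0/1 column.
feature_types["diabetes"] = "categorical"

# Inferring again on the fly knows nothing about that correction.
on_the_fly = {col: infer_type(df[col]) for col in df.columns}

print(feature_types["diabetes"], on_the_fly["diabetes"])
```

With a stored annotation the manual correction survives; with repeated inference the 0/1 column falls back to whatever the rule guesses.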
Description of feature
We currently have three issues (#637, #649, #662) dealing with how we determine feature types and the downstream problems that arise due to different approaches. I'll provide a summary of these issues and associated ToDos here, while closing the other three issues.
Problem: We've introduced PR #697 to accurately determine feature types in ehrapy. With the new method ep.ad.infer_feature_types, feature types are guessed based on predefined rules and we prompt the user to review these annotations. Currently, feature determination occurs inconsistently at multiple stages and is saved in the adata in several places. Ideally, we would harmonize ehrapy to exclusively use ep.ad.infer_feature_types for feature annotation, eliminating guesswork in downstream analyses. This means that the new method would be part of the standard preprocessing steps.
ToDos:
- The autodetect option of encoding here relies on the adata.var[EHRAPY_TYPE_KEY] tag, which is set when (1) reading a dataframe here or (2) moving something from obs to X here. Ideally, we would get rid of the adata.var[EHRAPY_TYPE_KEY] annotation entirely and, if autodetect is set to True in the encoding method, base the identification of features to encode on the annotation from ep.ad.infer_feature_types.
- [...] adata.uns["var_to_encoding"] and adata.uns["encoding_to_var"].
- [...] ep.ad.infer_feature_types.
- [...] ep.tl.rank_features_groups to be based on ep.ad.infer_feature_types.
- [...] ep.ad.infer_feature_types.
- [...] ep.ad.infer_feature_types to also work with date(time)s stored as strings and update the FHIR tutorial accordingly.
- [...] df_to_anndata, which also does a lot of type inference.
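As a rough illustration of two of the ToDos above (recognising date(time)s stored as strings, and picking the features to encode from a single stored annotation), here is a minimal sketch in plain pandas. It does not use or reproduce ehrapy's code: the helper names, the "numeric"/"categorical"/"date" labels, and the restriction to ISO-formatted strings are all assumptions made for the example.

```python
from datetime import datetime

import pandas as pd


def is_iso_datetime(value) -> bool:
    """Return True if the value parses as an ISO 8601 date or datetime string."""
    try:
        datetime.fromisoformat(str(value))
        return True
    except ValueError:
        return False


def infer_feature_type(series: pd.Series) -> str:
    """Hypothetical rules: numeric dtype, then ISO date strings, else categorical."""
    values = series.dropna()
    if pd.api.types.is_numeric_dtype(values):
        return "numeric"
    if len(values) > 0 and all(is_iso_datetime(v) for v in values):
        return "date"
    return "categorical"


df = pd.DataFrame(
    {
        "age": [34, 51, 67],
        "sex": ["f", "m", "f"],
        "admission": ["2021-03-04", "2021-03-09T10:15:00", "2021-04-01"],
    }
)

# One annotation per feature, inferred once and stored (stand-in for a column in adata.var).
feature_types = pd.Series(
    {col: infer_feature_type(df[col]) for col in df.columns}, name="feature_type"
)

# An "autodetect"-style encoding step would then only read the stored annotation.
to_encode = feature_types[feature_types == "categorical"].index.tolist()

print(feature_types.to_dict())
print(to_encode)
```

The point of the design is that downstream steps such as encoding read one stored annotation instead of running their own type inference, so date strings are not mistaken for categories to encode.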