Closed Zethson closed 1 year ago
The example at https://ehrapy.readthedocs.io/en/development/usage/preprocessing/ehrapy.preprocessing.qc_metrics.html#ehrapy.preprocessing.qc_metrics is slightly wrong: `ep.pp.calculate_qc_metrics(adata)` should be replaced with `ep.pp.qc_metrics(adata)`.
1) The `msno.bar` plot seems like the simplest graphic to me, but I am not sure how many columns (aka `var_names`) can be placed comfortably on such a plot.
2) The `msno.heatmap` is quite nice, but I don't think pairwise correlations are especially interesting when working with EHR data; I would rather see the bigger picture (what do you think?).
3) With the `msno.dendrogram` we could reveal broader trends across multiple variables, e.g. how the completeness of the notes taken at hospital admission affects the number of additional exams done afterwards (poor notes could lead to more unnecessary exams, and very detailed notes might lead to more precise exams, but maybe my thinking is wrong).
What do you think?
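To make the comparison a bit more concrete, here is a minimal sketch of the numbers that a completeness bar plot and a nullity heatmap summarize, as far as I understand them. It uses plain pandas rather than missingno itself, and the toy table and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy table standing in for EHR variables (all values are made up).
df = pd.DataFrame(
    {
        "age": [54, 61, np.nan, 47, 70, 38],
        "glucose": [5.4, np.nan, np.nan, 6.1, np.nan, 5.0],
        "hba1c": [38.0, np.nan, np.nan, 41.0, np.nan, 36.0],
        "notes_length": [120, 80, 15, np.nan, 60, 200],
    }
)

# Boolean mask: True where a value is missing.
nullity = df.isna()

# Bar-plot view: number of non-missing entries per column.
completeness = df.notna().sum()

# Heatmap view: pairwise correlation of the missingness indicators.
# A value near 1 means two variables tend to be missing together.
nullity_corr = nullity.astype(int).corr()

print(completeness)
print(nullity_corr.round(2))
```

In this toy table `glucose` and `hba1c` are always missing together, so their nullity correlation is 1, which is the kind of pairwise pattern the heatmap would surface. With many `var_names` both views get crowded, so some column selection is probably needed either way.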
Great!
The example at https://ehrapy.readthedocs.io/en/development/usage/preprocessing/ehrapy.preprocessing.qc_metrics.html#ehrapy.preprocessing.qc_metrics is slightly wrong: `ep.pp.calculate_qc_metrics(adata)` should be replaced with `ep.pp.qc_metrics(adata)`.
Good spot. Please submit a tiny PR with the fix :)
1. The `msno.bar` plot seems like the simplest graphic to me, but I am not sure how many columns (aka `var_names`) can be placed comfortably on such a plot.
I think that this is a general issue. The plotting functions probably need to be a bit flexible when it comes to data selection and visualization.
2. The `msno.heatmap` is quite nice, but I don't think pairwise correlations are especially interesting when working with EHR data; I would rather see the bigger picture (what do you think?).
Mhmm, not sure. It might help with identifying missing-not-at-random data (https://en.wikipedia.org/wiki/Missing_data#Missing_not_at_random)? I don't think that it hurts.
3. With the `msno.dendrogram` we could reveal broader trends across multiple variables, e.g. how the completeness of the notes taken at hospital admission affects the number of additional exams done afterwards (poor notes could lead to more unnecessary exams, and very detailed notes might lead to more precise exams, but maybe my thinking is wrong).
Cool!
What do you think?
See above. Generally, we also need to figure out some technical details.
- Can we plot the missing value information somehow on the UMAPs eventually still?
They are already plotted in a very light grey when plotting the `originals` layer.
Motivation
EHR data has a lot of missing values. Currently we calculate missing-value metrics with `qc_metrics` (https://ehrapy.readthedocs.io/en/development/usage/preprocessing/ehrapy.preprocessing.qc_metrics.html#ehrapy.preprocessing.qc_metrics) and store them in obs/var. Then we just create a couple of manual barplots later. This is not that pretty.
Proposed solution
We should investigate https://github.com/ResidentMario/missingno and any other missing value visualizations that we can find on Google etc. Then we adapt any plots that we deem useful and add them to `pl`. Ideally we reuse the existing values stored in obs/var (calculated with `qc_metrics` from above). If this is not reasonably possible, we have to investigate whether we can adapt our calculations to make this work seamlessly.
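As a rough sketch of what reusing per-variable and per-observation missingness could look like: the snippet below builds obs/var-style tables from a toy matrix with NumPy and pandas, then selects the variables with the most missing values for a compact bar plot. The column names `missing_values_abs` and `missing_values_pct` and the helper `top_missing` are assumptions made for illustration, not a confirmed description of what `qc_metrics` stores.

```python
import numpy as np
import pandas as pd

# Toy data matrix: rows are patients (obs), columns are variables (var).
X = np.array(
    [
        [1.0, np.nan, 3.0],
        [np.nan, np.nan, 6.0],
        [7.0, 8.0, 9.0],
        [10.0, np.nan, 12.0],
    ]
)
var_names = ["a", "b", "c"]
obs_names = ["p0", "p1", "p2", "p3"]

missing = np.isnan(X)

# Hypothetical column names; the real qc_metrics output may differ.
var = pd.DataFrame(index=var_names)
var["missing_values_abs"] = missing.sum(axis=0)
var["missing_values_pct"] = 100 * missing.mean(axis=0)

obs = pd.DataFrame(index=obs_names)
obs["missing_values_abs"] = missing.sum(axis=1)
obs["missing_values_pct"] = 100 * missing.mean(axis=1)


def top_missing(table, n=2, column="missing_values_pct"):
    """Return the n rows with the highest share of missing values."""
    return table.sort_values(column, ascending=False).head(n)


# Only the most incomplete variables would go into the bar plot.
print(top_missing(var, n=2))
```

Selecting a subset like this would also sidestep the question of how many `var_names` fit on one bar plot. Plots that need the full nullity pattern (heatmap, dendrogram) cannot be derived from these aggregates alone, so they would still need access to the data matrix.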