Zethson commented 2 years ago

Motivation

EHR data has a lot of missing values. Currently we calculate missing value statistics using https://ehrapy.readthedocs.io/en/development/usage/preprocessing/ehrapy.preprocessing.qc_metrics.html#ehrapy.preprocessing.qc_metrics and store them in obs/var. Afterwards we just create a couple of manual bar plots, which is not that pretty.

Proposed solution

We should investigate https://github.com/ResidentMario/missingno and any other missing value visualizations that we can find on Google etc. Then we adapt any plots that we deem useful and add them to pl. Ideally we reuse the existing values stored in obs/var (calculated with qc_metrics from above). If this is not reasonably possible, we have to investigate whether we can adapt our calculations to make this work seamlessly.
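To make the "reuse the values stored in obs/var" idea concrete, here is a minimal sketch of the kind of per-variable and per-patient summaries such plots could start from. It uses plain NumPy and pandas on a toy matrix; the variable names and the column names `missing_count` and `missing_pct` are made up for illustration and are not claimed to match what `qc_metrics` actually stores.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the data matrix (rows = patients, columns = variables).
X = np.array(
    [
        [65.0, np.nan, 120.0],
        [np.nan, np.nan, 135.0],
        [72.0, 1.2, np.nan],
        [58.0, 0.9, 110.0],
    ]
)
var_names = ["age", "creatinine", "systolic_bp"]  # hypothetical variable names

missing = np.isnan(X)  # boolean mask, True where a value is missing

# Per-variable summary; the column names are illustrative only.
var_summary = pd.DataFrame(
    {
        "missing_count": missing.sum(axis=0),
        "missing_pct": 100 * missing.mean(axis=0),
    },
    index=var_names,
)

# Per-patient counts, analogous to something that could live in obs.
obs_missing_count = missing.sum(axis=1)

print(var_summary)
print(obs_missing_count)
```

A bar plot of `var_summary["missing_pct"]` would then only need this small table rather than the full matrix.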

namsaraeva commented 2 years ago

The example at https://ehrapy.readthedocs.io/en/development/usage/preprocessing/ehrapy.preprocessing.qc_metrics.html#ehrapy.preprocessing.qc_metrics is slightly wrong: ep.pp.calculate_qc_metrics(adata) should be replaced with ep.pp.qc_metrics(adata).

1) The msno.bar plot seems like the simplest graphic to me, but I am not sure how many columns (aka var_names) fit comfortably on such a plot.

2) The msno.heatmap is quite nice, but I don't think pairwise correlations are extremely interesting when working with EHR. I would rather see the bigger picture (what do you think?).

3) With the msno.dendrogram we could reveal broader trends between multiple variables, e.g. how the completeness of the notes taken at hospital admission affects the number of additional exams done afterwards (poor notes could lead to more unnecessary exams and very detailed notes might lead to more precise exams, but maybe my thinking is wrong).

What do you think?
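For reference, the pairwise quantity behind a missingness heatmap can be sketched without any plotting library. The snippet below computes a nullity correlation (the correlation between the missing/present indicators of two columns), which, as far as I understand, is roughly what `msno.heatmap` visualizes. The DataFrame and its column names are invented toy data.

```python
import numpy as np
import pandas as pd

# Toy data: lab_a is missing exactly when the admission note is missing,
# lab_b is missing exactly when the admission note is present.
df = pd.DataFrame(
    {
        "admission_note": ["a", None, "b", None, "c", None],
        "lab_a": [1.0, np.nan, 2.0, np.nan, 3.0, np.nan],
        "lab_b": [np.nan, 1.0, np.nan, 2.0, np.nan, 3.0],
        "age": [50, 61, 47, 70, 66, 58],
    }
)

nullity = df.isna()

# Columns that are always present (or always missing) have no variance,
# so their correlation is undefined; drop them first.
varying = nullity.loc[:, nullity.nunique() > 1]

# Correlation of the 0/1 missingness indicators.
nullity_corr = varying.astype(float).corr()
print(nullity_corr)
```

A value near 1 means two variables tend to be missing together, a value near -1 means one tends to be present when the other is missing.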

Zethson commented 2 years ago

Great!

> The example at https://ehrapy.readthedocs.io/en/development/usage/preprocessing/ehrapy.preprocessing.qc_metrics.html#ehrapy.preprocessing.qc_metrics is slightly wrong: ep.pp.calculate_qc_metrics(adata) should be replaced with ep.pp.qc_metrics(adata).

Good spot. Please submit a tiny PR with the fix :)

> 1. The `msno.bar` plot seems like the simplest graphic to me, but I am not sure how many columns (aka `var_names`) fit comfortably on such a plot.

I think that this is a general issue. The plotting functions probably need to be a bit flexible when it comes to data selection and visualization.
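One possible form of that flexibility is to let the caller restrict a plot to the most incomplete variables. A minimal sketch, assuming the data is available as a pandas DataFrame; the helper name `most_incomplete` and the toy columns are hypothetical:

```python
import numpy as np
import pandas as pd


def most_incomplete(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Return only the n columns with the highest fraction of missing values."""
    frac_missing = df.isna().mean().sort_values(ascending=False, kind="stable")
    return df[frac_missing.index[:n]]


# Toy data with 0, 1, 2 and 3 missing values per column.
df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [np.nan, 2.0, 3.0, 4.0],
        "c": [np.nan, np.nan, 3.0, 4.0],
        "d": [np.nan, np.nan, np.nan, 4.0],
    }
)

subset = most_incomplete(df, n=2)
print(list(subset.columns))
```

The resulting subset could then be handed to whichever plotting function is chosen, which keeps bar plots readable even with many variables.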

> 2. The `msno.heatmap` is quite nice, but I don't think pairwise correlations are extremely interesting when working with EHR. I would rather see the bigger picture (what do you think?).

Mhmm, not sure. It might help to determine whether data is missing not at random (https://en.wikipedia.org/wiki/Missing_data#Missing_not_at_random)? I don't think that it hurts.

> 3. With the `msno.dendrogram` we could reveal broader trends between multiple variables, e.g. how the completeness of the notes taken at hospital admission affects the number of additional exams done afterwards (poor notes could lead to more unnecessary exams and very detailed notes might lead to more precise exams, but maybe my thinking is wrong).

Cool!

> What do you think?

See above. Generally, we also need to figure out some technical details.

1. What is the input for each of these plot types?
2. How does it differ from what we currently have in our AnnData object? A NumPy matrix, a DataFrame?
3. Can we store/reuse intermediate results somewhere?
4. Can we eventually still plot the missing value information on the UMAPs somehow?
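On the input question: missingno operates on pandas DataFrames, so one option is to wrap the matrix and its labels in a DataFrame before plotting. The sketch below does this by hand for a toy mixed-type matrix; with a real AnnData object, `adata.to_df()` should give a comparable frame. All names and values here are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy mixed-type matrix, as EHR data often contains non-numeric columns.
X = np.array(
    [
        [1.0, None],
        [np.nan, "yes"],
        [5.0, "no"],
    ],
    dtype=object,
)
obs_names = ["patient_0", "patient_1", "patient_2"]  # hypothetical
var_names = ["heart_rate", "smoker"]  # hypothetical

# Labelled DataFrame: the input format that missingno-style plots expect.
df = pd.DataFrame(X, index=obs_names, columns=var_names)

# pd.isna-based detection handles both None and NaN, also for object dtype
# (np.isnan would raise on an object array).
missing_mask = df.isna().to_numpy()
print(missing_mask)
```

A boolean mask like `missing_mask` is one candidate for an intermediate result that could be computed once and reused across several plots.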

https://mybinder.org/ https://colab.research.google.com/

Imipenem commented 2 years ago

> Can we plot the missing value information somehow on the UMAPs eventually still?

They are already shown: when plotting the originals layer, missing values are colored in a very light grey.

theislab / ehrapy

Improve missing value visualizations #271
