theislab / ehrapy

Electronic Health Record Analysis with Python.
https://ehrapy.readthedocs.io/
Apache License 2.0

Integration of MedCAT #101

Closed. Zethson closed 2 years ago

Zethson commented 3 years ago

To extract keywords from free-text notes, we will integrate MedCAT.

The goals are as follows:

  1. Allow users to train their own models with their own vocabularies
  2. Allow for the extraction of keywords
  3. Ensure that keywords are associated with attributes such as quantitative or qualitative information
  4. Extract a word2vec embedding and save it somewhere in the AnnData object
  5. Ensure that we have an easily runnable default vocabulary etc.
  6. n_proc of multiprocessing should respect n_jobs of our settings object
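
For goal 6, a minimal sketch of how a process count could be derived from n_jobs. The `Settings` class and `resolve_n_proc` are hypothetical names for illustration, not the actual ehrapy settings object, and the joblib-like convention for negative values is an assumption:

```python
import os
from dataclasses import dataclass


@dataclass
class Settings:
    """Hypothetical stand-in for the ehrapy settings object."""

    n_jobs: int = 1


def resolve_n_proc(n_jobs: int) -> int:
    """Translate an n_jobs setting into a positive process count.

    Assumed convention (joblib-like): -1 means all available cores,
    -2 means all but one, and so on. Positive values are capped at the
    number of available cores.
    """
    if n_jobs == 0:
        raise ValueError("n_jobs must not be 0")
    available = os.cpu_count() or 1
    if n_jobs < 0:
        # -1 -> available, -2 -> available - 1, ... but never below 1.
        return max(1, available + 1 + n_jobs)
    return min(n_jobs, available)


settings = Settings(n_jobs=-1)
n_proc = resolve_n_proc(settings.n_jobs)  # value to hand to the multiprocessing call
```

The resolved value would then be passed on as the process count wherever the annotation step is parallelized, so users only configure parallelism in one place.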

Tasks in somewhat reasonable order

Zethson commented 3 years ago

https://github.com/CogStack/MedCAT/issues/145

Zethson commented 3 years ago

It was merged, but there is no release yet.

Add the GitHub tag as a dependency, see https://github.com/python-poetry/poetry/issues/313

Zethson commented 3 years ago

I think that free text should not be part of X in its raw form. I'd add the free text to obs only and allow for an embedding column/matrix to be appended to X after MedCAT was applied.

Imipenem commented 3 years ago

Yes, I will take care of implementing an "autodetect" feature for this, so that users are not forced to list every free-text column as obs-only when creating the MuData/AnnData object.
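
One way such an autodetect step could work, as a rough heuristic sketch only. The function name `detect_free_text_columns`, both thresholds, and the example records are assumptions made up for illustration, not the planned implementation:

```python
from typing import Any, List, Mapping, Sequence


def detect_free_text_columns(
    columns: Mapping[str, Sequence[Any]],
    min_mean_tokens: float = 3.0,
    min_unique_ratio: float = 0.5,
) -> List[str]:
    """Guess which columns hold free text.

    A column counts as free text if all non-missing values are strings,
    the values contain several whitespace-separated tokens on average,
    and most values are distinct (which rules out categorical codes).
    """
    detected = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if not present or not all(isinstance(v, str) for v in present):
            continue
        mean_tokens = sum(len(v.split()) for v in present) / len(present)
        unique_ratio = len(set(present)) / len(present)
        if mean_tokens >= min_mean_tokens and unique_ratio >= min_unique_ratio:
            detected.append(name)
    return detected


# Made-up example records, column name -> values.
records = {
    "age": [54, 61, 47],
    "sex": ["f", "m", "f"],
    "notes": [
        "Patient reports shortness of breath",
        "No acute distress noted on admission",
        None,
    ],
}
free_text_columns = detect_free_text_columns(records)  # ["notes"]
```

Columns detected this way could be routed to obs by default, with an explicit user-supplied list still taking precedence over the guess.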

Zethson commented 2 years ago

1.2.6 was released and we should be ready to implement this now.

Zethson commented 2 years ago

I rewrote our current implementation to work with the latest MedCAT. I think this still requires a redesign.

Imipenem commented 2 years ago

As discussed: Keep a "main" MedCAT object, so we do not lose any results.

Add a function to nicely display such an object, for easier navigation by the user.

Add pp functions to filter the object by specific values (such as TUI, CUI, type of disease, symptoms, etc.).

Add a function that returns a binary column based on user filtering (for example, which rows mention a pulmonary disease). That way the actual values, of which there may be several per row, never need to be stored in the AnnData object; only indicators are added when they are needed.
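
A minimal sketch of the indicator idea, assuming the extraction results are kept as a mapping from row identifier to a list of entity dicts. The entity layout (`cui`, `type_ids`), the function name `indicator_column`, and all identifiers are made-up placeholders, not real UMLS codes and not a confirmed MedCAT output format:

```python
from typing import Any, Dict, Iterable, List, Mapping, Sequence

# Made-up annotation results, keyed by row (patient) identifier.
annotations: Dict[str, List[Dict[str, Any]]] = {
    "patient_1": [
        {"cui": "CUI_A", "type_ids": ["TUI_DISEASE"]},
        {"cui": "CUI_B", "type_ids": ["TUI_SYMPTOM"]},
    ],
    "patient_2": [{"cui": "CUI_B", "type_ids": ["TUI_SYMPTOM"]}],
    "patient_3": [],
}


def indicator_column(
    annotations: Mapping[str, Sequence[Mapping[str, Any]]],
    row_order: Sequence[str],
    cuis: Iterable[str] = (),
    tuis: Iterable[str] = (),
) -> List[bool]:
    """Return one boolean per row: True if any entity matches a wanted CUI or TUI."""
    wanted_cuis, wanted_tuis = set(cuis), set(tuis)
    column = []
    for row in row_order:
        hit = any(
            ent.get("cui") in wanted_cuis
            or bool(wanted_tuis & set(ent.get("type_ids", ())))
            for ent in annotations.get(row, ())
        )
        column.append(hit)
    return column


rows = ["patient_1", "patient_2", "patient_3"]
has_cui_a = indicator_column(annotations, rows, cuis=["CUI_A"])
has_symptom = indicator_column(annotations, rows, tuis=["TUI_SYMPTOM"])
```

Because the column is computed from the stored results on demand, a row with several matching entities still collapses to a single True, and nothing multi-valued has to live in the AnnData object.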

Add a decorator for, or override, plotting functions such as umap and pca, for example to color by pulmonary disease (y/n). This column might not be present in X or obs, so it would be added using the approach described right above.
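
The decorator variant could look roughly like this. It is a sketch under assumptions: `with_temporary_obs_column` and `compute_column` are hypothetical names, `color` is assumed to be a single key, and a plain dict stands in for `adata.obs` so the example runs without AnnData:

```python
import functools
from types import SimpleNamespace


def with_temporary_obs_column(compute_column):
    """Decorator factory: make a missing `color` key available during plotting.

    `compute_column(adata, key)` is a hypothetical callback that returns
    one value per row for the requested key.
    """

    def decorator(plot_func):
        @functools.wraps(plot_func)
        def wrapper(adata, color=None, **kwargs):
            # Only add the column if the key is requested and not already present.
            added = color is not None and color not in adata.obs
            if added:
                adata.obs[color] = compute_column(adata, color)
            try:
                return plot_func(adata, color=color, **kwargs)
            finally:
                # Remove the temporary column again, even if plotting failed.
                if added:
                    del adata.obs[color]

        return wrapper

    return decorator


# Made-up indicator values; in practice these would come from the filtered results.
indicators = {"pulmonary_disease": [True, False]}


@with_temporary_obs_column(lambda adata, key: indicators[key])
def fake_umap(adata, color=None):
    """Stand-in for a plotting function; returns the values it would color by."""
    return list(adata.obs[color])


adata = SimpleNamespace(obs={"age": [54, 61]})
colored_by = fake_umap(adata, color="pulmonary_disease")
```

Cleaning up in `finally` keeps the object unchanged after plotting, which matches the goal of not persisting derived values unless they are needed.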

Imipenem commented 2 years ago

Did I miss anything, @Zethson?

Zethson commented 2 years ago

No, sounds great. Feel free to show early drafts so that we can evaluate our approach before doubling down.

Thank you!