Creation of MuData object is very slow with multiple AnnData objects

Imipenem commented 2 years ago

Describe the bug

Hey, I'm currently working on a project where I'm using multiple AnnData objects, and therefore I wanted to use mudata (0.1.0).

My test data consists of several smaller .csv files (usually around 1-2 MB) and one bigger file (about 75 MB). In total I have about 20 .csv files (each read separately, with one AnnData object created per file) with a total size of about 100 MB.

I tested it and it seems like the update function is taking a very long time to finish. Is there anything that could be done to speed up this update step? Or is updating even necessary for basic handling of the object?

Many thanks in advance

To Reproduce

I'm using the data from https://physionet.org/static/published-projects/mimiciii-demo/ (the ZIP folder).

So what I basically do is:

1.) Read the .csv files using the standard pandas function
2.) Create an AnnData object for each file and add it to mod
3.) Call update() once every object has been added

(Note: without calling update() at the end, this runs much much faster).
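To see how much of the total run time each of the three steps takes, they can be timed separately. This is a minimal sketch using only the standard library. The `timed` helper and the three stand-in functions are hypothetical placeholders for the pandas read, the AnnData construction and the `update()` call; they are not part of mudata.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(label):
    """Accumulate the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + time.perf_counter() - start

# Stand-ins for the real steps (hypothetical; replace with the actual calls).
def read_tables():
    return [[("subject_id", i)] for i in range(3)]

def build_objects(tables):
    return {f"table_{i}": t for i, t in enumerate(tables)}

def update(objects):
    return sum(len(t) for t in objects.values())

with timed("read"):
    tables = read_tables()
with timed("build"):
    objects = build_objects(tables)
with timed("update"):
    total = update(objects)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.6f} s")
```

Reporting the per-step numbers alongside the table shapes makes it easier for others to reproduce the slow step in isolation.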

Expected behaviour

I'd like to know if there is a way to speed up the update step, or whether updating is even necessary at all.

System

OS: Ubuntu
Python version: 3.8
Versions of libraries involved: mudata 0.1.0, anndata 0.7.6

gtca commented 2 years ago

Hey @Imipenem, thanks for your feedback! Do you think you could provide more detailed info on what AnnData objects you're creating from the data you linked so that we can make sure we understand the problem better?

In particular, many of those .csv files seem to contain metadata together with numerical data. Importantly, what is the common obs_names dimension across the files that we should use? row_id doesn't seem to be it.

Imipenem commented 2 years ago

Thank you @gtca for your quick response and sorry for my delayed one.

provide more detailed info on what AnnData objects you're creating from the data

So the general purpose is as follows: we are working on an application that will have to deal with multiple different datasets from EHRs (Electronic Health Records). The datasets will contain numerical and non-numerical data (categorical strings or free text), as is the case in the dataset I linked above.

So for each file in the dataset (for example, each .csv file), we will create a new AnnData object.

X in each object will initially contain the "mixed" data (thus X initially has the object datatype).
var should be empty initially but will be used when the data is processed further on
obs with n obs could either be indexed from 1-n or with a custom defined column (for example the subject_id column in our example dataset)

It's important to state that the different AnnData objects will differ in shape and have different values and datatypes (for example, the subject_id of a patient A in FileA.csv does not have to appear in FileB.csv, or may appear a different number of times), so two such AnnData objects will likely not have the same n_obs.

So the "metadata" (the non-numerical data) is an essential part of this AnnData object and will be stored in X at least at the first read (although this data could later be encoded into "numerical" data).

Importantly, what is the common obs_names dimension across the files that we should use? row_id doesn't seem to be it

Do you mean like a "common column" across all the files/AnnData objects? In this case this could be subject_id, as this (in this example) is the identifier for different patients. If I misunderstood your question (or anything else is still unclear), feel free to correct me ;).

Many thanks

gtca commented 2 years ago

Hey @Imipenem, thanks for the explanation, I see now that AnnData objects use dtype="object" for .X. While it should be alright that obs are different (but hopefully intersecting) between datasets, it looks to me like in most files the data is in the «long» format, and this doesn't seem to me what AnnData / MuData are designed for, e.g. subject_id is not unique in most files. (Wouldn't e.g. relational databases be a better fit for such data?) My expectation here would be to rather work with sparse «wide» tables (pandas.DataFrame.pivot) if AnnData objects are to be used.
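A minimal sketch of the «long» versus «wide» distinction, assuming pandas is available. The table below is toy data: the `item` and `value` column names and all values are made up for illustration and are not taken from the linked dataset.

```python
import pandas as pd

# Toy "long" table: one row per (subject, measurement).
long_df = pd.DataFrame(
    {
        "subject_id": [1, 1, 2, 2, 3],
        "item": ["heart_rate", "glucose", "heart_rate", "glucose", "heart_rate"],
        "value": [72.0, 5.4, 88.0, 6.1, 65.0],
    }
)

# subject_id repeats, so it cannot serve as unique obs_names as-is.
print(long_df["subject_id"].is_unique)  # False

# Pivot to "wide": one row per subject, one column per item.
# DataFrame.pivot raises if a (subject_id, item) pair occurs more than once;
# such repeats would have to be aggregated first (e.g. with pivot_table).
wide_df = long_df.pivot(index="subject_id", columns="item", values="value")

print(wide_df.index.is_unique)  # True
print(wide_df.shape)            # (3, 2)
```

In the wide table each subject appears exactly once. Measurements a subject does not have become missing values, which is why the wide form tends to be sparse.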

MuData in particular traces observations between modalities so having multiple ones with the same name makes joins ambiguous (in fact, it does show a warning about that). If just row numbers are to be used for obs_names of AnnData objects, not sure why MuData is needed to store that: row 10 in table A is not the same as row 10 in table B, etc. If the subject_id column is used for obs_names, there's the aforementioned «long» format issue; adata will rightfully suggest running .obs_names_make_unique() upon creation — but this will make all the entries from the same subject unique in each AnnData object and will blow up total n_obs to 760937 for the data linked above.
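The effect on n_obs can be illustrated with a small standard-library sketch. The two lists are toy subject_id columns. `make_unique` is a hypothetical stand-in that suffixes repeated names; it is not the actual implementation behind `.obs_names_make_unique()`.

```python
from collections import Counter

# Toy subject_id columns from two hypothetical tables (values are made up).
table_a = ["s1", "s1", "s2", "s3", "s3", "s3"]
table_b = ["s1", "s2", "s2", "s4"]

def make_unique(names):
    """Hypothetical helper: suffix repeated names so every entry is distinct."""
    seen = Counter()
    out = []
    for name in names:
        out.append(name if seen[name] == 0 else f"{name}-{seen[name]}")
        seen[name] += 1
    return out

# With raw subject_id values, the same name occurs several times per table.
duplicates_a = {n: c for n, c in Counter(table_a).items() if c > 1}

# Number of distinct subjects across both tables.
n_subjects = len(set(table_a) | set(table_b))

# After making names unique within each table, the union of names grows
# towards the total number of rows rather than the number of subjects.
unique_a = make_unique(table_a)
unique_b = make_unique(table_b)
n_obs_after = len(set(unique_a) | set(unique_b))

print(duplicates_a)  # {'s1': 2, 's3': 3}
print(n_subjects)    # 4
print(n_obs_after)   # 8
```

The suffixed names also stop identifying a subject: "s1-1" in one table has no meaningful counterpart in the other, so a join on those names no longer links records of the same patient.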

That being said, now that we can replicate the slower-than-desired update, we'll look into that and will try to speed it up.

Imipenem commented 2 years ago

@gtca Thank you very much for the detailed explanation. Unfortunately, relational databases etc. are not an option for us at the moment, since we would then have to rewrite basically everything from scratch. I can definitely see that our type of data (especially the unique obs issue) is not what AnnData/MuData were designed for, since we will have multiple rows in a single AnnData object that belong to a single patient (or "cell" ;)).

That being said, now that we can replicate the slower-than-desired update, we'll look into that and will try to speed it up.

Thanks a lot, would be cool to see this ;).

Zethson commented 2 years ago

We will solve this issue internally in other ways.

ivirshup commented 2 years ago

Should this get closed? It seems like there is still the issue of slow updates

Zethson commented 2 years ago

@ivirshup the general title may be misleading here. @Imipenem used datasets which really should not result in a MuData object in the end. It made no sense.

We can of course keep this issue open as a reminder, but I had the impression that more specific performance issues with the update function are tracked elsewhere (e.g. #16 #18 ).

@scverse/muon feel free to do whatever you think is best here :)

Imipenem commented 2 years ago

Would disagree to some extent here: the title is somewhat misleading, yeah, but I guess it would be cool to figure out whether it would be possible to somehow use MuData with our datasets (multiple rows could belong to one patient over multiple files) or not.

gtca commented 2 years ago

Thanks, @Zethson, indeed, #16 and #18 should resolve once fully implemented. @ivirshup, I think the speed improvement provided in #17 should resolve the issue. Addressing #18 would also help but not #1 as here the problem was even within a single run of .update(). As mentioned above and discussed with @Imipenem, there is probably an issue of storing «long» data in AnnData to begin with.
