Open chaichontat opened 2 years ago
Hi,
thank you very much for this high quality issue with even a small POC!
The number of columns in obs and var can also grow to the point that we lose track of how each column came to be.
There have been discussions about recording all operations that modify an AnnData object @ivirshup @falexwolf.
Functions would rely less on side-effects, which would lead to more predictable outcomes in the long run. Another benefit would be the possibility of safely parallelizing pipelines in the future.
Agree. This would be great.
I'd like to note that everything modifying one fat AnnData object inplace is in my opinion very beginner friendly. But I can certainly see the issues and benefits that you outline. Without making any recommendations (I need to think about this more) I am interested in what others are thinking.
@chaichontat, thank you for the in-depth elaboration and proposals!
I agree on the downsides of the present in-place data science workflows that you're pointing out.
I agree with @Zethson that for simple workflows, the current API seems very intuitive and beginner-friendly.
But I also think that for complex cases the problems are severe enough that there is real danger of making mistakes and getting lost. A safer, more robust, and more convenient/readable/interpretable API would be absolutely desirable.
There are a variety of ways of arriving there. I think that all of these ways have in common that the bookkeeping of the data science workflow (API calls) needs to be dealt with explicitly. Right now, it's merely implicit in the order of keys added to AnnData. Instead, "bookkeeping" and "data containing" should likely be separated into explicit objects.
I could imagine though, that the data scientist may still just work with one combined data container object, just that that container object is subject to "strong supervision" from a bookkeeper object in the background. I'd suggest calling such a data container one that displays "conditional/supervised immutability". Already written slots of data within the container cannot be mutated, unless the Bookkeeper determines it's safe. Or if it's append-only.
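To make the idea a bit more concrete, here is a minimal sketch of what such a supervised, append-only container could look like. Everything in it (Bookkeeper, SupervisedContainer, Record, write, may_write) is a hypothetical name chosen for illustration and not part of anndata, and the policy shown is the simplest possible one: new keys are accepted, existing keys are protected.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Record:
    """One bookkeeping entry: which call wrote which key, with which parameters."""
    key: str
    producer: str
    params: Tuple[Tuple[str, Any], ...]


class Bookkeeper:
    """Hypothetical supervisor: logs every write and decides whether it is allowed."""

    def __init__(self) -> None:
        self.log: List[Record] = []

    def may_write(self, existing_keys: Iterable[str], key: str) -> bool:
        # Append-only policy (an assumption): only keys that do not exist yet.
        return key not in existing_keys

    def record(self, key: str, producer: str, params: Dict[str, Any]) -> None:
        self.log.append(Record(key, producer, tuple(sorted(params.items()))))


class SupervisedContainer:
    """Data container whose writes all go through the Bookkeeper."""

    def __init__(self, bookkeeper: Bookkeeper) -> None:
        self._data: Dict[str, Any] = {}
        self._bookkeeper = bookkeeper

    def __getitem__(self, key: str) -> Any:
        return self._data[key]

    def __contains__(self, key: object) -> bool:
        return key in self._data

    def write(self, key: str, value: Any, *, producer: str, **params: Any) -> None:
        if not self._bookkeeper.may_write(self._data.keys(), key):
            raise PermissionError(f"{key!r} already exists and may not be overwritten")
        self._data[key] = value
        self._bookkeeper.record(key, producer, params)


keeper = Bookkeeper()
obs = SupervisedContainer(keeper)
obs.write("cluster", [0, 1, 1], producer="clustering", resolution=1.0)

try:
    # A second write to the same slot is refused under the append-only policy.
    obs.write("cluster", [0, 0, 1], producer="clustering", resolution=0.5)
    overwrite_blocked = False
except PermissionError:
    overwrite_blocked = True
```

The user still works with a single object (obs here), while keeper.log answers the question of how each column came to be. A less strict may_write could allow overwrites when it can prove that nothing downstream depends on the old value.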
After studying and learning a lot, I think the solution you prototyped @chaichontat seems a very good way of resolving the mentioned issues through several very elegant ideas! In my view, to produce an elegant workflow user experience that could approach the simplicity of the current one, it should be complemented by a Bookkeeper. I think this is the direction in which you're going by saying
Mappings from obs or var to this container could be created to maintain backward compatibility.
I'll also need to think more about it. And how all of this can be reconciled by people who want a more object-oriented API.
I'm looking forward to reading what others say!
static-frame actually has a FrameGO object that is append-only.
We could also integrate anndata into a data version control system, similar to what the ML people are using. We get appendable data with full history.
https://dvc.org/doc/use-cases/versioning-data-and-model-files
Hi! I have been using scanpy and anndata and have started noticing potential difficulties as we analyze more complex datasets.

Problem
There can be a ballooning number of implicit dependencies in adata. For example, running sc.pl.pca_variance_ratio requires running sc.tl.pca first, the results of which are written directly into adata. This may not seem like a big problem but can lead to inconsistencies down the line if we run the first function with different parameters but forget to run the second again. The number of columns in obs and var can also grow to the point that we lose track of how each column came to be. Each step of the function could accidentally cause side effects that the user may not be aware of.

Proposed solution
This issue could be fixed by making anndata and its applications more functional. By making anndata immutable, we could prevent inadvertent modification and cache intermediate results using lru_cache. This way, we won't need to modify adata within a session. Important or expensive functions would return their results in their own container or dataclass, with methods for working with them. Functions would rely less on side effects, which would lead to more predictable outcomes in the long run. Another benefit would be the possibility of safely parallelizing pipelines in the future.

At the end of the session, expensive results could be saved along with their associated parameters as a folder in an HDF5 file, an automatic data provenance mechanism. The user could then pick which of the fields they want to save. Mappings from obs or var to this container could be created to maintain backward compatibility.

Thanks for reading! Let me know if you are interested in this approach.
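For a quick feel of the pattern described above, here is a minimal, standard-library-only sketch. It is not the POC referenced in this issue and not the scanpy or anndata API: Counts, VarianceResult, column_variance and describe are hypothetical names, and a toy per-column variance stands in for an expensive step such as PCA.

```python
from dataclasses import FrozenInstanceError, dataclass
from functools import lru_cache
from typing import Tuple


@dataclass(frozen=True)
class Counts:
    """Immutable, hashable stand-in for the data (rows = cells, columns = genes)."""
    values: Tuple[Tuple[float, ...], ...]


@dataclass(frozen=True)
class VarianceResult:
    """Result container that carries the parameters used to compute it."""
    per_column: Tuple[float, ...]
    ddof: int

    def ratio(self) -> Tuple[float, ...]:
        total = sum(self.per_column)
        return tuple(v / total for v in self.per_column)


@lru_cache(maxsize=None)
def column_variance(data: Counts, ddof: int = 1) -> VarianceResult:
    """Toy 'expensive' step; cached per (data, ddof) because both are hashable."""
    def var(col: Tuple[float, ...]) -> float:
        mean = sum(col) / len(col)
        return sum((x - mean) ** 2 for x in col) / (len(col) - ddof)

    return VarianceResult(tuple(var(col) for col in zip(*data.values)), ddof)


def describe(result: VarianceResult) -> str:
    """Downstream step: takes the result explicitly instead of reading hidden state."""
    shares = ", ".join(f"{r:.2f}" for r in result.ratio())
    return f"variance shares (ddof={result.ddof}): {shares}"


counts = Counts(((1.0, 2.0), (3.0, 2.0), (5.0, 8.0)))
first = column_variance(counts, ddof=1)
again = column_variance(counts, ddof=1)  # same arguments: served from the cache
other = column_variance(counts, ddof=0)  # different parameters: separate result

try:
    counts.values = ()  # the input cannot be modified in place
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True
```

Because describe receives a VarianceResult rather than looking something up in a shared object, it cannot silently pick up a result computed with stale parameters, and the ddof stored on the result is the kind of information that could be written out alongside it at the end of a session.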
Example implementation