scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Tracking operations performed on AnnData objects #472

Open afrendeiro opened 5 years ago

afrendeiro commented 5 years ago

When exploring various preprocessing options, I try to avoid keeping several copies of AnnData objects in memory if they're not sparse, so I save them to h5ad at key steps. Sometimes, alas, after a few iterations I overwrite things and forget which operations have been performed on my `X` (particularly in the preprocessing steps).

So, because being lazy makes me creative, I started tracking these in the object itself (see example: https://gist.github.com/afrendeiro/7ccaf324bfdbff042ae36f734f544860) by decorating the preprocessing functions post hoc (this could even be used to save the values of any kwargs passed).

I wonder if an internal implementation of this would be of broad interest, particularly for functions which modify `X` in place? Of course this would be no replacement for proper documentation of one's steps, but I thought it could be an interesting addition to scanpy in any case.
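For reference, the gist's approach can be sketched roughly like this. Everything here is hypothetical, not scanpy API: the `track` decorator and the `"operations"` key in `.uns` are made-up names, and the demo class only stands in for anything exposing a dict-like `.uns` (such as an AnnData):

```python
from functools import wraps


def track(func):
    """Hypothetical decorator: record each call in the object's .uns mapping."""
    @wraps(func)
    def wrapper(adata, *args, **kwargs):
        # Append the record up front, so it survives even if func mutates X in place
        adata.uns.setdefault("operations", []).append(
            {"func": func.__name__, "kwargs": dict(kwargs)}
        )
        return func(adata, *args, **kwargs)
    return wrapper


@track
def log1p_like(adata, base=None):
    pass  # stand-in for an in-place transformation of adata.X


class _Demo:
    """Minimal stand-in for AnnData: anything with a dict .uns works."""
    def __init__(self):
        self.uns = {}


adata = _Demo()
log1p_like(adata, base=2)
# adata.uns["operations"] is now [{"func": "log1p_like", "kwargs": {"base": 2}}]
```

Because the record lives in `.uns`, it is carried along whenever the object is written to and read back from h5ad.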

ivirshup commented 5 years ago

I like this idea, but I think it could be expanded on a bit, and I see benefits to approaching it through logging. What if tracking were done through logging? Here are a couple of quick examples of what I mean:

Simple example. Logs `anndata` used, function called, time elapsed:

```python
from anndata import AnnData
from datetime import datetime
from functools import wraps
from structlog import get_logger
from time import sleep
import uuid

logger = get_logger()


def logged(func):
    @wraps(func)
    def func_wrapper(*args, **kwargs):
        call_id = uuid.uuid4()  # So we can always match call start with call end
        call_start_record = dict(call_id=call_id, called_func=func.__name__)
        if type(args[0]) is AnnData:
            call_start_record["adata_id"] = id(args[0])
        logger.msg("call", **call_start_record)
        t0 = datetime.now()
        output = func(*args, **kwargs)
        dt = datetime.now() - t0
        call_finish_record = dict(called_func=func.__name__, elapsed=dt)
        if type(output) is AnnData:
            call_finish_record["returned_adata_id"] = id(output)
        logger.msg("call_finish", **call_finish_record, call_id=call_id)
        return output
    return func_wrapper


# Usage
@logged
def foo(adata, x, copy=False):
    sleep(0.5)
    if copy:
        return adata.copy()


import scanpy as sc
pbmcs = sc.datasets.pbmc68k_reduced()

foo(pbmcs, 1)
# 2019-02-13 19:27.58 call        adata_id=4937049368 call_id=UUID('82f3944c-08c1-470a-9d39-03dcabc091a2') called_func=foo
# 2019-02-13 19:27.58 call_finish call_id=UUID('82f3944c-08c1-470a-9d39-03dcabc091a2') called_func=foo elapsed=datetime.timedelta(microseconds=500777)

foo(pbmcs, 1, copy=True);
# 2019-02-13 19:28.02 call        adata_id=4937049368 call_id=UUID('986f57e4-656a-41b1-9c7c-a7c5ad5b01fc') called_func=foo
# 2019-02-13 19:28.03 call_finish call_id=UUID('986f57e4-656a-41b1-9c7c-a7c5ad5b01fc') called_func=foo elapsed=datetime.timedelta(microseconds=505970) returned_adata_id=4940502352
```
More complicated example with argument value logging:

```python
from anndata import AnnData
from copy import copy
from datetime import datetime
from functools import wraps
import inspect
from itertools import chain
from structlog import get_logger
from time import sleep
import uuid

logger = get_logger()


def logged(logged_args=None):
    """
    Params
    ------
    logged_args : list[str], optional (default: `None`)
        Names of arguments to log.
    """
    if logged_args is None:
        logged_args = []

    def logged_decorator(func):
        argnames = inspect.getfullargspec(func).args

        @wraps(func)
        def func_wrapper(*args, **kwargs):
            call_id = uuid.uuid4()  # So we can always match call start with call end
            logged_params = {}
            for param, val in chain(zip(argnames, args), kwargs.items()):
                if type(val) is AnnData:
                    logged_params[param] = id(val)
                elif param in logged_args:
                    # Probably need to consider how these values get logged
                    logged_params[param] = copy(val)
            logger.msg("call", called_func=func.__name__, logged_args=logged_params, call_id=call_id)
            t0 = datetime.now()
            output = func(*args, **kwargs)
            dt = datetime.now() - t0
            call_finish_record = dict(
                called_func=func.__name__,
                elapsed=dt,
            )
            if type(output) is AnnData:
                call_finish_record["returned_adata_id"] = id(output)
            logger.msg("call_finish", **call_finish_record, call_id=call_id)
            return output
        return func_wrapper
    return logged_decorator


# Usage
@logged(logged_args=["x"])
def foo(adata, x, copy=True):
    sleep(0.5)
    if copy:
        return adata.copy()


import scanpy as sc
pbmcs = sc.datasets.pbmc68k_reduced()

foo(pbmcs, 1, copy=True)
# 2019-02-13 19:35.48 call        call_id=UUID('f7623504-31c5-4afa-ae26-4f58fc5341a8') called_func=foo logged_args={'adata': 4476410456, 'x': 1}
# 2019-02-13 19:35.48 call_finish call_id=UUID('f7623504-31c5-4afa-ae26-4f58fc5341a8') called_func=foo elapsed=datetime.timedelta(microseconds=507880) returned_adata_id=5064494384
```
afrendeiro commented 5 years ago

Interesting ideas. I actually hadn't noticed that scanpy has little logging implemented; that alone would be useful and could already solve half the problem. However, I doubt the best way to go about this is post hoc with decorators and the like; rather, it should be built intrinsically into the various API functions.

Regardless of logging, I still think that something intrinsically attached to the object would have the advantage that the exact set of operations could be recovered from the h5ad file/AnnData object alone. I don't know whether many people out there are actually sharing these objects yet, but it could be useful from that perspective too.

ivirshup commented 5 years ago

Scanpy does have logging implemented (examples: neighbors, highly variable genes), but it's not that widely used. I think this is because it has to be implemented manually in the code (not sure if this is what you mean by "intrinsic"?), which takes some effort and which not all contributors are aware of.

I think using a decorator would be nice for abstracting out the process. It would bring consistency of usage (by making it easy), consistency of logged messages, and separation of concerns between computation and tracking.

I also think you'd be able to know the exact set of operations from this approach. Assuming all top level functions have been wrapped with a decorator like the one I presented above, this code:

```python
adata = sc.read_10x_h5("./10x_run/outs/filtered_gene_matrix.h5")
sc.pp.normalize_per_cell(adata, 1000)
sc.pp.log1p(adata)
sc.pp.pca(adata)
adata.write("./cache/01_simple_process.h5ad")
```

Should result in a set of (pseudo-)records like:

```python
# Where id(1) is a stand-in for a value like `id(adata)`
{"call": "read_10x_h5", "args": {"filename": "./10x_run/outs/filtered_gene_matrix.h5"}, "returned_adata_id": id(1)}
{"call": "normalize_per_cell", "args": {"counts_per_cell_after": 1000}, "adata_id": id(1)}
{"call": "log1p", "adata_id": id(1)}
{"call": "pca", "adata_id": id(1)}
{"call": "write", "args": {"filename": "./cache/01_simple_process.h5ad"}, "adata_id": id(1)}
```

It's pretty trivial to go through these logs and figure out what happened to the AnnData, and this could be made accessible through helper functions. Maybe they'd look like `sc.logging.get_operations(adata_id=id(adata))` or `sc.logging.get_operations(written_to="./cache/01_simple_process.h5ad")`. There could also be a helper function that adds the relevant records to some field in `.uns` of the relevant AnnData object, or a setting that has a log handler do that automatically.
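As a minimal sketch, assuming the log records are plain dicts like the pseudo-records above (the name `get_operations` and both keyword arguments are hypothetical, not existing scanpy API):

```python
def get_operations(records, adata_id=None, written_to=None):
    """Hypothetical helper: filter log records for one AnnData or one output file."""
    if written_to is not None:
        # Resolve the file back to the id of the last AnnData written to it
        writes = [r for r in records
                  if r.get("call") == "write"
                  and r.get("args", {}).get("filename") == written_to]
        adata_id = writes[-1]["adata_id"] if writes else adata_id
    if adata_id is None:
        return []
    return [r for r in records
            if adata_id in (r.get("adata_id"), r.get("returned_adata_id"))]


# Usage, with small ints standing in for real id() values
records = [
    {"call": "read_10x_h5", "args": {"filename": "./10x_run/outs/filtered_gene_matrix.h5"}, "returned_adata_id": 1},
    {"call": "normalize_per_cell", "args": {"counts_per_cell_after": 1000}, "adata_id": 1},
    {"call": "log1p", "adata_id": 1},
    {"call": "pca", "adata_id": 1},
    {"call": "write", "args": {"filename": "./cache/01_simple_process.h5ad"}, "adata_id": 1},
]
ops = get_operations(records, written_to="./cache/01_simple_process.h5ad")
# ops contains all five records, since they all touch the AnnData written to that file
```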

Is there some set of information this wouldn't capture?

afrendeiro commented 5 years ago

Sorry, I missed the logging. I also hadn't seen the sc.settings.logfile option, which obviously makes sense and is convenient for having persistent records when working interactively with AnnData objects. I guess more consistent logging across scanpy functions alone would already be really great.

Something like sc.logging.get_operations(adata_id=id(adata)) would also be super cool, but would it be able to retrieve records of operations performed across rounds of object serialization?

e.g.:

```python
adata = sc.read_10x_h5("./10x_run/outs/filtered_gene_matrix.h5")
sc.pp.normalize_per_cell(adata, 1000)
sc.pp.log1p(adata)
sc.pp.pca(adata)
adata.write("./cache/01_simple_process.h5ad")

adata = sc.read("./cache/01_simple_process.h5ad")
sc.pp.scale(adata)
adata.write("./cache/01_simple_process.h5ad")
print(sc.logging.get_operations(adata_id=id(adata)))
```

would probably forget the first set of operations?

```python
# Where id(1) is a stand-in for a value like `id(adata)`
{"call": "scale", "adata_id": id(1)}
{"call": "write", "args": {"filename": "./cache/01_simple_process.h5ad"}, "adata_id": id(1)}
```

I guess one solution would be to follow the path of ids up the log to retrieve everything, which seems doable, so this could be a good system.
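Following the path of ids could look something like this sketch: walk the log backwards, and whenever the current object came from a `read`, jump to the id of the last earlier `write` to that file (all names here are hypothetical, and small ints stand in for `id()` values):

```python
def get_operations_exhaustive(records, adata_id):
    """Hypothetical helper: recover the full history across serializations."""
    ops, current = [], adata_id
    for i in range(len(records) - 1, -1, -1):
        rec = records[i]
        if current not in (rec.get("adata_id"), rec.get("returned_adata_id")):
            continue
        ops.append(rec)
        # A read means this object came from a file: continue with the id
        # of whichever AnnData was last written there before this point
        if rec["call"].startswith("read"):
            fname = rec.get("args", {}).get("filename")
            for prev in reversed(records[:i]):
                if prev["call"] == "write" and prev["args"]["filename"] == fname:
                    current = prev["adata_id"]
                    break
    return ops[::-1]  # back to chronological order


# The two-session example from above: id 1 before the round trip, id 2 after
records = [
    {"call": "read_10x_h5", "args": {"filename": "./10x_run/outs/filtered_gene_matrix.h5"}, "returned_adata_id": 1},
    {"call": "normalize_per_cell", "args": {"counts_per_cell_after": 1000}, "adata_id": 1},
    {"call": "log1p", "adata_id": 1},
    {"call": "write", "args": {"filename": "./cache/01_simple_process.h5ad"}, "adata_id": 1},
    {"call": "read", "args": {"filename": "./cache/01_simple_process.h5ad"}, "returned_adata_id": 2},
    {"call": "scale", "adata_id": 2},
    {"call": "write", "args": {"filename": "./cache/01_simple_process.h5ad"}, "adata_id": 2},
]
history = get_operations_exhaustive(records, adata_id=2)
# history contains all seven records, not just the three touching id 2
```

Filtering on `adata_id=2` alone would return only the last three records; the jump through the read/write pair is what recovers the first session.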

The one thing this wouldn't cover though is persistence within the h5ad object itself. This would be useful in the case of sharing the object with someone for example. As I mentioned before, I'm not sure this is a widespread use case yet, but could be useful.

ivirshup commented 5 years ago

Yeah that's what I was thinking for tracking between serializations. I figure there could be a boolean argument like exhaustive which would signify whether you want this particular AnnData or all previous AnnDatas this could be derived from in the logs.

I think it'll be possible to write the logs to some field in an object. There is a question of how complicated this would be to implement, which I haven't quite figured out yet. Maybe you'd add a reference to the AnnData to the logging context, and make a custom logger which decides where to write based on that? Alternatively, maybe this just gets handled by some logic in the decorator. So after a method is called, there's a flag about whether to add records to the modified object.
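The custom-logger route could be sketched with the standard library's `logging` module (structlog could wrap the same idea): a handler that mirrors every formatted record into a `.uns`-like mapping on the object. The class and key names here are made up for illustration:

```python
import logging


class UnsHandler(logging.Handler):
    """Hypothetical handler: mirror formatted log records into a .uns-like dict."""

    def __init__(self, uns, key="operations_log"):
        super().__init__()
        self.uns = uns
        self.key = key

    def emit(self, record):
        # Append the rendered message so it persists with the object
        self.uns.setdefault(self.key, []).append(self.format(record))


# Usage: attach to a logger, and every message also lands in the object
uns = {}  # stand-in for adata.uns
log = logging.getLogger("scanpy_demo")
log.setLevel(logging.INFO)
log.addHandler(UnsHandler(uns))
log.info("call: pp.scale")
# uns["operations_log"] == ["call: pp.scale"]
```

Writing the object to h5ad would then carry its own log along, which is the persistence piece discussed below.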

Of course, nothing is logged persistently by default, so it's already an extra step to enable. It's possible sharing could just need two extra steps, "enable persistent logging" and "send them the logs".

afrendeiro commented 5 years ago

sc.logging.get_operations with exhaustive would be great, but if one could also store the same records persistently in the object upon the user's request, that would cover all the ground.