Open afrendeiro opened 5 years ago
I like this idea, but think it could be expanded on a bit. I think there are benefits to approaching this through logging. Some advantages of doing this through logging:
What if tracking was done through logging? Here's a couple quick examples of what I mean:
Interesting ideas. I hadn't noticed that scanpy has no logging implemented; adding it would be useful and could already solve half the problem. However, I doubt the best way to go about this is post hoc with decorators; I'd rather see it done intrinsically throughout the various API functions.
Regardless of logging, I still think that having something intrinsically attached to the object would have an advantage: the exact set of operations could be known solely from the h5ad file/AnnData object itself. I don't know whether people out there are actually sharing these files, but it could be useful from that perspective too.
Scanpy does have logging implemented (examples: neighbors, highly variable genes), but it's not that widely used. I think this is because it has to be implemented manually in the code (not sure if this is what you mean by "intrinsic"?), which takes some effort, and not all contributors are aware of it.
I think using a decorator would be nice for abstracting out the process. The benefits would be consistent usage (because it's easy to apply), consistent log messages, and separation of concerns between computation and tracking.
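As a rough illustration of the decorator idea, here is a minimal sketch using only the standard library. The decorator name `tracked`, the logger name, and the dict-based stand-in for an AnnData object are all assumptions made for this example; none of them are scanpy API, and the stand-in function is not the scanpy implementation.

```python
import functools
import inspect
import json
import logging

logger = logging.getLogger("tracking_sketch")  # hypothetical logger name


def tracked(func):
    """Log one JSON record per call; the computation itself is untouched."""
    sig = inspect.signature(func)
    first_param = next(iter(sig.parameters))  # assumed to be the data object

    @functools.wraps(func)
    def wrapper(data, *args, **kwargs):
        result = func(data, *args, **kwargs)
        # Bind the call so positional arguments are recorded under their names.
        bound = sig.bind(data, *args, **kwargs)
        call_args = {
            name: repr(value)
            for name, value in bound.arguments.items()
            if name != first_param
        }
        record = {"call": func.__name__, "args": call_args, "adata_id": id(data)}
        logger.info(json.dumps(record))
        return result

    return wrapper


@tracked
def normalize_per_cell(data, counts_per_cell_after=None):
    """Stand-in for a preprocessing function, operating on a plain dict."""
    total = sum(data["X"])
    data["X"] = [x * counts_per_cell_after / total for x in data["X"]]
```

The function body contains no logging code at all, which is the separation of concerns mentioned above: tracking lives entirely in the wrapper.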
I also think you'd be able to know the exact set of operations from this approach. Assuming all top level functions have been wrapped with a decorator like the one I presented above, this code:
adata = sc.read_10x_h5("./10x_run/outs/filtered_gene_matrix.h5")
sc.pp.normalize_per_cell(adata, 1000)
sc.pp.log1p(adata)
sc.pp.pca(adata)
adata.write("./cache/01_simple_process.h5ad")
Should result in a set of (pseudo-)records like:
# Where id(1) is a stand in for value like `id(adata)`
{"call": "read_10x_h5", "args": {"filename": "./10x_run/outs/filtered_gene_matrix.h5"}, "returned_adata": id(1)}
{"call": "normalize_per_cell", "args": {"counts_per_cell_after": 1000}, "adata_id": id(1)}
{"call": "log1p", "adata_id": id(1)}
{"call": "pca", "adata_id": id(1)}
{"call": "write", "args" : {"filename": "./cache/01_simple_process.h5ad"}, "adata_id": id(1)}
It's pretty trivial to go through these logs and figure out what happened to the AnnData, and this could be made accessible through helper functions. Maybe they'd look like `sc.logging.get_operations(adata_id=id(adata))` or `sc.logging.get_operations(written_to="./cache/01_simple_process.h5ad")`. There could also be a helper function to add the relevant records to some field in `.uns` of the relevant AnnData object, or a setting which has a log handler do that automatically.
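To make the lookup concrete, here is a minimal sketch of what such a helper might do over an in-memory list of records. The function is a hypothetical stand-in for the proposed `sc.logging.get_operations`, which does not exist in scanpy; the integer ids `1` and `2` stand in for `id(adata)` values, and the record layout follows the pseudo-records above.

```python
# Hypothetical records; 1 and 2 stand in for id(adata) of two objects.
RECORDS = [
    {"call": "read_10x_h5",
     "args": {"filename": "./10x_run/outs/filtered_gene_matrix.h5"},
     "returned_adata": 1},
    {"call": "normalize_per_cell", "args": {"counts_per_cell_after": 1000},
     "adata_id": 1},
    {"call": "log1p", "adata_id": 2},  # unrelated object, should be filtered out
    {"call": "log1p", "adata_id": 1},
    {"call": "pca", "adata_id": 1},
    {"call": "write", "args": {"filename": "./cache/01_simple_process.h5ad"},
     "adata_id": 1},
]


def get_operations(records, adata_id=None, written_to=None):
    """Select the records of one object, by id or by the file it was written to."""
    if written_to is not None:
        # Resolve the file name to the object behind the most recent write to it.
        for rec in reversed(records):
            filename = rec.get("args", {}).get("filename")
            if rec["call"] == "write" and filename == written_to:
                adata_id = rec["adata_id"]
                break
    if adata_id is None:
        return []
    return [
        rec for rec in records
        if adata_id in (rec.get("adata_id"), rec.get("returned_adata"))
    ]


for rec in get_operations(RECORDS, written_to="./cache/01_simple_process.h5ad"):
    print(rec["call"])
```

Both lookups reduce to a filter on the object id, so the two call styles can share one code path.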
Is there some set of information this wouldn't capture?
Sorry I missed the logging. I also didn't see the `sc.settings.logfile` option, which makes absolute sense and is a convenient way to keep persistent records when working interactively with AnnData objects. I guess just more consistent logging across scanpy functions would be really great.
Something like `sc.logging.get_operations(adata_id=id(adata))` would also be super cool, but would it be able to retrieve records of operations performed across rounds of object serialization? E.g.:
adata = sc.read_10x_h5("./10x_run/outs/filtered_gene_matrix.h5")
sc.pp.normalize_per_cell(adata, 1000)
sc.pp.log1p(adata)
sc.pp.pca(adata)
adata.write("./cache/01_simple_process.h5ad")
adata = sc.read("./cache/01_simple_process.h5ad")
sc.pp.scale(adata)
adata.write("./cache/01_simple_process.h5ad")
print(sc.logging.get_operations(adata_id=id(adata)))
would probably forget the first set of operations?
# Where id(1) is a stand in for value like `id(adata)`
{"call": "scale", "adata_id": id(1)}
{"call": "write", "args" : {"filename": "./cache/01_simple_process.h5ad"}, "adata_id": id(1)}
I guess one solution would be to follow the path of ids up the log to retrieve all of the records, which seems doable, so this could be a good system.
The one thing this wouldn't cover though is persistence within the h5ad object itself. This would be useful in the case of sharing the object with someone for example. As I mentioned before, I'm not sure this is a widespread use case yet, but could be useful.
Yeah, that's what I was thinking for tracking between serializations. I figure there could be a boolean argument like `exhaustive`, which would signify whether you want the records for this particular AnnData only, or also for all previous `AnnData`s it could be derived from according to the logs.
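Following the ids up the log could look roughly like the sketch below. It assumes the record layout from the pseudo-records earlier in the thread, links a `read` to the latest earlier `write` of the same file name, and uses integers `1` and `2` in place of `id(adata)`. The function and its `exhaustive` flag are hypothetical, not existing scanpy API. Note also that real `id()` values can be reused once an object is garbage collected, so an actual implementation would probably need a more robust identifier.

```python
CACHE = "./cache/01_simple_process.h5ad"

# Hypothetical log of the session above; object 2 is the re-read AnnData.
RECORDS = [
    {"call": "read_10x_h5",
     "args": {"filename": "./10x_run/outs/filtered_gene_matrix.h5"},
     "returned_adata": 1},
    {"call": "normalize_per_cell", "args": {"counts_per_cell_after": 1000},
     "adata_id": 1},
    {"call": "log1p", "adata_id": 1},
    {"call": "pca", "adata_id": 1},
    {"call": "write", "args": {"filename": CACHE}, "adata_id": 1},
    {"call": "read", "args": {"filename": CACHE}, "returned_adata": 2},
    {"call": "scale", "adata_id": 2},
    {"call": "write", "args": {"filename": CACHE}, "adata_id": 2},
]


def get_operations(records, adata_id, exhaustive=False):
    """Return records for one object; optionally include its ancestors on disk."""
    own = [
        (i, rec) for i, rec in enumerate(records)
        if adata_id in (rec.get("adata_id"), rec.get("returned_adata"))
    ]
    ops = [rec for _, rec in own]
    if not exhaustive or not own:
        return ops
    first_idx, first = own[0]
    filename = first.get("args", {}).get("filename")
    if first.get("returned_adata") != adata_id or filename is None:
        return ops  # the object was not created by reading a file
    # Walk backwards to the latest earlier write to the same file.
    for i in range(first_idx - 1, -1, -1):
        rec = records[i]
        if rec["call"] == "write" and rec.get("args", {}).get("filename") == filename:
            earlier = get_operations(records[: i + 1], rec["adata_id"], exhaustive=True)
            return earlier + ops
    return ops


print([rec["call"] for rec in get_operations(RECORDS, 2)])
print([rec["call"] for rec in get_operations(RECORDS, 2, exhaustive=True)])
```

Without `exhaustive`, only the operations on the re-read object are returned; with it, the history of the object that was written to the same file is prepended.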
I think it'll be possible to write the logs to some field in an object. There is a question of how complicated this would be to implement, which I haven't quite figured out yet. Maybe you'd add a reference to the AnnData to the logging context, and make a custom logger which decides where to write based on that? Alternatively, maybe this just gets handled by some logic in the decorator: after a function is called, a flag determines whether to add records to the modified object.
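The first option (a reference to the object in the logging context plus a custom handler) might be sketched like this with the standard `logging` module. `FakeAnnData`, `UnsHandler`, the logger name, and the `"operations"` key are all made up for illustration; a real AnnData is replaced by a tiny class with a `.uns` dict, and whether such a list would serialize cleanly to h5ad is not checked here.

```python
import logging


class FakeAnnData:
    """Stand-in with only the `.uns` mapping that the sketch needs."""

    def __init__(self):
        self.uns = {}


class UnsHandler(logging.Handler):
    """Append each message to `.uns["operations"]` of the attached object."""

    def emit(self, record):
        # `adata` is attached to the record through the `extra=` argument below.
        target = getattr(record, "adata", None)
        if target is None:
            return  # nothing to write to; other handlers may still log it
        target.uns.setdefault("operations", []).append(record.getMessage())


logger = logging.getLogger("uns_sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(UnsHandler())
logger.propagate = False  # keep the example quiet on the root logger

adata = FakeAnnData()
logger.info("log1p", extra={"adata": adata})
logger.info("message without an attached object")
print(adata.uns["operations"])
```

Because the handler decides where to write, a file handler could be attached to the same logger alongside it, so the session log and the per-object record would be fed by the same calls.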
Of course, nothing is logged persistently by default, so it's already an extra step to enable. It's possible sharing could just need two extra steps, "enable persistent logging" and "send them the logs".
`sc.logging.get_operations` with `exhaustive` would be great, but if one could also find a way to store the same records persistently, or in the object itself, upon the user's request, that would cover all the ground.
When exploring various options of preprocessing data, I try to avoid having several copies of AnnData objects in memory if they're not sparse, so I save them to h5ad at key steps. Sometimes, alas, after a few iterations I rewrite things and forget which operations have been performed on my "X" (particularly in the preprocessing steps).
So, because being lazy makes me creative, I started tracking these in the object itself (see example https://gist.github.com/afrendeiro/7ccaf324bfdbff042ae36f734f544860) by decorating the preprocessing functions post hoc (this could easily be extended to also save the values of the kwargs passed).
I wonder if an internal implementation of this would be of broad interest, particularly for functions which modify "X" inplace? Of course this would be no replacement for proper documentation of one's steps, etc but I thought it could be an interesting addition to scanpy in any case.