laminlabs / lamindb

A data framework for biology.
https://docs.lamin.ai
Apache License 2.0

✨ Track queries per run #2052

Open falexwolf opened 2 weeks ago

falexwolf commented 2 weeks ago

We're tracking input artifacts & collections whenever either of them is accessed, i.e., cached, loaded, opened, or iterated over.

What we don't track as inputs are queries of any entity: .get() and .filter() statements.

We'll have these in an audit log but the question is how easy that's going to be to decipher if there is no run record.

Should we also track this information? It'd lead to another large set of link tables because, e.g., there'd be a cell_type__queried_in_runs table (via a many-to-many CellType._queried_in_runs).
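To make the cost concrete, here is a minimal sketch of what one such link table amounts to, written against plain `sqlite3` rather than lamindb's actual models. Only the table name `cell_type__queried_in_runs` comes from the proposal; the `run` and `cell_type` tables, their columns, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified stand-ins for the real registries (assumed schema).
    CREATE TABLE run (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cell_type (id INTEGER PRIMARY KEY, name TEXT);

    -- One link table per queried entity; this is the many-to-many part.
    CREATE TABLE cell_type__queried_in_runs (
        cell_type_id INTEGER NOT NULL REFERENCES cell_type (id),
        run_id INTEGER NOT NULL REFERENCES run (id),
        PRIMARY KEY (cell_type_id, run_id)
    );
    """
)

conn.execute("INSERT INTO run (id, name) VALUES (1, 'run-a'), (2, 'run-b')")
conn.execute("INSERT INTO cell_type (id, name) VALUES (1, 'T cell')")

# The same CellType record was queried in two different runs.
conn.executemany(
    "INSERT INTO cell_type__queried_in_runs (cell_type_id, run_id) VALUES (?, ?)",
    [(1, 1), (1, 2)],
)
conn.commit()

# "In which runs was this CellType queried?"
runs = [
    row[0]
    for row in conn.execute(
        """
        SELECT run.name
        FROM run
        JOIN cell_type__queried_in_runs AS link ON link.run_id = run.id
        WHERE link.cell_type_id = 1
        ORDER BY run.id
        """
    )
]
print(runs)
```

Every other entity that can be queried would need an analogous table, which is where the large number of link tables comes from.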

Opinions: @sunnyosun @Zethson @chaichontat

Zethson commented 2 weeks ago

My immediate intuition says: "Not important at the moment". It'd be nice to have for getting a feel for how datasets are used, and it's a cool feature, but I don't see many use cases.

I'm voting "no" for now unless you have a few great use cases in mind?

falexwolf commented 2 weeks ago

I agree that these are very advanced use cases.

But one thing I've heard again and again from the data architects & engineers is that "everything should be tracked".

The thing is: it's impossible to make unmeasured data appear. So, most platforms instrument as much as they can even if it's almost never used.

Knowing how much, when, by whom, through which code etc. a CellType record was queried is incredibly fine-grained; I know. Still, it could be useful in some instances.

The main downside of a naive implementation is the hit to query speed, but what could be done is keeping a cached log of all these operations and then committing them in one transaction upon .finish().
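A rough sketch of that idea, assuming a single `query_log` table and plain `sqlite3` instead of lamindb's internals. The `QueryLogBuffer` class, its `record()` method, and the table layout are hypothetical names for illustration; only the "buffer in memory, write once on `.finish()`" pattern comes from the comment above.

```python
import sqlite3
from datetime import datetime, timezone


class QueryLogBuffer:
    """Hypothetical helper: collects query events in memory for one run."""

    def __init__(self, conn, run_id):
        self.conn = conn
        self.run_id = run_id
        self._pending = []

    def record(self, entity, statement):
        # Only appends to a list, so the query itself pays no database write.
        queried_at = datetime.now(timezone.utc).isoformat()
        self._pending.append((self.run_id, entity, statement, queried_at))

    def finish(self):
        # `with conn:` commits on success and rolls back on an exception,
        # so all buffered events land in a single transaction.
        with self.conn:
            self.conn.executemany(
                "INSERT INTO query_log (run_id, entity, statement, queried_at) "
                "VALUES (?, ?, ?, ?)",
                self._pending,
            )
        flushed = len(self._pending)
        self._pending.clear()
        return flushed


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE query_log "
    "(run_id INTEGER, entity TEXT, statement TEXT, queried_at TEXT)"
)
conn.commit()

buffer = QueryLogBuffer(conn, run_id=1)
buffer.record("CellType", "filter(name__startswith='T')")
buffer.record("CellType", "get(name='B cell')")

rows_before = conn.execute("SELECT COUNT(*) FROM query_log").fetchone()[0]
flushed = buffer.finish()
rows_after = conn.execute("SELECT COUNT(*) FROM query_log").fetchone()[0]
```

The trade-off in this sketch is that events are lost if the process dies before `finish()` is called, which may or may not be acceptable for an audit-style record.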

Hm. Let's keep it in the backlog.