Feature: Find datadicts matching a set of conditions

toolsforexperiments / plottr

A flexible plotting and data analysis tool.

https://github.com/toolsforexperiments/plottr

MIT License


Feature: Find datadicts matching a set of conditions #379

Closed yoshi74ls181 closed 1 year ago

yoshi74ls181 commented 1 year ago

This pull request adds a function plottr.data.datadict_storage.search_datadicts, which returns an iterator over the datadicts that match a set of conditions. The following conditions are currently supported:

since: Date (and time) in the format YYYY-mm-dd (or YYYY-mm-ddTHHMMSS).
until: Date (and time) in the format YYYY-mm-dd (or YYYY-mm-ddTHHMMSS). If not given, defaults to until = since.
name: Name of the dataset (if not given, match all datasets).

For convenience, I've also added a function plottr.data.datadict_storage.search_datadict, which asserts that there is only one matching datadict.
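As an illustration of how the since/until conditions above could be read as a time window, here is a minimal sketch. It is not the code from this pull request: the helpers parse_bound and search_window are hypothetical names, and treating a date-only upper bound as covering the whole day is an assumption chosen so that until = since still matches everything recorded on that date.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def parse_bound(text: str) -> Tuple[datetime, bool]:
    """Parse 'YYYY-mm-dd' or 'YYYY-mm-ddTHHMMSS'.

    Returns the parsed datetime and a flag telling whether a time was given.
    (Hypothetical helper, not part of plottr.)
    """
    try:
        return datetime.strptime(text, "%Y-%m-%dT%H%M%S"), True
    except ValueError:
        return datetime.strptime(text, "%Y-%m-%d"), False


def search_window(since: str, until: Optional[str] = None) -> Tuple[datetime, datetime]:
    """Turn since/until strings into an inclusive (start, end) window."""
    if until is None:
        until = since  # documented default: until = since
    start, _ = parse_bound(since)
    end, has_time = parse_bound(until)
    if not has_time:
        # Assumption: a date-only upper bound extends to the last second of that day.
        end = end + timedelta(days=1) - timedelta(seconds=1)
    return start, end
```

Under these assumptions, a date-only since selects the whole day, while a since that includes a time selects a single second, which would pick out one specific dataset.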

yoshi74ls181 commented 1 year ago

Resolved a merge conflict with #375.

marcosfrenkel commented 1 year ago

I really like this feature! But at the moment, if the search encounters any invalid data (the writer always creates a file, even if nothing is inside of it), the whole search fails. Because of this, it is hard to test on my end.

I am also a little unsure whether search_datadicts should return only a generator rather than a list of all the matching datadicts. The generator makes sense since the datadicts might be big, but offering both the generator and a function that returns a list might be worthwhile too and shouldn't take much effort. @wpfff what do you think?
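The two points above (one unreadable file aborting the whole search, and offering a list alongside the generator) can be sketched together. This is a generic illustration, not plottr code: iter_loaded, list_loaded, and the load callable are hypothetical names standing in for whatever reads a datadict from disk.

```python
import logging
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def iter_loaded(paths: Iterable[str], load: Callable[[str], T]) -> Iterator[Tuple[str, T]]:
    """Lazily yield (path, data) pairs, skipping entries that fail to load."""
    for path in paths:
        try:
            data = load(path)
        except Exception as exc:
            # Deliberately broad: one empty or corrupt file should not end the search.
            logger.warning("skipping %s: %s", path, exc)
            continue
        yield path, data


def list_loaded(paths: Iterable[str], load: Callable[[str], T]) -> List[Tuple[str, T]]:
    """Eager variant: collect every match into a list."""
    return list(iter_loaded(paths, load))
```

The list variant is a one-line wrapper around the generator, so keeping both costs little; callers with large datasets can stay with the lazy version.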

yoshi74ls181 commented 1 year ago

Thanks! I think I've resolved the error you encountered by fixing a bug in datadict_from_hdf5. Could you test this again?

yoshi74ls181 commented 1 year ago

Added the following search conditions:

only_complete: Only return datadicts tagged as complete. Defaults to True.
skip_trash: Skip datadicts tagged as trash. Defaults to True.
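A rough sketch of how these two flags might combine, assuming tags are stored as marker files inside each dataset folder. The file names `__complete__.tag` and `__trash__.tag` and the helper passes_tag_filters are assumptions made for illustration, not confirmed details of this pull request.

```python
from pathlib import Path

# Assumed marker-file names; adjust to however the tags are actually stored.
COMPLETE_TAG = "__complete__.tag"
TRASH_TAG = "__trash__.tag"


def passes_tag_filters(folder: Path, only_complete: bool = True, skip_trash: bool = True) -> bool:
    """Return True if a dataset folder survives both tag-based conditions."""
    if only_complete and not (folder / COMPLETE_TAG).exists():
        return False  # not tagged as complete
    if skip_trash and (folder / TRASH_TAG).exists():
        return False  # tagged as trash
    return True
```

Note that with both defaults set to True, a dataset without a complete tag is filtered out, so passing only_complete=False may be useful when checking why a search returns nothing.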

marcosfrenkel commented 1 year ago

Hello, sorry for the late response, it's been a busy couple of weeks.

I remember being able to test this but no matter how I try now the generator is always empty. @yoshi74ls181 could you give me an example of how it is supposed to be used?

yoshi74ls181 commented 1 year ago

No worries! Sorry about flooding you with so many pull requests recently; I don't mean to rush you at all.

Here's a usage example:

from plottr.data.datadict_storage import DataDict, DDH5Writer, search_datadicts, search_datadict

basedir = "C:\\plottr-data"

# create two datasets
data = DataDict(x=dict(), y=dict(axes=["x"]))
with DDH5Writer(data, basedir, name="test") as writer:
    writer.add_data(x=[1, 2, 3], y=[1, 2, 3])
data = DataDict(x=dict(), y=dict(axes=["x"]))
with DDH5Writer(data, basedir, name="test") as writer:
    writer.add_data(x=[1, 2, 3], y=[3, 2, 1])

# print all datasets named "test" from today
for foldername, datadict in search_datadicts(basedir, "2023-03-17", name="test"):
    print(foldername, datadict["x"]["values"], datadict["y"]["values"])

# print just the newest one
foldername, datadict = search_datadict(basedir, "2023-03-17", name="test", newest=True)
print(foldername, datadict["x"]["values"], datadict["y"]["values"])

# print the one with specific date and time
foldername, datadict = search_datadict(basedir, "2023-03-17T200540", name="test")
print(foldername, datadict["x"]["values"], datadict["y"]["values"])

wpfff commented 1 year ago

@yoshi74ls181 off-topic, but i couldn't find a way to message you in a different way :) it was great meeting you at the APS meeting! could you maybe let me know your email address? (you can email me directly at wpfaff at illinois dot edu)

yoshi74ls181 commented 1 year ago

@wpfff Have you received my email? I'm worried that it might have ended up in your spam folder because I sent it from my personal gmail account (I lost access to my university email when I graduated). No worries if it's just that you've been busy.

wpfff commented 1 year ago

this function is useful, and we have a similar one in our lab code -- but i'm not sure it should be part of plottr itself. there's a few conceptual issues:

it's hard to make this useable from the monitr gui
it assumes a particular way of data naming/storing, which we don't want to enforce in the package (currently you can easily change how naming works by making your own data writer, and everything else will keep working)

we're currently thinking on how to filter better in monitr, but we're not sure yet on the correct approach. I'm closing this for now, and we can re-open if needed.