radiocosmology / alpenhorn

Alpenhorn is a service for managing an archive of scientific data.
MIT License

Filter support on the command line #19

Open jrs65 opened 7 years ago

jrs65 commented 7 years ago

At the moment the command line tools (at best) have an --acq option for restricting operations like sync and clean to subsets of the archive. It would be great to replace this with a more generic filter option. This should probably be kept very simple: it's not meant to be a reimplementation of SQL, just enough to express common operations a little more conveniently.

I haven't exactly figured out the syntax of how this would work. Something like this:

--filter 'acq=2017??01*; file=0000.h5,0001.h5'  # Filter on the acq (first day of month in 2017, either of two file names)
--filter 'acqtype=corr; filetype=log'  # Filter all log files in corr acquisitions
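
As a rough illustration of how small a parser for this syntax could be, here is a minimal sketch. The function name `parse_filter` is hypothetical and nothing here is existing alpenhorn code; the four allowed keys are the ones named in this issue. It assumes `;` separates clauses and `,` separates alternatives, so values containing either character cannot be expressed.

```python
# Hypothetical sketch: parse "key=v1,v2; key2=v3" into {key: [values]}.
ALLOWED_KEYS = {"acq", "file", "acqtype", "filetype"}


def parse_filter(expr):
    """Split a filter expression into clauses and their alternatives."""
    clauses = {}
    for clause in expr.split(";"):
        clause = clause.strip()
        if not clause:
            continue  # tolerate a trailing or doubled ';'
        key, sep, values = clause.partition("=")
        key = key.strip()
        if not sep or key not in ALLOWED_KEYS:
            raise ValueError(f"invalid filter clause: {clause!r}")
        alternatives = [v.strip() for v in values.split(",") if v.strip()]
        if not alternatives:
            raise ValueError(f"no values given in clause: {clause!r}")
        # Repeating a key adds more alternatives to the same clause.
        clauses.setdefault(key, []).extend(alternatives)
    return clauses


parsed = parse_filter("acq=2017??01*; file=0000.h5,0001.h5")
```

Here `parsed` maps each key to its list of alternative patterns; how those are combined into a query is a separate step.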

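Thoughts:

- Filtering allowed on acq, file, acqtype and filetype.
- Basic wildcards should be allowed (use the SQL LIKE operation).
- Multiple clause types are implicitly AND (e.g. acq= AND file=).
- Do we want to allow OR? Maybe becoming too complex. However, multiple alternatives within a clause should be allowed (e.g. acq=acq1,acq2 is acq=acq1 OR acq=acq2).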

jrs65 commented 7 years ago

@cubranic @kiyo-masui @ahincks any suggestions on this one?

cubranic commented 7 years ago

I like the logic of how implicit ANDs and ORs would work, it makes intuitive sense and should cover most typical command-line use cases.

It would be great if the wildcards were full glob patterns, as we allow in configs (i.e., including “**” for deep subdirectory matching). We could translate them to regular expressions and use those with Peewee's “regexp” method (http://docs.peewee-orm.com/en/latest/peewee/api.html#Node.regexp). PostgreSQL, MySQL, and SQLite all support regular expressions, although each has its own syntax, and I assume Peewee translates its “regexp” call to database-specific SQL syntax.
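
To make the glob-to-regex idea concrete, here is a small sketch using only the Python standard library. The translator `glob_to_regex` is a hand-rolled stand-in written for this example (it is not the API of any glob package, and it treats "**" simply as "match anything, including /"). The query runs against the standard `sqlite3` module rather than Peewee, with a user-defined REGEXP function; the table and acquisition names are made up.

```python
import re
import sqlite3


def glob_to_regex(pattern):
    """Translate a glob to an anchored regex string.

    Simplified rules for this sketch: '**' crosses '/', while '*' and '?'
    do not. Character classes like [abc] are not handled.
    """
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "^" + "".join(out) + "$"


def regexp(pattern, value):
    # SQLite evaluates "X REGEXP Y" by calling regexp(Y, X).
    return 1 if value is not None and re.search(pattern, value) else 0


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE acq (name TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO acq VALUES (?)",
    [("20170101T000000Z_corr",), ("20170215T000000Z_corr",), ("20171201T120000Z_corr",)],
)
rows = conn.execute(
    "SELECT name FROM acq WHERE name REGEXP ? ORDER BY name",
    (glob_to_regex("2017??01*"),),
).fetchall()
matches = [row[0] for row in rows]
```

With these made-up names, `matches` contains the two acquisitions whose day-of-month field is 01. Plain SQLite has no built-in REGEXP implementation, which is why the function is registered explicitly here.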

On May 8, 2017, at 9:13 AM, Richard Shaw wrote:

Thoughts:

- Filtering allowed on acq, file, acqtype and filetype.
- Basic wildcards should be allowed (use the SQL LIKE operation).
- Multiple clause types are implicitly AND (e.g. acq= AND file=).
- Do we want to allow OR? Maybe becoming too complex. However, multiple alternatives within a clause should be allowed (e.g. acq=acq1,acq2 is acq=acq1 OR acq=acq2).
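
The AND/OR semantics quoted above, combined with the simpler SQL LIKE option, could be sketched roughly as follows. This is only an illustration under assumptions: `glob_to_like`, `build_where`, and the column names `acq.name` and `file.name` are hypothetical, and `!` is chosen as the LIKE escape character purely for the example. Values are passed as query parameters rather than pasted into the SQL string.

```python
def glob_to_like(pattern, escape="!"):
    """Map '*' to '%' and '?' to '_', escaping literal LIKE metacharacters."""
    out = []
    for c in pattern:
        if c == "*":
            out.append("%")
        elif c == "?":
            out.append("_")
        elif c in ("%", "_", escape):
            out.append(escape + c)
        else:
            out.append(c)
    return "".join(out)


def build_where(clauses, columns):
    """Build a parameterized WHERE body from {key: [patterns]}.

    Alternatives within one key are ORed; different keys are ANDed.
    `columns` maps each filter key to a (hypothetical) column name.
    """
    parts, params = [], []
    for key, patterns in clauses.items():
        ors = " OR ".join(columns[key] + " LIKE ? ESCAPE '!'" for _ in patterns)
        parts.append("(" + ors + ")")
        params.extend(glob_to_like(p) for p in patterns)
    return " AND ".join(parts), params


sql, params = build_where(
    {"acq": ["2017??01*"], "file": ["0000.h5", "0001.h5"]},
    {"acq": "acq.name", "file": "file.name"},
)
```

Note that LIKE wildcards have no notion of path separators, so this route cannot express the "**" versus "*" distinction that full glob patterns make.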

ahincks commented 7 years ago

I like the idea. I have no strong opinion about using glob or not, but I would have reservations if it isn't fully translated/supported by the databases peewee works with.

I think the implicit AND/OR as Richard has it looks good.

Another approach would be for the argument of the --filter option to literally be an SQL expression that can be plugged right into a WHERE clause. Then alpenhorn doesn't need to do any work. But this may be me thinking too much in direct SQL mode rather than in database ORM mode ...

jrs65 commented 7 years ago

Great. Thanks for the feedback!

Adam, I think Davor is suggesting we use the globre package that we already use to translate extended glob patterns into regular expressions, and then we use peewee's native regular expression support to do the query. That seems pretty reasonable to me, and I think everything should be fully translated.

I think you're always thinking too much in direct SQL mode! :)

I think another option for this would be to just break it out into standard command line arguments, e.g.

--acq="2017??01*" --file=0000.h5 --file=0001.h5  # Filter 1
--acqtype=corr --filetype=log  # Filter all log files in corr acquisitions
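
For reference, repeated options like the ones above are straightforward to collect with the standard library. The sketch below uses `argparse` only to stay self-contained; it is an assumption for illustration, not a statement about which CLI framework alpenhorn uses, and the program name is made up.

```python
import argparse


def make_parser():
    """Build a parser where each filter option may be given several times."""
    parser = argparse.ArgumentParser(prog="alpenhorn-sketch")  # hypothetical name
    for name in ("acq", "file", "acqtype", "filetype"):
        parser.add_argument(
            f"--{name}",
            action="append",  # collect every occurrence into a list
            default=[],
            metavar="PATTERN",
            help=f"restrict to matching {name} (repeat for alternatives)",
        )
    return parser


# Parse an explicit argument list instead of sys.argv for the example.
args = make_parser().parse_args(
    ["--acq=2017??01*", "--file=0000.h5", "--file=0001.h5"]
)
```

After parsing, `args.file` holds both file names and `args.acqtype` is an empty list, so the same "OR within a key, AND across keys" rule can be applied to the resulting lists.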

I guess there's two advantages to doing this:

Disadvantages: