sidora-tools / sidora.cli

A CLI for terminal based data extraction and summary for the MPI-SHH Department of Archaeogenetics PANDORA Database

Discussion thread on key use cases, interface rearrangement #4

Open stschiff opened 4 years ago

stschiff commented 4 years ago

We currently have this very nice and flexible module-based approach, where I can say ./sidora.R -m progress_table and it creates a progress table. However, it's a bit funny that - for example - the project filter is a global option. I think that design will create problems later.

For example, here is an exploratory workflow that I can imagine might be useful:

  1. Query Pandora quickly for all available projects (because you don't know by heart how you've named your favourite project again).
  2. Quickly list how many sites/samples/individuals or so there are per project (perhaps you have two projects with relatively similar names and you want to quickly double check the raw numbers of sites in each).
  3. Then select a specific project and output the full progress table for it.
  4. Print the progress table again with selected columns.
  5. Create a pdf or html report from that customised progress table to send to your team, say.

Here, steps 1-2 would not allow for a project selection (because you are printing everything), while steps 3-5 involve selecting a project.

I would therefore suggest we aim for a more controlled, less free, approach to the interface, where we have subcommands of the sort

./sidora.R list --projects -> listing all projects
./sidora.R list --projects --withStats -> as above, but with some key numbers summarised (nr of sites, individuals, ...)
./sidora.R list --tags -> listing all tags, with optional --withStats
./sidora.R list --sites -> listing all sites, with optional --withStats
./sidora.R view --project=XX -> list progress table
./sidora.R view --tag=XX -> similar
./sidora.R view --project=XX --columns=X,Y,Z -> show only the selected columns
./sidora.R view --project=XX --columns=X,Y,Z --output=html -> create report
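To make the proposed shape concrete, here is a minimal sketch of the list and view subcommands. It is written in Python with argparse purely for illustration (the actual tool is an R script), and the "text" default for --output is an assumption that is not part of the proposal.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of the proposed `list` and `view` subcommands."""
    parser = argparse.ArgumentParser(prog="sidora")
    sub = parser.add_subparsers(dest="command", required=True)

    # `list` prints all projects, tags or sites; there is no project selection.
    p_list = sub.add_parser("list")
    kind = p_list.add_mutually_exclusive_group(required=True)
    kind.add_argument("--projects", action="store_true")
    kind.add_argument("--tags", action="store_true")
    kind.add_argument("--sites", action="store_true")
    p_list.add_argument("--withStats", action="store_true")

    # `view` always requires a selection, either by project or by tag.
    p_view = sub.add_parser("view")
    sel = p_view.add_mutually_exclusive_group(required=True)
    sel.add_argument("--project")
    sel.add_argument("--tag")
    # Comma-separated column names become a Python list.
    p_view.add_argument("--columns", type=lambda s: s.split(","), default=None)
    # "text" as the default output is an assumption made for this sketch.
    p_view.add_argument("--output", choices=["text", "html"], default="text")
    return parser


args = build_parser().parse_args(
    ["view", "--project=XX", "--columns=X,Y,Z", "--output=html"]
)
print(args.command, args.project, args.columns, args.output)
```

The point of the split is that the project filter is no longer a global option: it only exists on the subcommand where it makes sense.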

So here I can see we already need list and view subcommands, and I'm sure we'll have a lot more in the future. This would move away from the very flexible module-based approach towards a more restrictive set of subcommands.

In terms of internal design, I think we should design this repo as a proper R package. So all functionality shown above from the command line should be also available as simple R functions (called for example sidora_list(...) and sidora_view(...)). Then, the CLI script would simply call those functions. Thereby we have covered both the interactive, programmatic use-case within R, and also the more immediate bash-based approach.

Destroy.

jfy133 commented 4 years ago

Actually, I think it's a good idea, it streamlines a lot and indeed fits with a 'cli' like tool, so probably would be clearer to a user.

One additional note (although this can come later): often I only need to build a report for archaeologists for a single (precious, in all cases ;)) sample.

So, if I understand correctly, for the view function, we could have a --individual (instead of --project), and that will then have a different sub-function to make a single-individual report? Is that correct?

stschiff commented 4 years ago

Yes, that's possible. I also think we should make it possible to select a specific site, or in fact multiple sites. And with respect to tags, I think we should allow a system where we list, say, everything for which a specific tag is set at a specific level. So perhaps something like ./sidora.R view --tag myTag --tagLevel Site, which would say: Show me everything that has "myTag" at the site level.
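As a rough illustration of the tag-plus-level idea, the sketch below filters a handful of in-memory records. It is in Python for illustration only; the Record class, its fields, the sample IDs and the way tags are stored per level are all hypothetical and do not reflect how Pandora stores tags.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    # Hypothetical flat record; the real Pandora tables look different.
    site: str
    individual: str
    sample: str
    # Tags keyed by the level at which they were set, e.g. {"Site": {"myTag"}}.
    tags: dict = field(default_factory=dict)


def select_by_tag(records, tag, level):
    """Keep every record that carries `tag` at the given level."""
    return [r for r in records if tag in r.tags.get(level, set())]


records = [
    Record("AAA", "AAA001", "AAA001.A", {"Site": {"myTag"}}),
    Record("ABB", "ABB001", "ABB001.A", {"Sample": {"myTag"}}),
    Record("ABC", "ABC001", "ABC001.A"),
]

# Equivalent in spirit to: ./sidora.R view --tag myTag --tagLevel Site
hits = select_by_tag(records, "myTag", "Site")
print([r.sample for r in hits])
```

Only the first record matches, because the second one carries the tag at the sample level rather than the site level.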

I think we need to come up with a general selection grammar or something... well, one step at a time.

nevrome commented 4 years ago

Let's revive the sidora hackhour on Friday. "Typical" workflows like the one outlined above will help to narrow down what this interface should do.

The idea to transform this repo again to an R package is good, but I believe it will be difficult to write functions that are both useful for command line data exploration and R data analysis. Within R you want tidy data structures, on the command line you want easily digestible and well readable output (and I want cool ascii plots). IMHO we should do one thing well and not try to support two different interfaces at once.

R users should rely on the core package only. Probably our work on the cli-backend-package will reveal new functions that should be part of core. Beyond the differences in output and purpose it's also confusing to require the user to effectively use two packages for exploring Pandora. One vaguely general and the other vaguely more specific.

jfy133 commented 4 years ago

I unfortunately am finding it more and more difficult to find time to join the hackathon at the moment due to continued childcare and piling up deadlines :(

But I see what you mean. Why couldn't the 'specific' reporting functions also just go in sidora.core? I suspect only people who want to get into the nitty gritty would go into sidora.core anyway.

stschiff commented 4 years ago

OK, you've convinced me about the separation: keep the R API in sidora.core and focus the cli only on bash usage. Sounds all good.

jfy133 commented 4 years ago

Clemens and I decided to try to solidify more of the design decisions, so I will make a draft here.

General Overview

sidora.cli will have a verb - noun 'like' grammar.

e.g.

view -> project/site/sample/capture
list -> site/project/sample/capture
summarise -> sample/project/tag

etc.

'Verb' Module Descriptions

List

Simply lists every entity matching a given criterion, one per row.

E.g. I want all sites for a project:

AAA
ABB
ABC

View

Provides all information for a single 'row' of a pandora table. This is essentially all the information that is displayed when on the Pandora Web UI.

Summarise

Gives summaries (totals, means, maps, lists etc.) of a given noun.

For example: This project has 10 sites with 40 samples.

The samples are from these countries, with LAT:LON on a fancy ascii nerd-map.

Tabulate

This provides all information for multiple entries of a pandora table in tabular form.

By default it is displayed as a markdown table.

An export function would allow exporting as a TSV file.

| Site | Sample ID | Type |
| ---- | --------- | ---- |
| AAA  | AAA001    | 21   |
| ABB  | ABB001    | 23   |
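A minimal sketch of the two output modes described for tabulate, in Python for illustration only. The function names to_markdown and to_tsv are hypothetical, and the rows are simply the example values from the table above.

```python
def to_markdown(header, rows):
    """Render rows as a simple markdown pipe table (the default display)."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)


def to_tsv(header, rows):
    """Render the same rows as tab-separated text (the export format)."""
    return "\n".join("\t".join(str(c) for c in r) for r in [header, *rows])


header = ["Site", "Sample ID", "Type"]
rows = [["AAA", "AAA001", 21], ["ABB", "ABB001", 23]]

print(to_markdown(header, rows))
print(to_tsv(header, rows))
```

Keeping both renderers on top of the same header and row data means the on-screen view and the exported file cannot drift apart.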

Report

tbc.

stschiff commented 4 years ago

So what "noun" means depends on the Verb:

jfy133 commented 4 years ago

Correct.

Summarise - good question @nevrome ?

nevrome commented 4 years ago

Since my work with ruby I have a deep-rooted dislike for every approach that forces context-sensitive plurals. I vote for only singulars everywhere.

stschiff commented 4 years ago

That's OK, I more meant conceptually... not clear to me whether summarise takes a type or an entity. With respect to list, I would suggest to also show some property columns for each entity, right? Like "Name", "Country", "Locality" per site... In the end we'll see whether we need to even have a summary command, depending on how fast/slow that is. But certainly OK to have it for now.

nevrome commented 4 years ago

Each of these modules (except list) takes an entity type and an entity id. What you suggest for list is now part of tabulate. I think we should quickly talk about it in our debriefing session later.

nevrome commented 4 years ago

ToDos:

  1. Fill the modules with life (as they are right now)
    • View (James)
    • Summary (Clemens)
    • Tabulate (maybe Thiseas?)
  2. Basic filter abilities (for the tabulate module?) -> maybe compare: advanced search of pandora webinterface

Focus: Data driven analysis approach

jfy133 commented 4 years ago

General Roadmap discussion:

WebApp will be the focus for non-bioinformaticians, e.g. to provide summary statistics for PIs; lab tracking (e.g. progress tables) for lab techs.

James will finish simple tasks (view) then move back to Web app.

Clemens will work on summary, which will also feed into later work for James and Stephan when we start developing the report module.

nevrome commented 4 years ago

I fell in love with this extremely neat, documentation string based CLI interface definition with docopt. How would you like an interface like this, @stschiff and @jfy133?

sidora.

Usage:
  sidora tutorial [options]
  sidora examples [options]
  sidora glance <entity_type> [options]
  sidora view <entity_type> <entity> [options]
  sidora summarise <entity_type> <entity> [options]
  sidora list <entity_type> (<entity>... | <filter_entity_type> <filter_string>) [options]
  sidora tabulate <entity_type> (<entity>... | <filter_entity_type> <filter_string>)  [--as_tsv | --as_pandora_upload] [options]

Options:
  -h --help            Show this screen
  --version            Show version
  --human-readable     Todo
  --credentials=FILE   Todo [default: .credentials]
  --cache_dir=DIR      Todo [default: ?]
  --empty_cache        Todo
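
For comparison, here is a rough sketch of part of the same grammar without docopt, using Python's argparse for illustration only. It covers glance, view, summarise and tabulate; the alternative `<filter_entity_type> <filter_string>` form, the list subcommand and most options are deliberately left out, so this is an approximation and not an equivalent of the usage string above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Options shared by every subcommand, standing in for [options].
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("--credentials", metavar="FILE", default=".credentials")
    common.add_argument("--empty_cache", action="store_true")

    parser = argparse.ArgumentParser(prog="sidora")
    sub = parser.add_subparsers(dest="command", required=True)

    # glance <entity_type>
    glance = sub.add_parser("glance", parents=[common])
    glance.add_argument("entity_type")

    # view / summarise <entity_type> <entity>
    for verb in ("view", "summarise"):
        p = sub.add_parser(verb, parents=[common])
        p.add_argument("entity_type")
        p.add_argument("entity")

    # tabulate <entity_type> <entity>... [--as_tsv | --as_pandora_upload]
    tab = sub.add_parser("tabulate", parents=[common])
    tab.add_argument("entity_type")
    tab.add_argument("entity", nargs="+")
    fmt = tab.add_mutually_exclusive_group()
    fmt.add_argument("--as_tsv", action="store_true")
    fmt.add_argument("--as_pandora_upload", action="store_true")
    return parser


args = build_parser().parse_args(["tabulate", "site", "AAA", "ABB", "--as_tsv"])
print(args.command, args.entity_type, args.entity, args.as_tsv)
```

The docopt version states the same constraints in a few lines of usage text, which is a large part of its appeal here.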

jfy133 commented 4 years ago

That looks really nice! Much more what I would be familiar with!

stschiff commented 4 years ago

Looks beautiful, indeed!