qri-io / qri

you're invited to a data party!
https://qri.io
GNU General Public License v3.0

Need a documented example meta for command-line users #1165

Open dustmop opened 4 years ago

dustmop commented 4 years ago

Encountered during School of Data. Some command-line users were following along with the demo and saw the Meta Editor in Desktop, but didn't know the allowed fields. The best I could do was to show them the source file for dataset/meta.go, but it would be especially nice if there were a page or something that is linked to from the command-line. Maybe we could have a generator for meta, like qri example meta.json, or an example to display in qri help meta.

feep commented 4 years ago

https://github.com/qri-io/dataset/blob/master/meta.go

uhLeeshUh commented 4 years ago

To add to this, user Paul Crickard in Sprint R offered similar feedback:

“I don’t know if I missed it, but remember my data name [dataset title, aka "no title"] was blank. I saw where you can add a title and description in the creating a readme file section but did I miss that step in the csv part?”

“I would have liked to see, maybe after QuickStart, the options for metadata like the license, keywords, and other fields. I saw some docs that showed the earthquake sample and copied it.”

uhLeeshUh commented 4 years ago

Happy to take a stab at this one! I personally like the qri example meta.json approach so there's a delineation between help on the CLI commands vs. examples of dataset components

dustmop commented 4 years ago

This just came up for me; I am working on a dataset and it would be nice if I could generate an empty meta with all the field names but empty values. Any thoughts on:

qri get --example-meta

So I could run:

qri get --example-meta > meta.json

dustmop commented 4 years ago

Another possibility, which leaves room for future expansion of this feature:

qri get meta --example

This reminds me of https://github.com/qri-io/qri/issues/1532, since we're using flags to "imitate" part of a dataset that doesn't really exist, and then using get to retrieve some other part.

b5 commented 4 years ago

not a fan of --example-meta. To me it's a one-off solution to a general problem of default values.

From a user perspective, I'd much rather see something in the vein of invoking get without a reference & relying on a sane default.

qri get meta > meta.json

@uhLeeshUh mentioned above that part of the problem our users are running into is not having an obvious way to get meta. Even though --example-meta is clearly named, I'm concerned it's a single flag for a single question. It shows the user how to get a blank meta component, but that's it. qri get meta, on the other hand, invites the user to think about what else they can type after get to look at other default components.

To me this is clean & easy. Having qri get meta spit out a blank meta file would be useful. But, there are three responses competing for this same invocation:

On issue #1532 we've proposed:

qri get schema --body file.csv

That proposal also needs to contend with what happens when that command is invoked in an FSI-linked directory. So I think we need some formalization here, and for that I'd start with a clearer definition of what FSI linking does to commands:

an FSI link changes the default dataset reference from the empty dataset to files with special names in the current working directory

And the much larger change: I'd propose we formally define the empty dataset as a dataset with all components existing at their default ("blank") values:

{
  "qri": "ds:0",
  "commit": {
    "author": "",
    "message": "",
    "path": "",
    "qri": "cm:0",
    "signature": "",
    "timestamp": "",
    "title": ""
  },
  "meta": {
    "accessURL": "",
    "accrualPeriodicity": "",
    "citations": [],
    "contributors": [],
    "description": "",
    "downloadURL": "",
    "homeURL": "",
    "identifier": "",
    "keywords": [],
    "language": [],
    "license": "",
    "path": "",
    "qri": "md:0",
    "readmeURL": "",
    "title": "",
    "theme": [],
    "version": ""
  },
  "readme": {
    "format": "",
    "path": "",
    "qri": "rm:0",
    "scriptBytes": "",
    "scriptPath": "",
    "renderedPath": ""
  },
  "viz": {
    "format": "",
    "path": "",
    "qri": "vz:0",
    "scriptBytes": "",
    "scriptPath": "",
    "renderedPath": ""
  },
  "transform": {
    "config": "",
    "path": "",
    "qri": "tf:0",
    "resources": "",
    "scriptBytes": "",
    "scriptPath": "",
    "secrets": "",
    "syntax": "",
    "syntaxVersion": ""
  },
  "structure": {
    "checksum": "",
    "compression": "",
    "depth": 1,
    "encoding": "",
    "errCount": "",
    "entries": "",
    "format": "",
    "formatConfig": "",
    "length": "",
    "path": "",
    "qri": "st:0",
    "schema": "",
    "strict": ""
  },
  "body": [],
  "bodyBytes": "",
  "bodyPath": "",
  "name": "",
  "path": "",
  "peername": "",
  "previousPath": "",
  "profileID": "",
  "numVersions": 0
}

Internally we'd conceive of an empty get as returning the value of a newDataset() function with no arguments.

With these changes, introducing our data model would get a lot easier on the CLI side. Users could explore the qri dataset model using an empty get command:

$ qri get 

Or understand what readme does a little better with:

$ qri get readme
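A minimal sketch of that idea, in Python for brevity. None of this is qri's actual code: `EMPTY_DATASET`, `new_dataset`, and `get` are hypothetical names, and the component fields are abbreviated from the JSON above. It only illustrates how a get without a reference could fall back to a constructor for the empty dataset:

```python
import copy
import json

# Hypothetical blank components, abbreviated from the proposed empty dataset.
EMPTY_DATASET = {
    "qri": "ds:0",
    "meta": {"qri": "md:0", "title": "", "description": "", "keywords": [], "license": ""},
    "readme": {"qri": "rm:0", "format": "", "scriptPath": ""},
    "body": [],
}


def new_dataset():
    """Return a fresh copy of the empty dataset so callers cannot mutate the defaults."""
    return copy.deepcopy(EMPTY_DATASET)


def get(selector=""):
    """Resolve a get with no reference against the empty dataset.

    An empty selector returns the whole dataset; otherwise the selector
    names a top-level component such as "meta" or "readme".
    """
    ds = new_dataset()
    if not selector:
        return ds
    if selector not in ds:
        raise KeyError(f"unknown component: {selector}")
    return ds[selector]


if __name__ == "__main__":
    # Roughly what `qri get meta > meta.json` could emit under this proposal.
    print(json.dumps(get("meta"), indent=2))
```

Returning a deep copy is a design choice in this sketch: every caller gets its own blank dataset to fill in, and the defaults stay untouched.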

Lots to think about here. This would move a lot of parsing complexity into this theoretical dataset constructor, but I'd love to get your top-line thoughts on changing the default return values of get!

dustmop commented 4 years ago

Using a non-existent reference to mean "the empty dataset" runs into problems. What if we take a cue from languages like Go and use _ to mean that? So qri get _ or qri get meta _.

b5 commented 4 years ago

Can you tell me more about what you think those problems are? I know we're way off topic here, but this is a really useful conversation as we turn towards spec writing

dustmop commented 4 years ago

Sure thing, that's a good suggestion (instead of me just asserting as much and leaving it as an exercise for the reader)

Generally, I think there's value in allowing a way to represent a list of zero datasets that is distinct from a list of one dataset which is empty. In many cases, queries about one or the other will have the same answer, but this is not always true. For example, "what's the total number of rows" has the same answer in either case, but "do all my datasets have structure.format == json" does not: it is (vacuously) true for the empty list, but false for a list containing the empty dataset.

There used to be a feature for get where it could take multiple references. That feature is currently broken, but it should probably be restored. Regardless, RefSelect is written in such a way as to assume the general case of N references given to a command. It is easier to reason about commands when the number of inputs matches the number of values that are operated on, especially for external programs that may be written on top of qri (like the Python wrapper).
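The two queries can be checked with a small Python sketch. The dataset shape here is a trimmed assumption (only `structure.format` and an integer `structure.entries`), not qri's real model:

```python
# A trimmed stand-in for the empty dataset; treating "entries" as an
# integer row count is an assumption made for this illustration.
empty_dataset = {"structure": {"format": "", "entries": 0}}


def total_rows(datasets):
    """Sum the row counts across a list of datasets."""
    return sum(ds["structure"]["entries"] for ds in datasets)


def all_json(datasets):
    """True when every dataset in the list has structure.format == "json"."""
    return all(ds["structure"]["format"] == "json" for ds in datasets)


no_datasets = []             # a list of zero datasets
one_empty = [empty_dataset]  # a list of one dataset, which is empty

print(total_rows(no_datasets), total_rows(one_empty))  # 0 0
print(all_json(no_datasets), all_json(one_empty))      # True False
```

The row count cannot tell the two lists apart, but the format check can, which is why collapsing "no reference" into "the empty dataset" loses information.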

Yes, FSI is a special case that breaks this rule, but that's because it's designed for the use case where you're in a terminal with pwd set to the working directory, which strongly signifies intention for what the empty reference means. External consumers shouldn't run into this situation.

We also have the "use" functionality, which may be a bit outdated and not often talked about, but it's another case of conflict with what an empty reference is meant to signify.

In both the FSI and use situations, there may be cases where a user still wants to refer to the empty dataset, but if the empty dataset uses the same syntax as "this dataset", that's not possible to express.
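To make the conflict concrete, here is a hedged Python sketch of reference resolution with an explicit spelling for the empty dataset. `resolve`, its parameters, and the returned tuples are all hypothetical, and the precedence of the FSI link over the "use" selection is an assumption of this sketch, not documented qri behavior:

```python
EMPTY_REF = "_"  # hypothetical spelling, following the `qri get _` suggestion


def resolve(ref, fsi_linked_dir=None, use_selection=None):
    """Return a (kind, target) pair describing what a reference means.

    ref            -- the reference typed by the user, "" if none was given
    fsi_linked_dir -- path of an FSI-linked working directory, if any
    use_selection  -- dataset previously chosen with "use", if any
    """
    if ref == EMPTY_REF:
        # Always expressible, even inside an FSI-linked directory.
        return ("empty", None)
    if ref:
        return ("named", ref)
    # No reference given: fall back to context (ordering is an assumption).
    if fsi_linked_dir is not None:
        return ("fsi", fsi_linked_dir)
    if use_selection is not None:
        return ("use", use_selection)
    # Nothing to fall back on: fail loudly instead of guessing.
    raise ValueError("no dataset reference given")


print(resolve("_", fsi_linked_dir="/tmp/my_ds"))  # ('empty', None)
print(resolve("", fsi_linked_dir="/tmp/my_ds"))   # ('fsi', '/tmp/my_ds')
```

With a dedicated token, "this dataset" and "the empty dataset" never share a spelling, so both remain expressible in every context.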

I'm also just a fan of making errors explicit, so that mistakes when writing commands lead to immediate feedback, instead of the program trying to be too smart for its own good, which often leads to confusion when users don't know about the underlying magic. Treating no reference as the empty dataset reminds me of programming languages that have hashmaps where it's hard to determine if a key does not exist in that map versus it existing but with a nil value. That ends up being the source of a lot of bugs.
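Python's `dict.get` shows the same ambiguity in miniature. The `meta` dictionary and the `lookup` helper below are made up for illustration:

```python
# One key that exists with a None value, and one key that is absent.
meta = {"title": None}

print(meta.get("title"))    # None: the key exists, its value is None
print(meta.get("license"))  # None: the key does not exist at all

# A membership test is needed to tell the two cases apart.
print("title" in meta, "license" in meta)  # True False


def lookup(d, key):
    """Explicit alternative: fail immediately when the key is missing."""
    if key not in d:
        raise KeyError(f"no such field: {key}")
    return d[key]
```

The explicit `lookup` surfaces the mistake at the call site, which is the kind of immediate feedback argued for above.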

One last point is about some UX decisions we made with working directories. A dataset in FSI land does not have a commit component, and leaves out most of structure, in both cases because the user is not meant to input these values, so there's no need to edit them. In addition, FSI datasets won't display meta or a readme unless one exists. So even the "this dataset" concept for FSI has distinct behavior from the empty dataset, since I think there we want all field names to show up even though they have empty values.