qri-io / qri

you're invited to a data party!
https://qri.io
GNU General Public License v3.0

refactor(lib): Adjust `GetResponse` to be polymorphic #1758

Closed ramfox closed 3 years ago

ramfox commented 3 years ago

There are many different cases that our lib.Get method (and our /ds/get endpoint) are covering. In order for lib.Get to work over rpc, we need to adjust the GetResponse so that we can send over different types of return values using the http/json api.

type GetResponse struct {
    Res  interface{}
    Type GetResultType
}
ramfox commented 3 years ago

After some research & experimenting, I don't think just changing GetResult to be polymorphic is going to solve all our problems.

I'm proposing a few changes that should make Get & its functionality easier to grok.

taking stock of current expectations

|           | json | yaml | csv | zip | script |
| --------- | ---- | ---- | --- | --- | ------ |
| --        | y    | y    | n   | y   | n      |
| commit    | y    | y    | n   | n   | n      |
| body      | y    | y    | y   | n   | n      |
| meta      | y    | y    | n   | n   | n      |
| readme    | y    | y    | n   | n   | y      |
| stats     | y    | y    | n   | n   | n      |
| structure | y    | y    | n   | n   | n      |
| transform | y    | y    | n   | n   | y      |
| viz       | y    | y    | n   | n   | y      |
| rendered  | n    | n    | n   | n   | y      |

one-off behavior that doesn't exist uniformly over api & cli:

3 main pain points:

The... strangeness surrounding Get & our new dispatchify/rpc pattern is that whatever Get returns from the lib method is what gets returned over http & rpc, and is what is used when calling the lib method from the cmd package. This is already difficult to wrap your head around, since we are dealing with Head components vs the Body component, which are structured differently & have different formatting expectations.

These are the three main points that add even more (and dare I say, an unnecessary amount of) complexity & also don't feel like they belong over a json/post http api, nor rpc:

1) Outfile - why are we permitting folks to potentially create a file on another machine via http? The Outfile option should be isolated to the cmd layer. If someone wants a zip file, a script file, or a body file (or asks us for a way to get a component as a file in the future) via the api, they should be using the sugar endpoints.

2) Formatting - if we pull the formatting worries into the cmd layer (the json api doesn't care about formatting; we just return the interface & rely on NewHTTPRequestHandler to do the marshaling work for us), all we need to keep track of is the body format, since we have a special csv case for the body. If the body returns as a slice of bytes, we can print those straight to the terminal. Otherwise it's structured data that we can marshal into whatever format we want.

3) Returning files over our json api - I think it's pretty weird to have all the script & zip options allowed over the json api. I'm also not sure we want this to exist over rpc, although it would work fine since we are encoding/decoding on both sides of the wire. If we do want it over rpc, then, well, I'll just suck up this weirdness: if something pings curl -X POST -d '{"ref":"username/dataset","format":"zip"}' localhost:2503/ds/get, it's going to get back... {"data":"SOME_BASE_64_ENCODED_STRING_OF_THE_ZIP_OF_THE_WHOLE_DATASET==","meta":{...meta}}. This seems super bizarre to me. (Secretly this is what already happens if you POST to get a dataset body w/ format csv, but since we definitely want to have body over rpc, I'm willing to be okay with that strangeness, especially since over http the default will be json, so a user would have to go out of their way to see it.)

I'm proposing we split the zip and scriptFile functionality into their own dataset methods that have denyRPC. These will have corresponding sugar handlers that set the correct content headers & file names, using the same patterns they currently have over the api. The way to access this functionality via the cli will also not change.

Proposed changes:

type GetParams struct {
    Ref string
    Selector string
    Limit int
    Offset int
    All bool
}

type GetResult struct {
    Value interface{}
    Type GetResultType
}

type GetResultType int

const (
    Head GetResultType = iota
    BodyStruct
    BodyBytes
)

type GetScriptFile struct {
    Ref      string
    Selector string
}

type GetArchiveFile struct {
    Ref string
}

bugs:

Take a look at ramfox/refactor_get to see selecting a component/field w/ formats json, pretty, & yaml for head, and csv for body, working over json/post http, cli, & rpc! The branch is very messy, doesn't look at the sugar endpoints at all, and doesn't implement any strong typing, but it does prove that we can return different formats of body & all different iterations of the other components (including selecting into fields) over rpc, http, & cli.

Arqu commented 3 years ago

This is a lot, but a GREAT breakdown!

Hard to get at all the points at once but can give some input on a few pieces:

ramfox commented 3 years ago

Thank you for reading!!! To your first and second points, my proposal (separating zip & file outputs) would only be in the case where we are okay with not having that functionality over rpc. So yea, it wouldn't make sense to separate these into different lib functions unless that's a functionality we are okay with.

Why do I favor removing access to something like getting a zip or an entire file from rpc? Mostly because the process to get it from the daemon to the terminal process offends me, haha. For this to work over rpc, we get a slice of bytes from the file at the lib.Get level. Our POST/JSON api encodes the whole file as a base64 encoded string (the default behavior of json.Marshal) and sends the zip file as that encoded string over http w/ content type json. The terminal process that launched the request then unmarshals the response, and the base64 encoded string turns back into a slice of bytes. We can then output those bytes into an outfile or print them to the terminal.

This part, although strange because of the encoding and decoding, doesn't bother me. The thing that bothers me is that we will be supporting a nonsensical (to anyone who doesn't understand the need for our dispatch process) endpoint that returns an entire zip or html file as a base64 encoded string.

TBH the more I protest against it the less stressed I am about it. There is just something fishy in my gut when I examine the process, and it felt like it made more sense to just not allow such a weird endpoint (application/JSON response of an encoded zip or html file) rather than have it, but only as sugar, where a user can GET it and it would return in a format that makes sense. Which would mean disabling the behavior over rpc.

Way less offended now about it than yesterday, when the idea of having to document the endpoint was super bizarre to me.

b5 commented 3 years ago

This is great. I'm not going to catch it all, but some initial reactions:

I very much agree with your gut feeling that non-JSON or JSON-wrapped-bytes responses are a bad idea, and I agree that adding polymorphism support doesn't solve that core problem. The more I look at it, the more it seems adding polymorphism would only tangle things up more.

So to that end, I buy a lot of your proposed solutions, but within the framing of thickening "client" code to provide sugar around a core lib.Get method. In practice this means adding code to api and cmd that create end user features from one or more calls to lib.Get. You've suggested something very close to this, but instead of adding methods, define helper functions in lib that wrap an Instance and call the Get method to construct a response, then use those helpers in api and cmd to construct end-user responses.

package lib

// GetZip fetches an entire dataset as a zip archive
func GetZip(ctx context.Context, inst *Instance, refstr string) ([]byte, error) {
 res, cursor, err := inst.Dataset().Get(ctx, &GetParams{Ref: refstr, All: true})
 // compress result, return bytes
}
package lib

// GetCSV fetches dataset body data in csv format, passing a limit of -1 gets all rows
func GetCSV(ctx context.Context, inst *Instance, refstr string, offset, limit int) ([]byte, error) {
 // fetch body rows via inst.Dataset().Get, encode them as csv, return bytes
}

We really need cursors to work for this to be clean & easy, but the helper functions will isolate this messy code & keep it in lib.

ramfox commented 3 years ago

We can remove the idea of having to support rendered here. But we still need to support getting back the script file of a viz, readme, or transform. None of which are "rendered", but are persisted data, so we still need some kind of GetScriptFile helper function, and this should still come from the /get endpoint and qri get cmd.

Re: helper methods. Sounds good!

Also sounds like moving outfile to just cmd isn't controversial, nor is moving formatting responsibilities to cmd. Basically, lib.Get deals with interface{} responses, and anything else is relegated to a helper function, that is called by cmd or a sugar handler.

b5 commented 3 years ago

But we still need to support getting back the script file of a viz, readme, or transform. None of which are "rendered", but are persisted data, so we still need some kind of GetScriptFile helper function, and this should still come from the /get endpoint and qri get cmd.

I'd rather see those handled with a script selector than a helper function, and kept within the main lib.Instance.Dataset().Get method (basically, leave this as it is). Scripts will generally be small, so to me it makes sense to do these as base64 bytes. I don't think we need the helper function, but you might be catching something I'm missing.

Also sounds like moving outfile to just cmd isn't controversial

yes totally. I very much agree with your sentiment that outfile belongs in cmd.

ramfox commented 3 years ago

No, you're right, it's totally fine without a helper func. And fully agree about those being a-okay as base64 encoded strings.

ramfox commented 3 years ago

After working on this here is where we stand:

Calling lib.Get will in almost all cases return structured data.

There are three special cases that return a slice of bytes: 1) asking for a zip of the dataset 2) asking for the script file 3) asking for a csv body

Because of the way rpc/http marshals and unmarshals the content over the wire, we still need two separate fields: one for any interface{} (structured) return type and one for any []byte return type. (json.Marshal converts byte slices to base64 strings; in dispatch, when we get the GetResult back from rpc, we unmarshal that json into a GetResult. If the field we unmarshal into is just interface{}, the decoder assumes the response is a string (because nothing specifies it should be a byte slice), and that causes side effects because the response is not what is expected.)

Since we don't want to separate out these cases into their own methods (yyyy??), we then need to get more organized.

After spending time with the inputs and organization (and now that it's possible to return the body as structured data rather than a slice of bytes, and we've moved some of the formatting & file saving out of get), it's clearer how to streamline the GetParams. Currently, there are two strange interactions.

1) The Format field is now only used when byte data is going to be returned; Format == "zip" and Format == "csv" are the only values with meaning.

2) We use readme.script or viz.script to indicate in the cli that the user wants the script bytes of that component. There are multiple places (3, I believe) where we end up parsing this string and checking if we are looking for a script. To me this means something organizational is missing.

I'm proposing these be the GetParams

type GetParams struct {
    Ref      string
    Selector string

    SpecialCase GetSpecialCase // ZipType, CSVType, ScriptType; empty indicates normal

    Limit  int
    Offset int
    All    bool
}

It needs a better name. We could also skip the strong typing & use strings, or just have three flags, Zip, CSV, Script, that error if more than one is set.

These flag overrides would very clearly let the dev know something different is happening and organize what is expected in and out of the function. All of them return byte slices, all of them have the potential to override other fields. (Zip & CSV override Selector). They have special behavior that isn't "typical" of Get. There would be a clear pattern for if we need to add any other special cases.
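The "three flags that error if more than one is set" alternative could be validated in one place; this is a sketch with illustrative field names mirroring the proposal, not qri's actual GetParams:

```go
package main

import (
	"errors"
	"fmt"
)

// GetParams sketches the three-flag variant of the proposal above;
// field names are illustrative, not qri's actual struct.
type GetParams struct {
	Ref      string
	Selector string
	Zip      bool
	CSV      bool
	Script   bool
}

// Validate errors if more than one special-case flag is set, making
// the mutual exclusivity explicit at the input boundary.
func (p *GetParams) Validate() error {
	set := 0
	for _, f := range []bool{p.Zip, p.CSV, p.Script} {
		if f {
			set++
		}
	}
	if set > 1 {
		return errors.New("only one of zip, csv, or script may be set")
	}
	return nil
}

func main() {
	p := &GetParams{Ref: "username/dataset", Zip: true, CSV: true}
	fmt.Println(p.Validate()) // only one of zip, csv, or script may be set
}
```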

This brings us back full circle to our first discussion when refactoring. Rather than labeling & typing the return value, we instead type the input & rely on the input params to know what the expected return is.

I'm still for separating out GetZip, GetScript, GetCSV as helper functions that call inst.Dataset().Get() and having cmd & the sugar handlers call those directly.

b5 commented 3 years ago

Before we "get" started (lawl puns! in 2021!), I'd like to reiterate that the reason we don't want to split up get comes from early users adopting export faster than get, and missing out on the power of get in the process. Get needs to be really powerful. It also needs to be singular. When there was more than one way to fetch data from qri, we had users going to different commands asking for things provided by the other command: "Export can't give me JSON!?!?!?" and "can I just 'get' a CSV file!?!?".

Other projects might blame users for not reading the manual. I think we should blame bad developer experience. Right now we're doing the hard work of building a unified data fetching interface that supports many different use cases. Splitting it up detracts from where we're trying to go, even if it's "just" at the lib level.

Moving on to the concrete parts:

  1. We should keep SpecialCase as Format, and instead of thinking about it as "only sometimes in use", write it as always being used, but with a sensible JSON default.
  2. I don't think the "script responses" deserve the same treatment as the zip and csv format responses. Instead we should acknowledge that base64-encoded strings are valid JSON, and force clients to deal with it, or just convert scripts to raw strings. The important part is to think of script-selector responses as JSON strings.
  3. We should think about making the return type of Get interface{} to force users to write client side code or use lib helper function(s).

type GetParams struct {
    Ref string
    Selector string

    Format  string // output format. one of "json" | "zip" | "csv", default is "json"

    Limit int
    Offset int
    All bool
}
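Point 1, Format always in use with a sensible json default, could be normalized at the input boundary; `normalizeFormat` is a hypothetical helper name, not an existing qri function:

```go
package main

import "fmt"

// normalizeFormat applies the "always in use, json by default" rule:
// an empty Format becomes "json", unknown values are rejected.
func normalizeFormat(format string) (string, error) {
	switch format {
	case "":
		return "json", nil
	case "json", "zip", "csv":
		return format, nil
	default:
		return "", fmt.Errorf("unsupported format: %q", format)
	}
}

func main() {
	f, _ := normalizeFormat("")
	fmt.Println(f) // json
}
```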
ramfox commented 3 years ago

People got confused because we surfaced multiple ways to Get things to the user. We can have separate methods for the underlying functionality & have one way for users to access it. All of the API can stay the same: localhost:2503/get/username/dataset/body.csv can still exist, but be handled by an api handler that calls inst.Dataset().GetCSV, and qri get me/dataset --format zip can be routed to inst.Dataset().GetZip in cmd/get.go. We do this for other commands. But again, this is not the hill I'm dying on, lol, I've made my peace.

1) The thing is, we aren't returning json. We are returning go-native structures in all the straightforward cases, and a slice of bytes in all the special ones. We marshal to json (or yaml, or prettified json) at the cmd, api & rpc levels, but not at lib.

2) Script responses are valid json. This wouldn't change that; it would just help us have clearer code. Also, readme.script has no meaning at the lib level: .Script is not a valid field on the component. It's a weird special case, and we should handle it as such.

3) Because of RPC & json marshaling we can't do this. We need both interface{} and byte slice options. For RPC, json unmarshal will not properly decode a base64 encoded string if we have it unmarshaling into an interface{}, because it will just assume the base64 encoded string is a plain string & unmarshal to that. We need the response type to indicate a slice of bytes. We could also add multiple checks & decode the string ourselves, but that seems messy to me. This is part of why I think the functions should be separate, but again, lol, not dying on this hill. We just still need the return type to be:

type GetResult struct {
    Value interface{}
    Bytes []byte
}
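Given that two-field shape, the cmd layer (or a sugar handler) can branch on which field is populated; this is a minimal sketch, with `render` as a hypothetical helper name:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GetResult follows the two-field shape discussed above: structured
// results land in Value, the byte-returning special cases in Bytes.
type GetResult struct {
	Value interface{}
	Bytes []byte
}

// render sketches cmd-layer formatting: byte results (zip, script, csv
// body) pass through raw, structured results marshal to indented json.
func render(res *GetResult) (string, error) {
	if res.Bytes != nil {
		return string(res.Bytes), nil
	}
	out, err := json.MarshalIndent(res.Value, "", "  ")
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	csv, _ := render(&GetResult{Bytes: []byte("a,b\n1,2\n")})
	fmt.Print(csv) // prints the csv rows verbatim

	head, _ := render(&GetResult{Value: map[string]string{"title": "x"}})
	fmt.Println(head)
}
```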