qri-io / 2017-frontend

qri electron & web frontend
https://qri.io
MIT License

running save on a csv file with structure.formatOptions.headerRow: true causes the editor to eat a row on each save #436

Open b5 opened 5 years ago

b5 commented 5 years ago

kudos to @rgardaphe for uncovering this one. We've laid some groundwork for fixing this on the backend. The solution we've discussed internally is to make sure we're sending inlined, JSON body data in the POST request.

ramfox commented 5 years ago

Okay, so working on this now.

First, I wanted to confirm that this row-eating issue wasn't happening on the command line as well:

using a body.csv:

String, Int
one, 1
two, 2
three, 3
four, 4

qri save --body body.csv me/test

running the command immediately again gets a no changes detected error, as we expect

On the frontend, since we (currently) have to have a structure in order to specify that we want to save as csv, I want to test that having a structure is not what is causing the issue:

I exported the dataset, and used the dataset.yaml file to create a new version of the dataset:

qri save --file dataset.yaml me/test
qri save -f dataset.yaml --body body.csv me/test

Both correctly returned no changes detected.

Now trying the same tests using the api, and the curl -F flag:

curl -X POST -F 'body=@body.csv' http://localhost:2503/save?name=new_test

Gets created correctly.

qri log me/new_test
Jan 14 21:36:47 - /ipfs/QmdQB2gumtZfBncXcuY5KfSzvfudjmRPcn93inujvPNpNo
    created dataset

and rerunning the command errors with "no changes detected"

curl -X POST -F 'file=@dataset.yaml' -F 'body=@body.csv' http://localhost:2503/save?name=new_test

also has no changes detected

Basically, I spent a bunch of time confirming that our backend code and our API are not the issue, and that just sending as json (although I'm a fan of this, not knocking it) isn't going to solve the problem!

The issue is the way we display/save data in the editor: when the 'header' flag is set, we don't display the header row in the editor and we don't keep the header row data with the body data. The editor then sends the data without the header row, so we are sending a truncated body for the backend to save!
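A minimal TypeScript sketch of that feedback loop (the helper names here are hypothetical, not the actual editor code) shows how one row disappears per save:

```typescript
// Hypothetical sketch of the row-eating loop described above.
// With headerRow: true the backend consumes the first line as the header,
// but the editor round-trips only the remaining rows, so every save
// re-submits a body whose first *data* row then gets treated as the header.

function backendParse(csv: string): { header: string; rows: string[] } {
  const lines = csv.trim().split("\n");
  return { header: lines[0], rows: lines.slice(1) };
}

function editorBody(parsed: { header: string; rows: string[] }): string {
  // The buggy editor only ever sees (and sends) the non-header rows.
  return parsed.rows.join("\n");
}

let csv = "String,Int\none,1\ntwo,2\nthree,3\nfour,4";
for (let save = 0; save < 2; save++) {
  csv = editorBody(backendParse(csv)); // each save eats one row
}
console.log(csv); // "two,2\nthree,3\nfour,4": two rows lost after two saves
```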

ramfox commented 5 years ago

The first is a frontend issue; fixing it will end the major bug! The table view (handsontable editor) does not display the header row if the schema is not detailed (aka, no "title" in the column details), even if the structure indicates that the dataset has a header row. Also, even if a schema is provided with enough detail to let us know that there is a header row, in no case is the header row ever shown in the editor.

The second issue is that if a schema is given but is not specific, a more specific schema is never created to preserve the header row details when the first body is saved over.

The third is that the frontend does not have a way to get the header row from the body, since the /body endpoint (and the qri body cmd) do not return the header row.

To walk through the issue and how the problems interact:

Let's say we are on the frontend and want to save. If there is a schema provided, but it is one that does not specify the header row, e.g.:

{"type":"array"}

(which is the default on the frontend), that schema is enough to block schema generation on the backend.

The dataset is saved with the header row, the too-general schema is saved. We will actually be okay from now on, as long as we don't use the frontend.

But let's say we are using the frontend. When we "get" the body on the frontend, we get all the "non-header" rows. The frontend gets all the header row info from the schema. But, if the schema doesn't provide any "title" info for each column, we don't ever display the header row on the dataset page.

Then we move to the editor. The editor gets the body info, and no header row is shown or given, because the body "get" didn't include it. We save. The data is one row shorter, because the header row was never sent to the backend.

When we "get" the data again, since the "get" functionality does not send over the header row, the backend sends us a dataset with one less row. And so on!

Here is what I know: the frontend and the editor need to show that there is a header row specified, even if the schema does not have it.

The underlying, trickier issue is that if the get command doesn't return the header row and we don't have a good schema, we will eventually write over the header row, even once the display/editor problems are fixed.

Follow me for a moment.

We fix the bug where header rows aren't displayed on the frontend in the editor. We save a body that has a header row, but the schema we provide is basic and does not give column names. Our body saves correctly, e.g. let's say it's:

String, Int
one, 1
two, 2
three, 3

And all 4 rows are saved. We are okay for now.

But let's say we want to make more changes on the frontend. When we "get" the body on the frontend, the header row is not provided:

one, 1
two, 2
three, 3

Since in this hypothetical world we have fixed the above bug that prevented a header row in the editor, we have a header row, but it's empty, since the schema does not provide any header row info:

one, 1
two, 2
three, 3

Even though the user hasn't made any explicit changes to the dataset, they will still be able to save, since it is different from the last version (the header row was blanked out). When we save this now, we send the data:

_, _
one, 1
two, 2
three, 3

We have unintentionally changed the underlying data, which feels like a big no no.

A weak schema can cause changes to the underlying data. This shouldn't happen. It can be fixed either by forcing a schema to be derived whenever we are updating or converting to a csv; the more detailed schema always wins, regardless of whether it is user submitted, and equally detailed schemas defer to the user.

Or, when we "get" data (at least for the frontend) we send back the whole dataset and we don't remove the dataset header. The frontend knows to display the header differently by knowing if HeaderRow is set to true. If there is a schema, and it provides a "title" section for each column, then we display the specified titles for each column, but when we send data to the backend for saving, we send the original titles. The new titles are preserved in the schema. This ensures the underlying data never changes.
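The second option could be sketched like this (a TypeScript sketch with hypothetical shapes and names, not the real qri frontend): display titles come from the schema when present, but the original header is always what gets re-attached on save, so the body bytes never change:

```typescript
// Hypothetical sketch: display schema titles, but preserve the original
// header line when posting back to /save.

interface ColumnSpec { title?: string }

function displayHeader(originalHeader: string[], schemaCols: ColumnSpec[]): string[] {
  // Prefer schema titles for display; fall back to the original header cells.
  return originalHeader.map((cell, i) => schemaCols[i]?.title ?? cell);
}

function bodyToSave(originalHeader: string[], editedRows: string[][]): string[][] {
  // Always re-attach the untouched original header before POSTing to /save.
  return [originalHeader, ...editedRows];
}

const header = ["String", "Int"];
const schema: ColumnSpec[] = [{ title: "word" }, {}];
console.log(displayHeader(header, schema)); // ["word", "Int"]
console.log(bodyToSave(header, [["one", "1"]])); // [["String","Int"],["one","1"]]
```

The new titles live only in the schema, which is exactly the property ramfox describes: renaming a column edits metadata, never data.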

ramfox commented 5 years ago

Trying to force the frontend to respect leaving space if there is a header row is causing problems. Brain is fried; going to pick this back up tomorrow.

dustmop commented 5 years ago

The first is a frontend issue; fixing it will end the major bug! The table view (handsontable editor) does not display the header row if the schema is not detailed (aka, no "title" in the column details), even if the structure indicates that the dataset has a header row. Also, even if a schema is provided with enough detail to let us know that there is a header row, in no case is the header row ever shown in the editor.

Whether the header row is displayed or not is orthogonal, it's simply a UX choice. If the frontend knows that the header row flag is set, it doesn't necessarily need to display that, nor the specific header row, it could just prepend the header row after retrieving the editor's body and before posting to /save.

The third is that the frontend does not have a way to get the header row from the body, since the /body end point (and the qri body cmd) do not return the header row.

I think it's a good thing that the /body endpoint and qri body don't send the header row in their response; the header row is less like actual body stuff, and more like metadata structure that describes the format of the body (csv is weird like this). That is, it's a feature not a bug :D

But let's say we are using the frontend. When we "get" the body on the frontend, we get all the "non-header" rows. The frontend gets all the header row info from the schema. But, if the schema doesn't provide any "title" info for each column, we don't ever display the header row on the dataset page.

Trying this myself, I was unable to get into the state described here. I created a csv based dataset using the command-line:

qri save --body=names.csv me/names

Where names.csv has a header row, looking like this:

Person,Age,Height
Joe,26,181
Tom,42,174
Carl,38,170

When I look at this in the frontend, I see those three non-header rows in the body, and the Structure page shows:

{
  headerRow: true
  lazyQuotes: true
}

and the Schema is:

{
  items: {
    items:
    [{
      title: person
      type: string
    }, {
      title: age
      type: integer
    }, {
      title: height
      type: integer
    }]
    type: array
  }
  type: array
}

Correct me if I'm wrong, but couldn't we use this to generate the header row, prepend it to the body, and then pass that whole amalgam down to /save? Of course if the schema isn't given (or is a minimal "{type: array}") it's not possible, but to me that sounds like another bug.

This is all assuming we punt on the "json everywhere" idea for now, which I think we agree is too much at the moment.
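The prepend idea could look roughly like this (a TypeScript sketch with a hypothetical helper name; the schema shape follows the Structure-page example above):

```typescript
// Hypothetical sketch: derive the header row from the schema's column
// titles and prepend it to the editor's body before posting to /save.

interface Schema {
  type: string;
  items?: { items?: { title?: string }[] };
}

function prependHeaderFromSchema(schema: Schema, bodyCsv: string): string {
  const cols = schema.items?.items;
  if (!cols || cols.some((c) => c.title === undefined)) {
    // Minimal schemas like {"type":"array"} carry no titles: nothing to prepend.
    return bodyCsv;
  }
  return cols.map((c) => c.title ?? "").join(",") + "\n" + bodyCsv;
}

const schema: Schema = {
  type: "array",
  items: { items: [{ title: "person" }, { title: "age" }, { title: "height" }] },
};
console.log(prependHeaderFromSchema(schema, "Joe,26,181\nTom,42,174"));
// person,age,height
// Joe,26,181
// Tom,42,174
```

As noted, this only works when the schema carries titles; with a minimal schema there is nothing to generate the header from.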

ramfox commented 5 years ago

Correct me if I'm wrong, but couldn't we use this to generate the header row, prepend it to the body, and then pass that whole amalgam down to /save? Of course if the schema isn't given (or is a minimal "{type: array}") it's not possible, but to me that sounds like another bug.

Yes, this is basically the crux of the issue. In all the cases I outlined, if we have a good schema, we are fine. If we don't have a good schema, then there is weird behavior.

Unfortunately, the frontend was by default sending a not-very-good schema, and there was no way to delete the schema from the frontend. This is fixed, but currently there are errors if you are saving in csv and you supply a structure but don't supply a schema. Which makes no sense, because we can derive a schema. We just need to fix that on the backend before all of this works smoothly.

In good news, while working on this I refactored our editor state tree to make some changes which I think make changes to the editor a little more intuitive, hopefully making it easier to modify in the future.

I think it's a good thing that the /body endpoint and qri body don't send the header row in their response; the header row is less like actual body stuff, and more like metadata structure that describes the format of the body (csv is weird like this). That is, it's a feature not a bug :D

Yeah, definitely not a bug on its own, but this + a bad schema + headerRow: true causes the row-eating problem. I listed it as one of the points that adds up to the bug, not as a bug itself.

Whether the header row is displayed or not is orthogonal, it's simply a UX choice. If the frontend knows that the header row flag is set, it doesn't necessarily need to display that, nor the specific header row, it could just prepend the header row after retrieving the editor's body and before posting to /save.

Totally, and this is fine if there is a good schema. But we weren't handling the case where the schema was bad before (now we are, which is what is bandaiding over the major row-eating bug). I'm just trying to point out that if there is a bad schema and we save a second time, even if we don't change anything else, it will save because the header row is now different (blank), which is a less breaking, more insidious bug.

dustmop commented 5 years ago

unfortunately, the frontend was by default sending a not very good schema and there was no way to delete the schema from the frontend. This is fixed, but currently, there are errors if you are saving in csv and you supply a structure but don't supply a schema. Which makes no sense cause we can derive a schema.

Just trying to point out that if there is a bad schema, and we save a second time...

I don't understand how we end up in this situation (a bad schema) on the frontend. Is this what happens when a user first creates a csv dataset? Or does it require the user manually erasing the schema using the editor? If it's the former, then yeah, it seems like we should ignore or disallow the frontend's bad default structure because it's conflicting with the headerRow.

yes this is basically the crux of the issue. In all the cases I outlined, if we have a good schema, we are fine.

But this isn't happening yet, right? You're just saying the fix is possible, by generating the headerRow from the schema at save time, like described above, yes?

Just need to fix that on the backend, before all of this works smoothly.

What is the specific backend fix? That the headerRow-derived schema should override the frontend's incomplete schema?

ramfox commented 5 years ago

What is the specific backend fix?

Right now, if you have any piece of the structure, like FormatConfig.HeaderRow or even just Format (to specify csv or json), and you have no schema, the backend will not create a schema for you and will error with csv data format requires a schema.

However, if we do not pass any structure, as we normally do on the command line, the whole structure will get generated for you, including the schema.

(If you use qri save --file dataset.yaml --body body.csv and only fill out one part of the structure (for example, format: csv), it will error with invalid dataset: structure: csv data format requires a schema; that is how you can replicate the error on the command line.)

In the frontend, we've tied the user's choice of format (and other things that really only affect the way things are displayed in the frontend) up with the structure. So if the user chooses any of these options, a structure is created and sent over, along with a default (super general) schema. You save the dataset, and if you've sent over the option HeaderRow: true and you have a bad schema, the frontend was dealing with it improperly.

But, if you remove the schema on the frontend, the user would logically assume that the backend could create a schema for them (the way it would if you had not supplied any structure). But instead it errors with invalid dataset: structure: csv data format requires a schema.

So we create a weird situation for the user.

If they want to specify ANY structure information, they are now required to learn how to write a JSON schema, because if they don't write a good one there might be some funky behavior. Otherwise they could skip specifying structure information on the frontend and get a nice detailed schema, but we rely on that structure for formatting right now, so they can't really do that.

I have a few ideas and thoughts about what to do about separating the "formatting" choices on the frontend with the structure choices on the backend, which I've already started implementing.

First, I've removed the idea that csv == table and json == code view. I'm toying with the idea of removing the option to specify csv or json at all, and just always submit as json. Also getting us one step closer to what this issue originally describes: let's just always talk in json. But that raises some issues about schema generation, since we only really get cool schema generation with csv. This might be totally fine for now.

Second, the header row of a table is not the first row of the table, the way it was previously in the editor. The header row is the column title row, which can't be edited the same way that the other rows can. This specifies to the user that the column row is special, because it is actually part of the schema, not the body. If you want to change the column row, you need to edit the schema on the structure page, not the body on the body page.

The third thing I want, but haven't finished yet, is to create a basic schema when you have "column row" enabled. Part of why I haven't done this yet is that I'm not totally sure of the best way. The easiest way that I can see: since the header row option is only available when the data is tabular, and you can't be on the tabular data page unless the top enclosure of your dataset is actually an array, you can create:

{
  "type":"array",
  "items": {
    "type":"array",
    "items": [
      {"title":""},
      {"title":""},
      {"title":""},
      etc, for the number of columns in the array
    ]
  }
}

This would have the effect of a column row with the display names "A", "B", "C"... etc, but not have any title information saved in the schema.
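As a sketch (hypothetical function name, assuming only the column count is known), the generator for that default schema could be:

```typescript
// Hypothetical sketch of the proposed default-schema generator: given the
// column count of the tabular body, emit a minimal array-of-arrays schema
// with empty titles, which the table view can render as A, B, C, ...

function makeDefaultSchema(numColumns: number) {
  return {
    type: "array",
    items: {
      type: "array",
      items: Array.from({ length: numColumns }, () => ({ title: "" })),
    },
  };
}

console.log(JSON.stringify(makeDefaultSchema(3)));
// {"type":"array","items":{"type":"array","items":[{"title":""},{"title":""},{"title":""}]}}
```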

They can then go to the schema page to edit this schema, OR (and this is what I would really want) we add a function that lets you change the column names from the body page (by right clicking and choosing "rename column"), which then changes the schema. That way the user doesn't need to know how to navigate a JSON schema (but can still see it when they go to the schema page).

This would help relieve the issue that sending over json means you basically don't get schema generation.

Alternatively (and long term this is probably better, but I don't think it makes sense with our current resources), we add another endpoint, /schema, that takes a body and sends back the schema. This is the reverse of validate: in validate, we send a schema and body and see if the body fits; in schema, we send a body and get back a schema that would make that body valid. We only send it when the data is parsable, and the schema would get updated as you go. You would send over any column name information as an input, or maybe in the body, so the schema would still include the proper user-specified information. This would sort of require that schema generation get better at reading json data, unless we send over the data as csv (if it's tabular).
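A rough sketch of what such a /schema endpoint's core inference might do (hypothetical helpers; a real implementation would handle many more cases):

```typescript
// Hypothetical sketch: infer a JSON schema from parsed tabular body rows,
// the reverse of validate.

function inferColumnType(values: string[]): string {
  if (values.every((v) => /^-?\d+$/.test(v.trim()))) return "integer";
  if (values.every((v) => !isNaN(Number(v)))) return "number";
  return "string";
}

function inferSchema(header: string[], rows: string[][]) {
  return {
    type: "array",
    items: {
      type: "array",
      items: header.map((title, i) => ({
        title,
        type: inferColumnType(rows.map((r) => r[i])),
      })),
    },
  };
}

const inferred = inferSchema(
  ["person", "age"],
  [["Joe", "26"], ["Tom", "42"]],
);
console.log(JSON.stringify(inferred.items.items));
// [{"title":"person","type":"string"},{"title":"age","type":"integer"}]
```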


But, as long as we make the backend fix to generate a schema if one was not specified (rather than just erroring that no schema was specified), then we are actually good on the row-eating issue, if not on the greater issue at large (sending json only, and good/easy-for-user schema generation).

Depending on what we are into for schema generation (just frontend, or using a backend endpoint), I can write an RFC. Or even two: one detailing what we want to do short term and one detailing what we want to do long term.

dustmop commented 5 years ago

Right now, if you have any piece of the structure, so like FormatConfig.HeaderRow or even just Format (to specify csv or json), and you have no schema, the backend will not create a schema for you and will error because csv data format requires a schema

However, if we do not pass any structure, as we normally do on the command line, the whole structure will get generated for you, including the schema.

I was going to push back on this, because as per value cascade we shouldn't have automatically derived information override user-specified data, but the situation is more complicated. With regards to Dataset.yaml / Dataset.Structure / Dataset.Structure.Schema it's not clear exactly where the line is drawn. As you note, we can have a full Dataset definition minus the Structure, and the backend will automatically derive the Schema, yet we don't do the same for a full Dataset definition minus Schema. The distinction is arbitrary.

I'll try out a PR to expand base.InferValues to fix this.