okfn / opendataeditor

The Open Data Editor (ODE) is a no-code application to explore, validate and publish data in a simple way. Forever free and open source project powered by the Frictionless Framework.
http://opendataeditor.okfn.org
MIT License
183 stars 23 forks source link

Problem when opening a file originally in xls in csv format from Google Sheets - Discussion/Research? #422

Closed romicolman closed 3 months ago

romicolman commented 5 months ago

Problem description

If you import a dataset in xls in Google Sheets with any kind of problem that the ODE would detect as an error and then you download the dataset in csv to open it in the ODE, the app will detect different issues depending on the format.

Steps to reproduce it

Comments

When exporting the file in xls, the ODE will only detect blank labels for columns.

Captura de pantalla 2024-06-11 a la(s) 10 46 13 a  m

When exporting the file in csv, the ODE will inform the user about problems in the type of data in columns.

Captura de pantalla 2024-06-11 a la(s) 10 45 57 a  m

Is there any explanation for this or is it a bug?

guergana commented 4 months ago

Hello, @romicolman I can reproduce the issue more or less. This is what i get. The errors are the same in the report but I don't get those extra empty columns on the right like in your screenshot.

Image

CSV test

I know why the error is happening. Exactly this column is a decimal number separated with a comma. Since the csv by definition uses the comma to separate the different columns, Google sheets exports the comma separated values as a string (you can see in the csv that the numbers in this column are wrapped with "", example "3,04" for the first column). Ideally we would have the data exported as value,value,3.04 with the decimal separated by a dot, but we can't control much how users set up formatting in google sheets so we have to find a solution to this exact issue.

Frictionless is detecting a field with this format "number,number" as a geojson column type for some reason, when it should be detected as a string because the format in the csv is a string originally. Ideally this case should be detected and imported as a number... or even better, i think at some point the user should be asked what the schema type of the column is when importing these ambiguous types.

To make the error go away for now you should set the columns schema field value to string, (at the start it is set to geojson):

Image

But even this is not exactly what we want because this is exported as a string by google sheets but we should have the option to read it as a number and currently there is also an error in ODE if you change the schema type to number. You found a very particular case, lol.

I don't think the report message is particularly helpful for finding out what the problem is. Probably many non tech users wouldn't be able to figure out why the column is creating an error so I think this is a good opportunity as well to suggest more detailed error messages for type errors.

To think of a solution to this issue we need to know why it was designed like this. Why is this specific format imported as geojson? I think @roll can help us with that.

@pdelboca what are your suggestions for an issue like this?

romicolman commented 4 months ago

Hello, @romicolman I can reproduce the issue more or less. This is what i get. The errors are the same in the report but I don't get those extra empty columns on the right like in your screenshot.

Image

I know why the error is happening. Exactly this column is a decimal number separated with a comma. Since the csv by definition uses the comma to separate the different columns, Google sheets exports the comma separated values as a string (you can see in the csv that the numbers in this column are wrapped with "", example "3,04" for the first column). Ideally we would have the data exported as value,value,3.04 with the decimal separated by a dot, but we can't control much how users set up formatting in google sheets so we have to find a solution to this exact issue.

Frictionless is detecting a field with this format "number,number" as a geojson column type for some reason, when it should be detected as a string because the format in the csv is a string originally. Ideally this case should be detected and imported as a number... or even better, i think at some point the user should be asked what the schema type of the column is when importing these ambiguous types.

To make the error go away for now you should set the columns schema field value to string, (at the start it is set to geojson):

Image

But even this is not exactly what we want because this is exported as a string by google sheets but we should have the option to read it as a number and currently there is also an error in ODE if you change the schema type to number. You found a very particular case, lol.

I don't think the report message is particularly helpful for finding out what the problem is. Probably many non tech users wouldn't be able to figure out why the column is creating an error so I think this is a good opportunity as well to suggest more detailed error messages for type errors.

To think of a solution to this issue we need to know why it was designed like this. Why is this specific format imported as geojson? I think @roll can help us with that.

@pdelboca what are your suggestions for an issue like this?

Hi @guergana! Let's wait for @roll 's comments, but in the meantime I think we need to I'll move this to sprint 6 so I can document it properly for the user guide if needed.

guergana commented 4 months ago

@romicolman As suspected the data package standard format for geopoint is "lat, long" https://datapackage.org/standard/table-schema/#geopoint so this is not an issue per se with the app, we need to figure out how to make clear that there is this compatibility issue between google sheets when exporting csvs or find a way to mitigate it.

guergana commented 4 months ago

@romicolman I managed to see the problem exporting in .xls now with the extra columns. I had not tested that part. My bad. 😬 Only the csv part is finished.

XLS test

I have discovered the issue in the .xls file:

When you download the file, if you scroll to the right you can see there is some formatting underlined in some of the columns to the right (refer to the screenshot). It seems that to preserve that formatting, the .xls is then exported with those fields marked as empty, so that you can see these lines if you open the document again. I don't believe this is a problem with the ODE, this is an issue with the exported file. In any case we could try to ignore all empty columns on importing to avoid cases like this, but I wouldn't see this is a priority for the release because the file already comes with this issue.

Image

To fix the issue:

The empty columns should be gone, and also the errors when importing the file into ODE.

I see this more as something to make clear in the quality that is expected of the files provided by the users, something to be mentioned in the User Guide, than as a technical problem on our side.

romicolman commented 4 months ago

Hi @guergana! I agree. This is something we would need to document properly for the user, but we won't work on this for the stable version. I'm moving this issue to the next sprint so to write about this for the user guide.

romicolman commented 3 months ago

@guergana I will add a new section to the user guide called Key recommendations when uploading data to the ODE to incorporate this and other things users should keep in mind. Here is the suggested text:

_Before uploading a file to the ODE…

What do you think?

guergana commented 3 months ago

@guergana I will add a new section to the user guide called Key recommendations when uploading data to the ODE to incorporate this and other things users should keep in mind. Here is the suggested text:

_Before uploading a file to the ODE…

* Make sure your file is well-organized: each column must contain a name, there should be no extra rows before the table, or additional elements (like borders or lines) next to the space where the tabular data is located, etc.

* Check that the tabular data does not contain cells that are merged. Data producers from Governments, international organizations and internal reports usually add tiles,  descriptions, graphs within sheets, like in this [case](https://stat.link/vtenks). If there are extra elements in your file, the ODE will ingest your file and show you multiple errors.

* If you export a file from Google Sheets in csv and you have columns with numbers, please, make sure to use “.” for decimals, instead of  commas. Otherwise, Frictionless, the code working behind the ODE will interpret the content of your cells with numbers as text._

What do you think?

Hi, great work, I have some observations:

First point

For the first point I would change a few words, though, instead of "well-organized" I would say "well-formatted", also I would remove the etc at the end. Etc adds ambiguity and I don't think it's a good idea to have this work in guidelines or docs. This lets the user assume what the etc can be, let's instead provide a complete list of badly formatted cases.

Second point

In the second point, I don't understand what is meant by "ingest" in "the ODE will ingest your file". Do we have maybe another word for this?

Third point

Now what do you think about my observations @romicolman ? :stuck_out_tongue:

romicolman commented 3 months ago

Perfect! I integrated your feedback, Guergana. Thanks a lot!