okfn / opendataeditor

No-code application to explore and publish all kinds of data: datasets, tables, charts, maps, stories, and more. Forever free and open source project powered by open standards and generative AI.
http://opendataeditor.okfn.org
MIT License

Problem when opening a file originally in xls in csv format from Google Sheets - Discussion/Research? #422

Open romicolman opened 2 weeks ago

romicolman commented 2 weeks ago

Problem description

If you import an xls dataset into Google Sheets that contains any kind of problem the ODE would detect as an error, and then download the dataset as csv to open it in the ODE, the app detects different issues depending on the export format.

Steps to reproduce it

Comments

When exporting the file in xls, the ODE will only detect blank labels for columns.

Screenshot 2024-06-11 at 10:46:13 a.m.

When exporting the file in csv, the ODE will inform the user about problems in the type of data in columns.

Screenshot 2024-06-11 at 10:45:57 a.m.

Is there any explanation for this or is it a bug?

guergana commented 2 days ago

Hello @romicolman, I can more or less reproduce the issue. This is what I get: the errors in the report are the same, but I don't get the extra empty columns on the right that appear in your screenshot.

Image

I know why the error is happening. This column holds decimal numbers that use a comma as the decimal separator. Since csv by definition uses the comma to separate columns, Google Sheets exports these values as quoted strings (you can see in the csv that the numbers in this column are wrapped in "", for example "3,04" in the first row). Ideally the data would be exported as value,value,3.04 with a dot as the decimal separator, but we can't control how users set up formatting in Google Sheets, so we have to find a solution to this exact issue.
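To illustrate the export behaviour described above, here is a minimal sketch using only Python's standard `csv` module. The sample rows and the `parse_decimal_comma` helper are made up for illustration; they are not taken from the reporter's file or from the ODE code.

```python
import csv
import io

# Hypothetical sample that mimics a Google Sheets CSV export
# where the last column uses a comma as the decimal separator.
sample = 'name,year,value\nA,2020,"3,04"\nB,2021,"12,5"\n'

rows = list(csv.reader(io.StringIO(sample)))
header, data = rows[0], rows[1:]

def parse_decimal_comma(text: str) -> float:
    """Convert a comma-decimal string such as '3,04' to a float."""
    return float(text.replace(",", "."))

# The quotes keep "3,04" together as one field, but it arrives as text.
values = [parse_decimal_comma(row[2]) for row in data]

print(data[0])  # ['A', '2020', '3,04']
print(values)   # [3.04, 12.5]
```

The point is that the csv itself is well formed: the quoting preserves the field boundaries, and the ambiguity only appears when something tries to guess a type for the text `3,04`.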

Frictionless is detecting a field with the format "number,number" as a geojson column type for some reason, when it should be detected as a string, because that is how the value appears in the csv. Ideally this case would be detected and imported as a number. Even better, I think the user should be asked at some point during import what the schema type of the column is whenever the type is ambiguous like this.

To make the error go away for now, set the column's schema field type to string (it is initially set to geojson):

Image

But even this is not exactly what we want. Google Sheets exports the value as a string, but we should have the option to read it as a number, and currently the ODE also reports an error if you change the schema type to number. You found a very particular case, lol.
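For reference, the two field descriptors being discussed might look like the sketch below. The field name `value` is hypothetical. `decimalChar` is a property that the Table Schema standard defines for number fields; whether the ODE exposes it, and whether setting it resolves the error mentioned above, is an assumption that has not been checked here. `read_cell` is a stand-in written for this example, not the Frictionless parser.

```python
# Workaround from this comment: treat the column as plain text.
string_field = {"name": "value", "type": "string"}

# Possible alternative (assumption): a number field with a comma decimal mark.
number_field = {"name": "value", "type": "number", "decimalChar": ","}

def read_cell(cell: str, field: dict):
    """Illustrative reader that honours decimalChar for number fields."""
    if field["type"] == "number":
        decimal_char = field.get("decimalChar", ".")
        return float(cell.replace(decimal_char, "."))
    return cell

print(read_cell("3,04", string_field))  # 3,04
print(read_cell("3,04", number_field))  # 3.04
```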

I don't think the report message is particularly helpful for finding out what the problem is. Many non-technical users probably wouldn't be able to figure out why the column is producing an error, so I think this is also a good opportunity to suggest more detailed messages for type errors.

To think of a solution to this issue we need to know why it was designed like this. Why is this specific format imported as geojson? I think @roll can help us with that.

@pdelboca what are your suggestions for an issue like this?

romicolman commented 18 hours ago

Hi @guergana! Let's wait for @roll's comments, but in the meantime I'll move this to sprint 6 so I can document it properly for the user guide if needed.

guergana commented 13 hours ago

@romicolman As suspected, the default format for geopoint in the Data Package standard is "lon, lat" (https://datapackage.org/standard/table-schema/#geopoint), so this is not an issue with the app per se. We need to either make this compatibility issue with Google Sheets csv exports clear to users or find a way to mitigate it.
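The overlap can be sketched with two regular expressions. These patterns are assumptions written for this example and are not the detection logic Frictionless actually uses; they only show that a cell such as `3,04` fits both the "number, number" geopoint shape and a comma-decimal number.

```python
import re

# Assumed shape of the default geopoint text form: two numbers
# separated by a comma, with optional whitespace around it.
GEOPOINT_LIKE = re.compile(r"^\s*-?\d+(\.\d+)?\s*,\s*-?\d+(\.\d+)?\s*$")

# Assumed shape of a number written with a comma as the decimal mark.
COMMA_DECIMAL = re.compile(r"^\s*-?\d+,\d+\s*$")

def candidate_types(cell: str) -> list[str]:
    """Return every interpretation the cell text is compatible with."""
    candidates = []
    if GEOPOINT_LIKE.match(cell):
        candidates.append("geopoint")
    if COMMA_DECIMAL.match(cell):
        candidates.append("comma-decimal number")
    return candidates or ["string"]

print(candidate_types("3,04"))          # ['geopoint', 'comma-decimal number']
print(candidate_types("90.50, 45.50"))  # ['geopoint']
print(candidate_types("3.04"))          # ['string']
```

A check along these lines could be one way to flag ambiguous columns and ask the user to confirm the type on import, as suggested earlier in the thread.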