Implement column types to allow constraining input

krassowski commented 4 years ago

Idea: implement a column-wise validation of the user input using the column types feature of jExcel.

I frequently find myself having to work with the spreadsheets having mistakes in data entry process (like dates being in UK and US format). Having the (optional) column types could alleviate at least some of the problem (i.e. decrease probability of myself and you from creating invalid files; cannot help about the other Excel users though).

The design idea is to have a bar above the column names with a drop-down allowing to choose column type. By default all would be "text", and the bar would be hidden. On pressing a toolbar button the bar with type drop-downs would show up and the user would be able to choose the column type.

Questions to the community/users/myself:

[x] should the information about column types be stored in the CSV/TSV file?
[x] if yes, is there any standard specifying how to store the column type information?
[x] should we allow full-featured html entries (such as photos, see the jExcel example)? How would we save those?
[ ] column types have options (e.g. mask for currency, or choices for dropdown); how those should be specified? Should those be saved as well?

krassowski commented 4 years ago

Column type specification formats for CSV files in use

Separate `.csvt` file

This idea is used by qgis (which is a tool from the geospatial data world) and is a second csv file which has a single line and specifies the types of the columns.

Something-else-separated suffix after column name

neo4j uses colon-separated format of name:type for database imports: documentation.

:ID , name, joined:date, active:boolean, points:int

A similar idea but using | characters is used by spatialkey

Prefixes before headers

Another tool, flexmonster uses prefixes before column names; those look similar to C format specifiers.

Official recommendations and proposed standards

The stance of a UK government group

The UK Government Digital Service recommends not to include any additional data in CSV files:

This standard does not cover extra information about the tabular data, such as the column types, or table contents validation. You should keep this type of information separate from a CSV file to avoid it conflicting with the data structure as described in RFC 4180.

However, I could not find any such recommendation in RFC 4180. The neo4j format appears to be RFC 4180 compatible. I think the only problem would be if someone wanted to store this information in a new line (i.e. it would no longer be compatible).

W3C

The CSV on the Web: A Primer from W3C Working Group, from 25 February 2016 states:

There is no mechanism within CSV to indicate the type of data in a particular column, or whether values in a particular column must be unique. It is therefore hard to validate and prone to errors such as missing values or differing data types within a column.

And then proposes a schema for describing and validating CSV files. For the following file:

"country","name (en)","latitude","longitude"
"at","Austria","47.6965545","13.34598005"
"be","Belgium","50.501045","4.47667405"

they propose a schema like this:

{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "countries.csv",
  "tableSchema": {
    "columns": [{
      "titles": "country",
      "datatype": "string"
    },{
      "titles": "name (en)",
      "datatype": "string"
    },{
      "titles": "latitude",
      "datatype": "number"
    },{
      "titles": "longitude",
      "datatype": "number"
    }]
  }
}

It appears to be a very advanced schema, with a huge potential for validation purposes. The document is a final draft and no recommendation nor endorsment of it was issued by W3C, so it is not an official standard.

Repo: https://github.com/w3c/csvw The follow-up happens in https://www.w3.org/community/csvw/.

There is also another, older schema format: https://specs.frictionlessdata.io//tabular-data-package/

krassowski commented 4 years ago

I am leaning towards either implementing the W3C-proposed schema, or towards .csvt. The former is more flexible, the latter is easier. UX-wise this should be optional - I would not want to have a new file created for every csv I edit. Not sure how I would communicate it in the UI.

The idea with adding a second header may be sub-optimal because it would be a pain to parse. Similarly the prefixes/suffixes. A lot can go wrong.

As I need the tool now (now is better than never), I will proceed with the .csvt approach (unless there is any feedback soonish). The file will be only created after the user changes a type of a column. In the future the schema approach can be added and the user could be allowed to configure which of the approaches to use in the settings. And the user could disable writing the csvt/schema files too.

krassowski commented 4 years ago

Maybe there could be a checkbox (off by default):

☐ Save the column types (csvt)

which would be only shown if the file was detected or if a type of any column was changed?

krassowski commented 4 years ago

Also, this would be useless if viewing an Excel spreadsheet, so should have a condition on the filename not ending with xls/xlsx (or on mimetype not being Excel)

krassowski commented 4 years ago

should we allow full-featured html entries (such as photos, see the jExcel example)? How would we save those?

Maybe we should just allow any html (thus also allowing formating, things like bold and italic). Maybe this should be opt-in. I do not know how to communicate this in UI. Maybe we should have an option "clean all formating"

Note: HTML can be easier to parse than Excel.

krassowski commented 4 years ago

Proof of concept implemented in eb3c879be8b7f868e89723a3eb2a420fe191132c (v0.3):

Screenshot from 2020-06-05 15-47-52

The types are not stored yet and styling requires work - but it is just POC.

jupyterlab-contrib / jupyterlab-spreadsheet-editor