qri-io / qri

you're invited to a data party!
https://qri.io
GNU General Public License v3.0

qri: enable the inference of CSV field separator #1533

Open aborruso opened 4 years ago

aborruso commented 4 years ago

What feature or capability would you like?

A lot of CSV files in the world are not separated by a comma (,). It would be great to infer the separator so that qri can read every kind of CSV.

Do you have a proposed solution?

No, but here is the Python way to do it:

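import csv
from datapackage import Resource

# Read the raw file contents through datapackage; raw_read() returns bytes.
resource = Resource({'path': 'input.csv'})
sample = resource.raw_read().decode('utf-8')  # assumes UTF-8; sniff() needs text on Python 3
# Let the standard library guess the dialect, including the delimiter.
dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)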
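For reference, the same idea also works with only the standard library. This is a minimal sketch, assuming the sample is already decoded text. The function name sniff_delimiter, the candidate set and the comma fallback are illustrative choices, not anything qri does today:

```python
import csv


def sniff_delimiter(sample: str, candidates: str = ",;\t|", default: str = ",") -> str:
    """Guess the field delimiter of a CSV text sample.

    Only characters in `candidates` are considered. If the sniffer cannot
    decide (for example on a single-column file), `default` is returned.
    """
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # csv.Sniffer raises csv.Error when no delimiter can be determined.
        return default
```

Restricting the candidates keeps the sniffer from picking an ordinary letter that happens to appear once per line. Passing only the first few kilobytes of a large file as the sample is usually enough.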
b5 commented 4 years ago

Love it!

The place where this would land is in the detect package: https://github.com/qri-io/dataset/blob/d12a66b92250109b67cd1b74bca763baa0b847e4/detect/detect.go#L39-L47

We should add a FormatConfig function to the detect package that infers format configuration for a given data format. In the case of CSV files, it should sniff the delimiter.
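To make the shape of that idea concrete, here is a rough sketch in Python (matching the snippet in the issue, not the Go the detect package is written in). Everything in it is an assumption for illustration: the function name format_config, the candidate delimiters, and the "delimiter" key are made up and do not reflect qri's actual FormatConfig fields.

```python
import csv

# Hypothetical candidate set; real-world files may use other separators.
CANDIDATE_DELIMITERS = ",;\t|"


def format_config(data_format: str, sample: str) -> dict:
    """Return inferred format options for a data sample (illustrative only)."""
    if data_format != "csv":
        # This sketch has nothing to infer for other formats.
        return {}
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters=CANDIDATE_DELIMITERS)
    except csv.Error:
        # Sniffing failed (e.g. a single-column file): return no overrides
        # so the caller keeps its baseline configuration.
        return {}
    # "delimiter" is a made-up key name, not a real qri option.
    return {"delimiter": dialect.delimiter}
```

Returning an empty dict on failure, rather than raising, lets callers fall back to whatever defaults they already use.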

A FormatConfig function could also be used to clean up subsequent calls within the detect package itself, which currently uses a baseline format configuration for CSV files:

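func CSVSchema(resource *dataset.Structure, data io.Reader) (schema map[string]interface{}, n int, err error) {
    tr := dsio.NewTrackedReader(data)
    r := csv.NewReader(replacecr.Reader(tr))
    // Baseline configuration: the delimiter (r.Comma) is not set here.
    r.FieldsPerRecord = -1    // allow a variable number of fields per record
    r.TrimLeadingSpace = true // ignore leading whitespace in each field
    r.LazyQuotes = true       // tolerate stray or unescaped quotes
    // ... (excerpt; the rest of the function is not shown)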

If detect.FromReader infers and returns Structure.FormatConfig, it'll bubble up into qri here and should "just work": https://github.com/qri-io/qri/blob/aed31e903d07af8e805d5290934e10f41e95ae21/base/dataset_prepare.go#L178-L188
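The "bubble up" idea amounts to layering inferred options over a baseline before the reader is built. A small Python sketch of that pattern follows. The BASELINE dict and read_rows helper are hypothetical and only loosely mirror the reader settings above; they are not qri code.

```python
import csv
import io

# Hypothetical baseline options, used when nothing was inferred.
BASELINE = {"delimiter": ",", "skipinitialspace": True}


def read_rows(text: str, inferred: dict) -> list:
    """Parse CSV text, letting inferred options override the baseline."""
    options = {**BASELINE, **inferred}  # later keys win
    return list(csv.reader(io.StringIO(text), **options))
```

With an empty inferred dict the behavior is unchanged, so existing comma-separated datasets would keep parsing the same way.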