jazzband / tablib

Python Module for Tabular Datasets in XLS, CSV, JSON, YAML, &c.
https://tablib.readthedocs.io/
MIT License

csv file format not auto detected. #502

Closed knowingpark closed 2 years ago

knowingpark commented 2 years ago

I was surprised to get an error when loading a CSV file:

tablib.exceptions.UnsupportedFormat: Tablib has no format 'None' or it is not registered.

I got it to load by adding a format='csv' parameter to the load call. That didn't seem right to me, so I looked a bit deeper.

The trail led all the way into Python's own csv module, which is where the problem originates. The csv module has a _guess_delimiter method that doesn't play well with the 1024-byte sample that formats/_csv.py sends it: _guess_delimiter doesn't expect partial lines. The partial line at the end of the sample doesn't have enough columns, so _guess_delimiter gives up and raises an Error, which trips the try/except in _csv.detect().

I made it work by building a stream_sample that ends on a '\n':

@classmethod
    def detect(cls, stream, delimiter=None):
        # Drop the trailing partial line so the sample ends on a '\n'
        stream_sample = "\n".join(stream.read(1024).split("\n")[:-1])
        try:
            csv.Sniffer().sniff(
                stream_sample, delimiters=delimiter or cls.DEFAULT_DELIMITER
            )
            return True
        except Exception:
            return False

I'm using Python 3.8 and tablib 3.0.0. The problem should probably be resolved in the csv module itself! I've never made a PR, so I'm happy to let someone else fix this, if it's decided that it is a real problem.
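The failure and the fix can be reproduced with the csv module alone. A minimal sketch (the three-line sample here is made up; any sample whose last row is truncated mid-field behaves the same way):

```python
import csv

# A tiny sample whose last row is cut off mid-field
sample_partial = "a,b,c\n1,2,3\n4"
# The same sample with the trailing partial line chopped off
sample_trimmed = "\n".join(sample_partial.split("\n")[:-1])

try:
    csv.Sniffer().sniff(sample_partial, delimiters=",")
    print("partial: detected")
except csv.Error:
    print("partial: not detected")

csv.Sniffer().sniff(sample_trimmed, delimiters=",")  # no exception this time
print("trimmed: detected")
```

With the stdlib sniffer described in this thread, the first sniff raises "Could not determine delimiter" and the trimmed one succeeds.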

hugovk commented 2 years ago

Please can you provide a minimal script and CSV file we can run to reproduce the error?

julianebeli commented 2 years ago

A minimal script:

from tablib import Dataset

with open('sociologists.csv', 'r', encoding='utf8') as fh:
    data = Dataset().load(fh)
    #boom!
print(data)

Some data: sociologists.csv

Adding the code from my previous post to formats._csv.detect() fixes it.

I've been looking at the csv module; it may be that the 1024-byte sample is too small for the 'guesser' to work, but a small sample that ends on a newline works. 1024 bytes of the file I provided is only 6 full rows plus a fragment of the next row's first column. The 'guesser' counts 7 commas 6 times, then 0 commas once, and concludes "," cannot be the delimiter because not all the rows have the same number of commas. The algorithm is more reliable with hundreds of rows.
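The arithmetic behind that conclusion can be sketched directly. This is a rough imitation of the consistency check inside the stdlib's _guess_delimiter, not the module's actual code; the counts (7 commas in 6 rows, 0 in the partial one) are the ones described above:

```python
from collections import Counter

# Comma counts per line in a hypothetical 1024-byte sample:
# 6 full rows with 7 commas each, plus a truncated row with none.
comma_counts = [7] * 6 + [0]

freq = Counter(comma_counts)                  # {7: 6, 0: 1}
mode, mode_lines = freq.most_common(1)[0]     # most common count: (7, 6)
other_lines = len(comma_counts) - mode_lines  # rows disagreeing with the mode

# The sniffer wants roughly 90% of rows to agree on the delimiter count,
# and rows that disagree are subtracted from the mode's score.
consistency = (mode_lines - other_lines) / len(comma_counts)
print(round(consistency, 2))  # 0.71 -- below the 0.9 threshold, so it gives up
```

One stray partial row drags the score from 1.0 down to about 0.71, which is why hundreds of rows make the heuristic far more robust.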

BTW, I'm now also getting the error on Linux with Python 3.7.3 and tablib 2.0.0.

hugovk commented 2 years ago

Thanks! To detect CSVs, Tablib is using csv.Sniffer().sniff with delimiters=",".

Python docs: https://docs.python.org/3/library/csv.html?highlight=csv#csv.Sniffer

No exception means we detected a CSV; an exception means we didn't.

It's important to note the note in the docs:

Note: This method is a rough heuristic and may produce both false positives and negatives.


Here's a script to reproduce it without using Tablib, where abc.csv is a much simpler CSV:

a,b,sociologyLabel
1,2,Yevgeny Prokhorov

And the script:

import csv

with open("abc.csv") as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=",")
    print(dialect)

with open("sociologists.csv") as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=",")
    print(dialect)

The first one is okay, but the second is not:

<class 'csv.Sniffer.sniff.<locals>.dialect'>
Traceback (most recent call last):
  File "/Users/hugo/github/tablib/1.py", line 10, in <module>
    dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=",")
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/csv.py", line 187, in sniff
    raise Error("Could not determine delimiter")
_csv.Error: Could not determine delimiter

As you say, the csv module's _guess_delimiter is unsuccessful with 1024 characters of the longer sociologists.csv data, and your analysis sounds about right: the guesser works better with more rows, and sociologists.csv has long rows, so fewer of them fit in the sample.

If we switch from 1024 to 2048 we are successful:

 import csv

 with open("abc.csv") as csvfile:
-    dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=",")
+    dialect = csv.Sniffer().sniff(csvfile.read(2048), delimiters=",")
     print(dialect)

 with open("sociologists.csv") as csvfile:
-    dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=",")
+    dialect = csv.Sniffer().sniff(csvfile.read(2048), delimiters=",")
     print(dialect)
and both dialects are detected:

<class 'csv.Sniffer.sniff.<locals>.dialect'>
<class 'csv.Sniffer.sniff.<locals>.dialect'>

So what to do?

  1. If the user knows they're using CSV, they can declare the format when loading:
 from tablib import Dataset

 with open('sociologists.csv', 'r', encoding='utf8') as fh:
-    data = Dataset().load(fh)
+    data = Dataset().load(fh, "csv")
-    #boom!
 print(data)
  2. We could switch Tablib to read 2048 bytes (or more? Some other formats read the whole file) instead of 1024. Even longer CSV lines may still not be autodetected, but option 1 is always available. Thoughts?

  3. If you want to report this to CPython, see the devguide for instructions, specifically https://devguide.python.org/tracker/. You'd need to create an issue on https://bugs.python.org/ and then a pull request on https://github.com/python/cpython. There are already a few CSV sniffer issues, so check first whether something similar exists. It's possible they won't accept changes, given the "rough heuristic" note and/or the option of increasing the input sample, but you never know!
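For option 2, the effect of the sample size can be checked in memory, without the attached file. A sketch with made-up data, padded so that 1024 bytes holds only seven full rows (mirroring the sociologists.csv case):

```python
import csv

# Hypothetical long-row CSV: 8 columns of 17-character fields,
# about 144 bytes per row, so read(1024) yields 7 rows plus a fragment.
row = ",".join(["x" * 17] * 8) + "\n"
data = row * 50

for size in (1024, 2048):
    try:
        csv.Sniffer().sniff(data[:size], delimiters=",")
        print(size, "detected")
    except csv.Error:
        print(size, "not detected")
```

With this data the 1024-byte sample is not detected while the 2048-byte one is; rows longer still would need an even bigger sample, which is why declaring the format (option 1) remains the reliable fallback.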

knowingpark commented 2 years ago

Hi, in response to your test: abc.csv is less than 1024 bytes, which means you passed the entire file to the sniffer, so all the rows have the same number of columns. The test fails if you add a new partial row:

a,b,sociologyLabel
1,2,Yevgeny Prokhorov
2

It doesn't work with my file for the same reason: the last line in the 1024-byte sample is not a full row (it is not even a full column). This is the real problem. The last line skews the statistics that the sniffer is calculating.

When I take the 1024-byte sample of my CSV and chop off the last, incomplete line, the sniffer works. My conclusion is that, in a small sample, an incomplete CSV row breaks the sniffer's algorithm.

Increasing the sample size is a good option, and so is not sending incomplete rows. Doing both is even better!
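A sketch of what detect could look like with both changes combined, reusing the names from the snippet earlier in the thread (the 2048-byte sample and the trimming step are the proposals under discussion here, not tablib's current code):

```python
import csv

DEFAULT_DELIMITER = ","
SAMPLE_SIZE = 2048  # doubled from 1024


def detect(stream, delimiter=None):
    """Return True if the stream sniffs as CSV (a sketch, not tablib's code)."""
    sample = stream.read(SAMPLE_SIZE)
    if "\n" in sample:
        # Chop off the trailing partial line so every sampled row is complete
        sample = sample[:sample.rfind("\n")]
    try:
        csv.Sniffer().sniff(sample, delimiters=delimiter or DEFAULT_DELIMITER)
        return True
    except Exception:
        return False
```

With this version a sample that ends mid-row no longer defeats the sniffer, at the modest cost of discarding the last sampled line even when it happened to be complete but unterminated.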

hugovk commented 2 years ago

Thanks! Interesting that the toy example also breaks so easily with a partial row!

Anyway, as for ensuring we only send complete lines, I'm wondering whether that preprocessing might have adverse effects or a performance hit, especially when we're running non-CSV data through it.

But I think doubling the input is an easy and safe thing to do, so a PR is welcome for that.

knowingpark commented 2 years ago

Cool. Please point me to the PR instructions. It will be my first one :) Exciting!

hugovk commented 2 years ago

Great!

We have a brief guide at https://github.com/jazzband/tablib/blob/master/.github/CONTRIBUTING.md

Here's a video showing how to make your first PR:

https://hacktoberfest.digitalocean.com/resources

And another guide at:

https://guides.github.com/activities/forking/