pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Too much faff getting csv_reader to work. #42479

Closed root-11 closed 2 years ago

root-11 commented 3 years ago

Is your feature request related to a problem?

read_csv has many features, but getting the settings right is quite a pain.

I look after a corporate data analysis platform, and over the course of some 76,000 engineering hours the single largest waste of time came from figuring out how to import the data.

To back this with some numbers: the observations below come from 1.5 TB of zipped csv data (billions of rows of data).

Reading CSV files has been treated as an encoding problem. It evidently isn't. It's a pattern recognition problem.

Existing csv-readers and file sniffers fail on inputs like this ugly example:

Birthdate, (Family\nnames), Father, Mother, Child, known for\n1879-4-14, Einstein, Hermann, Pauline, Albert,"\nGeneral relativity,\nSpecial relativity,\nPhotoelectric effect"

That should parse as (with the text escapes resolved):

Birthdate   (Family      Father    Mother    Child    known for
            names)
1879-4-14   Einstein     Hermann   Pauline   Albert   General relativity,
                                                      Special relativity,
                                                      Photoelectric effect
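Python's stdlib csv module illustrates the boundary well: given the right dialect it copes with the quoted data field (embedded newlines and commas included), but the unquoted newline in the header still breaks the row structure. A minimal demonstration, with the raw string spelling the ugly example out using real newlines:

```python
import csv
import io

# The "ugly example" above, with literal newlines where "\n" appears.
raw = ('Birthdate, (Family\nnames), Father, Mother, Child, known for\n'
       '1879-4-14, Einstein, Hermann, Pauline, Albert,'
       '"\nGeneral relativity,\nSpecial relativity,\nPhotoelectric effect"')

rows = list(csv.reader(io.StringIO(raw)))

# The quoted data field survives intact, newlines and commas included...
assert 'Photoelectric effect' in rows[-1][-1]
# ...but the unquoted newline in "(Family\nnames)" splits the header row,
# so we get 3 physical rows instead of the intended 2.
assert len(rows) == 3
```

This is exactly why the issue frames it as a pattern recognition problem: the dialect is parseable, but deciding which newlines are record terminators requires looking at multiple lines at once.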

Describe the solution you'd like

I would like to add a non-breaking extension to pandas, so that read_csv can receive the keyword "analyse" and infer the parsing settings instead of relying on the default keyword values.

When pandas.read_csv(somepath, analyse=True) is called, the keyword triggers a function at the top of the file reader that analyses the file and determines the parsing settings.

If the function is external to pandas, it could be used as:

import pandas as pd
from csv_analyze import csv_analyze
df = pd.read_csv(somepath, **csv_analyze(somepath))

I can imagine additional output that pandas could react to, such as more detailed information from the analysis:

d = csv_analyze(path)
d
    {"encoding": {"depth": 10043,  # characters checked.
                  'cp855': 10043,  # meaning 10043/10043 = 100% characters match.
                  'utf_8_sig': 12,  # meaning decode error after 12th character.
                  'utf-8': 5,
                  "....."},
     "text_escape": [False, True, False, False, True],
     "column_names": ["Birthdate", "(Family\nnames)", "Father", "Mother ", "Child", "known for"],
     "datatypes": ["date", "str", "str", "str", "str", "str"],
     "metadata": [  # in order like columns, with list of probabilities for each type.
         {"date": {'yyyy-mm-dd': (100, 100),
                   'mm-dd-yyyy': (0, 100),
                   "....": (0, 0)},
          "time": 0,
          "datetime": 0,
          "str": "pass",
          "int": 0,
          "float": 0},
         {"date": ValueError, "time": ValueError, "datetime": ValueError, "str": (100, 100), "int": ValueError,
          "float": ValueError},
         {"date": ValueError, "time": ValueError, "datetime": ValueError, "str": (100, 100), "int": ValueError,
          "float": ValueError},
         {"date": ValueError, "time": ValueError, "datetime": ValueError, "str": (100, 100), "int": ValueError,
          "float": ValueError},
         {"date": ValueError, "time": ValueError, "datetime": ValueError, "str": (100, 100), "int": ValueError,
          "float": ValueError},
         {"date": ValueError, "time": ValueError, "datetime": ValueError, "str": (100, 100), "int": ValueError,
          "float": ValueError}
     ]
     }

API breaking implications

An additional keyword will probably not break much; read_csv already has 52 keywords.

Describe alternatives you've considered

Create a package and have the option for pandas:

import pandas as pd
from csv_analyze import csv_analyze
d = csv_analyze(somepath)
pd.read_csv(somepath, **d['pandas'])

Keeping the package external becomes the pain-point if there is a disconnect between the pd.read_csv kwargs and the dict that csv_analyze returns.

Additional context

The string patterns I've encountered in csv data are:

SEPARATORS

,###.#####  the last non-digit character indicates the decimal point; the preceding, different character is the thousands separator.

.###,       the same pattern with the two roles swapped.

Examples:

4 294 967 295,000  Canadian (English and French), Danish, Finnish, French, German 
4.294.967.295,000  Italian, Norwegian, Spanish, 
4 294 967 295,000  Swedish 
4,294,967,295.000  GB-English, US-English, Thai

(full treatment at: https://en.wikipedia.org/wiki/Decimal_separator)

However, notice that Hindi uses a 2-digit grouping, except for the 3-digit grouping for denoting hundreds: 12,34,56,789.00

SCIENTIFIC NOTATION

###E###    integer before and after the E (lowercase e also occurs)

.####E###  floating point before the E, integer after (lowercase e also occurs)

.###N      float of up to 3 decimals, followed by a magnitude suffix N in (K, M, G/B, T, P, E) for kilo, mega, giga/billion, tera, peta, exa, ...
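The magnitude suffixes can be expanded with a small lookup table. A sketch (the helper name is made up), treating 'B' as an alias for 'G' since both denote billions:

```python
# Suffix meanings as listed above; 'B' is an alias for 'G' (billion).
_SUFFIX = {'K': 1e3, 'M': 1e6, 'G': 1e9, 'B': 1e9,
           'T': 1e12, 'P': 1e15, 'E': 1e18}

def parse_suffixed(s):
    """Expand '.###N'-style magnitude suffixes into plain floats."""
    s = s.strip()
    if s and s[-1].upper() in _SUFFIX:
        return float(s[:-1]) * _SUFFIX[s[-1].upper()]
    return float(s)
```

For example, parse_suffixed('1.5K') gives 1500.0 and plain numbers pass through unchanged.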

POSITIVE/NEGATIVE

Negative numbers can have a leading minus, a trailing minus, or accounting brackets:

527-
-527
(527)
[527]
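Normalising these spellings is straightforward once recognised. A sketch (the helper name is hypothetical; a real analyser would fold this into its number parser):

```python
import re

def parse_signed(s):
    """Normalise trailing-minus and bracketed negative-number spellings."""
    s = s.strip()
    if s.endswith('-'):                       # trailing minus: 527-
        return -float(s[:-1])
    m = re.fullmatch(r'[(\[](.+)[)\]]', s)    # accounting style: (527), [527]
    if m:
        return -float(m.group(1))
    return float(s)                           # plain: -527 or 527
```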

ADDITIONAL SIGNS

The same applies for percentages: 98%, 98 %, 98 pct, %98. And for currencies: $100.00, kr100,00.

NON-LATIN numbers

NUMBER FORMATTING

Script                               Digits used
Latin                                0 1 2 3 4 5 6 7 8 9
Arabic                               ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
Chinese / Japanese                   〇 一 二 三 四 五 六 七 八 九 十 …
Hebrew                               א ב ג ד ה ו ז ח ט …
Korean (Sino-Korean)                 일 이 삼 사 오 육 칠 팔 구 …
Korean (native)                      하나 둘 셋 넷 다섯 여섯 일곱 여덟 아홉 …
Bengali                              ০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯
Devanagari (Hindi, Marathi, ...)     ० १ २ ३ ४ ५ ६ ७ ८ ९
Gujarati                             ૦ ૧ ૨ ૩ ૪ ૫ ૬ ૭ ૮ ૯
Gurmukhi (one script for Punjabi)    ੦ ੧ ੨ ੩ ੪ ੫ ੬ ੭ ੮ ੯
Kannada                              ೦ ೧ ೨ ೩ ೪ ೫ ೬ ೭ ೮ ೯
Malayalam                            ൦ ൧ ൨ ൩ ൪ ൫ ൬ ൭ ൮ ൯
Odia                                 ୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯
Tamil                                ௦ ௧ ௨ ௩ ௪ ௫ ௬ ௭ ௮ ௯
Telugu                               ౦ ౧ ౨ ౩ ౪ ౫ ౬ ౭ ౮ ౯
Thai                                 ๐ ๑ ๒ ๓ ๔ ๕ ๖ ๗ ๘ ๙
Tibetan                              ༠ ༡ ༢ ༣ ༤ ༥ ༦ ༧ ༨ ༩

Korean regularly uses both the Sino-Korean system and the native Korean system: everything that can be counted uses one of the two, but seldom both.
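Helpfully, Python's built-in int() already accepts any Unicode decimal digit, so several of the scripts above need no translation table at all; only non-positional or ideographic systems (一 二 三 …, Hebrew letter-numerals) require a custom lookup:

```python
import unicodedata

# int() understands Unicode decimal digits from any script.
assert int('٤٢') == 42      # Arabic-Indic
assert int('१२३') == 123    # Devanagari
assert int('๙๙') == 99      # Thai

# unicodedata exposes each digit's value for custom handling.
assert unicodedata.digit('٥') == 5

# Ideographic numerals are NOT Unicode decimal digits, so this fails:
try:
    int('一二三')
except ValueError:
    pass
```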

FLOATING POINT The floating-point precision issue (which particularly haunts anyone who reads long barcodes that aren't imported as integers):

val = "0.3066101993807095471566981359501369297504425048828125"
print(float(val))
0.30661019938070955
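For values like barcodes the fix is to avoid binary floats entirely: keep the column as text (e.g. pandas' per-column dtype=str) or parse with decimal.Decimal, which preserves every digit:

```python
from decimal import Decimal

val = "0.3066101993807095471566981359501369297504425048828125"

# float rounds to the nearest 53-bit binary fraction...
assert str(float(val)) == "0.30661019938070955"
# ...while Decimal keeps the digits exactly as written.
assert str(Decimal(val)) == val
```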

NOT A NUMBER

"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "n/a", "nan", "null", "-", "--", "###"
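pandas' read_csv recognises many of these spellings by default; the remainder can be supplied through the na_values keyword. For instance, '--' and '###' are not in the default NA list:

```python
import io
import pandas as pd

data = io.StringIO('x\nNA\n#N/A\n--\n###\n7\n')
# 'NA' and '#N/A' are default NA tokens; '--' and '###' must be added.
df = pd.read_csv(data, na_values=['--', '###'])
print(df['x'].isna().sum())   # 4
```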

Similar variation appears in datetime locale:

yyyy-mm-dd  Canadian (English and French), Danish, German, Swedish
dd.mm.yyyy  Finnish 
dd.mm.yy    Italian, Norwegian 
dd-mm-yy    Spanish 
dd/mm/yy    GB-English 
mm-dd-yy    US-English 
dd/mm/yyyy  Thai 
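With pandas, this ambiguity is resolved by passing an explicit format (or dayfirst=True) to to_datetime or read_csv; the same digit string means different dates per locale:

```python
import pandas as pd

uk = pd.to_datetime('04/07/21', format='%d/%m/%y')   # GB-English: 4 July
us = pd.to_datetime('04-07-21', format='%m-%d-%y')   # US-English: 7 April
assert (uk.day, uk.month) == (4, 7)
assert (us.day, us.month) == (7, 4)
```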

And so for time:

23:59      Canadian 
23.59      Finnish
23.59 Uhr  German
Kl 23.59   Norwegian  
11:59 PM   Thai
11.59 PM   UK English
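Each time spelling maps onto a distinct strptime pattern (prefixes like "Kl" or "Uhr" would need to appear literally in the pattern); two of the rows above:

```python
from datetime import datetime

assert datetime.strptime('11:59 PM', '%I:%M %p').hour == 23   # 12-hour clock
assert datetime.strptime('23.59', '%H.%M').hour == 23         # Finnish style
```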

TEXT ESCAPE Finally, we also see newline characters in headers, which the csv-reader cannot deal with. To detect the correct format of the example below, multiple lines have to be read and the interplay between newlines and separators analysed. These will also have to be text- and bracket-escaped.

Birthdate, (Family\nnames), Father, Mother, Child, known for\n1879-4-14, Einstein, Hermann, Pauline, Albert,"\nGeneral relativity,\nSpecial relativity,\nPhotoelectric effect"

See the table in the introduction for how this example should parse.

LINEBREAK Very old files can have \r, \r\n, or \n line breaks.
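Detecting the terminator is cheap on a byte sample. A sketch (note that Python's own universal-newlines mode, open(..., newline=None), already normalises all three on read):

```python
def detect_newline(sample: bytes) -> bytes:
    """Pick the dominant line terminator in a byte sample."""
    crlf = sample.count(b'\r\n')
    cr = sample.count(b'\r') - crlf   # bare \r, excluding those inside \r\n
    lf = sample.count(b'\n') - crlf   # bare \n, likewise
    counts = {b'\r\n': crlf, b'\r': cr, b'\n': lf}
    return max(counts, key=counts.get)
```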

jreback commented 3 years ago

this would be a huge additional burden on an already overloaded pandas core team - if this is targeted towards pandas mainline

IF a well tested and thoroughly documented package / extension existed we could consider hooking into pandas.

so that's what i would do. create an external package that is separately maintained. sure there might be occasional hiccups with pandas proper but well tested code should catch this

we already allow external entry points for plotting, sql engines and some other hooks - this could be straightforward to add

root-11 commented 3 years ago

@jreback Okay. I will create the package (next week or so) and let you evaluate it and determine if you'd like to hook into it. Deal?

jreback commented 3 years ago

sure can have a look

root-11 commented 3 years ago

Minor update: I've done some research on the topic and a lot of work has been done since 2016. I will publish a summary tomorrow 22/7.

root-11 commented 3 years ago

Minor update: 2+ weeks into the work, I have a model that passes most tests for single table csv. multi-table csv is not solved.

A major issue is that a csv with no header cannot be determined using frequency analysis. For example:

new price old price
1,300.00 2,100.00

Stored as new price;old price\n1,300.00;2,100.00\n, this is identifiable in conjunction with the header: line 0 contains a single punctuation mark (;) whilst line 1 uses three different punctuation marks (, . ;).

without the header, the table may as well read as

new price old price missing
1 300.00;2 100.00

or

new price old price missing
1,300 00;2 100.00
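The ambiguity is easy to reproduce in pandas itself: the same bytes parse to different numbers depending on whether ',' is declared a thousands or a decimal separator, and without a header nothing in the data settles which reading is right:

```python
import io
import pandas as pd

raw = 'a;b\n1,300;2,100\n'

t = pd.read_csv(io.StringIO(raw), sep=';', thousands=',')
d = pd.read_csv(io.StringIO(raw), sep=';', decimal=',')

assert t['a'].iloc[0] == 1300    # ',' read as a thousands separator
assert d['a'].iloc[0] == 1.3     # ',' read as the decimal separator
```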

I'll keep poking at this...

root-11 commented 2 years ago

After much research I've found several cases where the problem is intractable. I surrender ;-)

jreback commented 2 years ago

thanks @root-11

an analyzer that can solve some cases is still of course useful