pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Too much faff getting csv_reader to work. #42479

Closed root-11 closed 2 years ago

root-11 commented 3 years ago

Is your feature request related to a problem?

read_csv has many features, but getting the settings right is quite a pain.

I look after a corporate data analysis platform, and over the course of some 76,000 engineering hours the single largest waste of time came from figuring out how to import the data.

To back this with some numbers: the observations below come from 1.5 TB of zipped csv data (billions of rows of data).

Reading CSV files has been treated as an encoding problem. It evidently isn't. It's a pattern recognition problem.

Existing csv-readers and file sniffers fail on inputs like this ugly example:

Birthdate, (Family\nnames), Father, Mother, Child, known for\n1879-4-14, Einstein, Hermann, Pauline, Albert,"\nGeneral relativity,\nSpecial relativity,\nPhotoelectric effect"

That should parse as (with the text escapes resolved):

Birthdate   (Family      Father    Mother    Child    known for
            names)
1879-4-14   Einstein     Hermann   Pauline   Albert   General relativity,
                                                      Special relativity,
                                                      Photoelectric effect
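Python's stdlib csv module illustrates the boundary well: given the right dialect it copes with the quoted data field (embedded newlines and commas included), but the unquoted newline in the header still breaks the row structure. A minimal demonstration, with the raw string spelling the ugly example out using real newlines:

```python
import csv
import io

# The "ugly example" above, with literal newlines where "\n" appears.
raw = ('Birthdate, (Family\nnames), Father, Mother, Child, known for\n'
       '1879-4-14, Einstein, Hermann, Pauline, Albert,'
       '"\nGeneral relativity,\nSpecial relativity,\nPhotoelectric effect"')

rows = list(csv.reader(io.StringIO(raw)))

# The quoted data field survives intact, newlines and commas included...
assert 'Photoelectric effect' in rows[-1][-1]
# ...but the unquoted newline in "(Family\nnames)" splits the header row,
# so we get 3 physical rows instead of the intended 2.
assert len(rows) == 3
```

This is exactly why the issue frames it as a pattern recognition problem: the dialect is parseable, but deciding which newlines are record terminators requires looking at multiple lines at once.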

Describe the solution you'd like

I would like to add a non-breaking extension to pandas, so that read_csv can receive the keyword "analyse" and infer the parsing settings instead of relying on the default keyword values.

When pandas.read_csv(somepath, analyse=True) is called, the keyword triggers a function at the top of the file reader that analyses the file and determines the parsing settings.

If the function is external to pandas, it could be used as:

import pandas as pd
from csv_analyze import csv_analyze
df = pd.read_csv(somepath, **csv_analyze(somepath))

I can imagine additional output that pandas could react to, such as more detailed information from the analysis:

d = csv_analyze(path)
d
    {"encoding": {"depth": 10043,  # characters checked.
                  'cp855': 10043,  # meaning 10043/10043 = 100% characters match.
                  'utf_8_sig': 12,  # meaning decode error after 12th character.
                  'utf-8': 5,
                  "....."},
     "text_escape": [False, True, False, False, True],
     "column_names": ["Birthdate", "(Family\nnames)", "Father", "Mother ", "Child", "known for"],
     "datatypes": ["date", "str", "str", "str", "str", "str"],
     "metadata": [  # in order like columns, with list of probabilities for each type.
         {"date": {'yyyy-mm-dd': (100, 100),
                   'mm-dd-yyyy': (0, 100),
                   "....": (0, 0)},
          "time": 0,
          "datetime": 0,
          "str": "pass",
          "int": 0,
          "float": 0},
         {"date": ValueError, "time": ValueError, "datetime": ValueError, "str": (100, 100), "int": ValueError,
          "float": ValueError},
         {"date": ValueError, "time": ValueError, "datetime": ValueError, "str": (100, 100), "int": ValueError,
          "float": ValueError},
         {"date": ValueError, "time": ValueError, "datetime": ValueError, "str": (100, 100), "int": ValueError,
          "float": ValueError},
         {"date": ValueError, "time": ValueError, "datetime": ValueError, "str": (100, 100), "int": ValueError,
          "float": ValueError},
         {"date": ValueError, "time": ValueError, "datetime": ValueError, "str": (100, 100), "int": ValueError,
          "float": ValueError}
     ]
     }

API breaking implications

An additional keyword will probably not break much; read_csv already has 52 keywords.

Describe alternatives you've considered

Create a package and have the option for pandas:

import pandas as pd
from csv_analyze import csv_analyze
d = csv_analyze(somepath)
pd.read_csv(somepath, **d['pandas'])

Keeping the package external becomes the pain-point if there is a disconnect between the pd.read_csv kwargs and the dict that csv_analyze returns.

Additional context

The string patterns I've encountered in csv data are:

SEPARATORS

,###.#####  the last non-digit character indicates the decimal point; the preceding, different character is the thousands separator.

.###,       the same pattern with the two roles swapped.

Examples:

4 294 967 295,000  Canadian (English and French), Danish, Finnish, French, German 
4.294.967.295,000  Italian, Norwegian, Spanish, 
4 294 967 295,000  Swedish 
4,294,967,295.000  GB-English, US-English, Thai

(full treatment at: https://en.wikipedia.org/wiki/Decimal_separator)

However, notice that Hindi uses a 2-digit grouping, except for the 3-digit grouping for denoting hundreds: 12,34,56,789.00

SCIENTIFIC NOTATION

###E###    integer before and after the E (lowercase e also occurs)

.####E###  floating point before the E, integer after (lowercase e also occurs)

.###N      float of up to 3 decimals, followed by a magnitude suffix N in (K, M, G/B, T, P, E) for kilo, mega, giga/billion, tera, peta, exa, ...
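The magnitude suffixes can be expanded with a small lookup table. A sketch (the helper name is made up), treating 'B' as an alias for 'G' since both denote billions:

```python
# Suffix meanings as listed above; 'B' is an alias for 'G' (billion).
_SUFFIX = {'K': 1e3, 'M': 1e6, 'G': 1e9, 'B': 1e9,
           'T': 1e12, 'P': 1e15, 'E': 1e18}

def parse_suffixed(s):
    """Expand '.###N'-style magnitude suffixes into plain floats."""
    s = s.strip()
    if s and s[-1].upper() in _SUFFIX:
        return float(s[:-1]) * _SUFFIX[s[-1].upper()]
    return float(s)
```

For example, parse_suffixed('1.5K') gives 1500.0 and plain numbers pass through unchanged.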

POSITIVE/NEGATIVE

Negative numbers can have a leading minus, a trailing minus, or accounting brackets:

527-
-527
(527)
[527]
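Normalising these spellings is straightforward once recognised. A sketch (the helper name is hypothetical; a real analyser would fold this into its number parser):

```python
import re

def parse_signed(s):
    """Normalise trailing-minus and bracketed negative-number spellings."""
    s = s.strip()
    if s.endswith('-'):                       # trailing minus: 527-
        return -float(s[:-1])
    m = re.fullmatch(r'[(\[](.+)[)\]]', s)    # accounting style: (527), [527]
    if m:
        return -float(m.group(1))
    return float(s)                           # plain: -527 or 527
```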

ADDITIONAL SIGNS

The same applies for percentages: 98%, 98 %, 98 pct, %98. And for currencies: $100.00, kr100,00.

NON-LATIN numbers

NUMBER FORMATTING

Script                               Digits used
Latin                                0 1 2 3 4 5 6 7 8 9
Arabic                               ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
Chinese / Japanese                   〇 一 二 三 四 五 六 七 八 九 十 …
Hebrew                               א ב ג ד ה ו ז ח ט …
Korean (Sino-Korean)                 일 이 삼 사 오 육 칠 팔 구 …
Korean (native)                      하나 둘 셋 넷 다섯 여섯 일곱 여덟 아홉 …
Bengali                              ০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯
Devanagari (Hindi, Marathi, ...)     ० १ २ ३ ४ ५ ६ ७ ८ ९
Gujarati                             ૦ ૧ ૨ ૩ ૪ ૫ ૬ ૭ ૮ ૯
Gurmukhi (one script for Punjabi)    ੦ ੧ ੨ ੩ ੪ ੫ ੬ ੭ ੮ ੯
Kannada                              ೦ ೧ ೨ ೩ ೪ ೫ ೬ ೭ ೮ ೯
Malayalam                            ൦ ൧ ൨ ൩ ൪ ൫ ൬ ൭ ൮ ൯
Odia                                 ୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯
Tamil                                ௦ ௧ ௨ ௩ ௪ ௫ ௬ ௭ ௮ ௯
Telugu                               ౦ ౧ ౨ ౩ ౪ ౫ ౬ ౭ ౮ ౯
Thai                                 ๐ ๑ ๒ ๓ ๔ ๕ ๖ ๗ ๘ ๙
Tibetan                              ༠ ༡ ༢ ༣ ༤ ༥ ༦ ༧ ༨ ༩

Korean regularly uses both the Sino-Korean system and the native Korean system: everything that can be counted uses one of the two, but seldom both.
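Helpfully, Python's built-in int() already accepts any Unicode decimal digit, so several of the scripts above need no translation table at all; only non-positional or ideographic systems (一 二 三 …, Hebrew letter-numerals) require a custom lookup:

```python
import unicodedata

# int() understands Unicode decimal digits from any script.
assert int('٤٢') == 42      # Arabic-Indic
assert int('१२३') == 123    # Devanagari
assert int('๙๙') == 99      # Thai

# unicodedata exposes each digit's value for custom handling.
assert unicodedata.digit('٥') == 5

# Ideographic numerals are NOT Unicode decimal digits, so this fails:
try:
    int('一二三')
except ValueError:
    pass
```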

FLOATING POINT The floating-point precision issue (which particularly haunts anyone who reads long barcodes that aren't imported as integers):

val = "0.3066101993807095471566981359501369297504425048828125"
print(float(val))
0.30661019938070955
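For values like barcodes the fix is to avoid binary floats entirely: keep the column as text (e.g. pandas' per-column dtype=str) or parse with decimal.Decimal, which preserves every digit:

```python
from decimal import Decimal

val = "0.3066101993807095471566981359501369297504425048828125"

# float rounds to the nearest 53-bit binary fraction...
assert str(float(val)) == "0.30661019938070955"
# ...while Decimal keeps the digits exactly as written.
assert str(Decimal(val)) == val
```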

NOT A NUMBER

"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "n/a", "nan", "null", "-", "--", "###"
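pandas' read_csv recognises many of these spellings by default; the remainder can be supplied through the na_values keyword. For instance, '--' and '###' are not in the default NA list:

```python
import io
import pandas as pd

data = io.StringIO('x\nNA\n#N/A\n--\n###\n7\n')
# 'NA' and '#N/A' are default NA tokens; '--' and '###' must be added.
df = pd.read_csv(data, na_values=['--', '###'])
print(df['x'].isna().sum())   # 4
```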

Similar variation appears in datetime locale:

yyyy-mm-dd  Canadian (English and French), Danish, German, Swedish
dd.mm.yyyy  Finnish 
dd.mm.yy    Italian, Norwegian 
dd-mm-yy    Spanish 
dd/mm/yy    GB-English 
mm-dd-yy    US-English 
dd/mm/yyyy  Thai 
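With pandas, this ambiguity is resolved by passing an explicit format (or dayfirst=True) to to_datetime or read_csv; the same digit string means different dates per locale:

```python
import pandas as pd

uk = pd.to_datetime('04/07/21', format='%d/%m/%y')   # GB-English: 4 July
us = pd.to_datetime('04-07-21', format='%m-%d-%y')   # US-English: 7 April
assert (uk.day, uk.month) == (4, 7)
assert (us.day, us.month) == (7, 4)
```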

And so for time:

23:59      Canadian 
23.59      Finnish
23.59 Uhr  German
Kl 23.59   Norwegian  
11:59 PM   Thai
11.59 PM   UK English
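Each time spelling maps onto a distinct strptime pattern (prefixes like "Kl" or "Uhr" would need to appear literally in the pattern); two of the rows above:

```python
from datetime import datetime

assert datetime.strptime('11:59 PM', '%I:%M %p').hour == 23   # 12-hour clock
assert datetime.strptime('23.59', '%H.%M').hour == 23         # Finnish style
```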

TEXT ESCAPE Finally, we also see newline characters in headers, which the csv-reader cannot deal with. To detect the correct format of the example below, multiple lines have to be read and the interplay between newlines and separators analysed. These will also have to be text- and bracket-escaped.

Birthdate, (Family\nnames), Father, Mother, Child, known for\n1879-4-14, Einstein, Hermann, Pauline, Albert,"\nGeneral relativity,\nSpecial relativity,\nPhotoelectric effect"

See the table in the introduction for how this example should parse.

LINEBREAK Very old files can have \r, \r\n, or \n line breaks.
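Detecting the terminator is cheap on a byte sample. A sketch (note that Python's own universal-newlines mode, open(..., newline=None), already normalises all three on read):

```python
def detect_newline(sample: bytes) -> bytes:
    """Pick the dominant line terminator in a byte sample."""
    crlf = sample.count(b'\r\n')
    cr = sample.count(b'\r') - crlf   # bare \r, excluding those inside \r\n
    lf = sample.count(b'\n') - crlf   # bare \n, likewise
    counts = {b'\r\n': crlf, b'\r': cr, b'\n': lf}
    return max(counts, key=counts.get)
```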

jreback commented 3 years ago

this would be a huge additional burden on an already overloaded pandas core team - if this is targeted towards pandas mainline

IF a well tested and thoroughly documented package / extension existed we could consider hooking into pandas.

so that's what i would do. create an external package that is separately maintained. sure there might be occasional hiccups with pandas proper but well tested code should catch this

we already allow external entry points for plotting, sql engines and some other hooks - this could be straightforward to add

root-11 commented 3 years ago

@jreback Okay. I will create the package (next week or so) and let you evaluate it and determine if you'd like to hook into it. Deal?

jreback commented 3 years ago

sure can have a look

root-11 commented 3 years ago

Minor update: I've done some research on the topic and a lot of work has been done since 2016. I will publish a summary tomorrow 22/7.

root-11 commented 3 years ago

Minor update: 2+ weeks into the work, I have a model that passes most tests for single table csv. multi-table csv is not solved.

A major issue is that a csv with no header cannot be determined using frequency analysis. For example:

new price old price
1,300.00 2,100.00

Stored as new price;old price\n1,300.00;2,100.00\n, this is identifiable in conjunction with the header: line 0 contains a single punctuation mark (;) whilst line 1 uses three different punctuation marks (, . ;).

without the header, the table may as well read as

new price old price missing
1 300.00;2 100.00

or

new price old price missing
1,300 00;2 100.00
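The ambiguity is easy to reproduce in pandas itself: the same bytes parse to different numbers depending on whether ',' is declared a thousands or a decimal separator, and without a header nothing in the data settles which reading is right:

```python
import io
import pandas as pd

raw = 'a;b\n1,300;2,100\n'

t = pd.read_csv(io.StringIO(raw), sep=';', thousands=',')
d = pd.read_csv(io.StringIO(raw), sep=';', decimal=',')

assert t['a'].iloc[0] == 1300    # ',' read as a thousands separator
assert d['a'].iloc[0] == 1.3     # ',' read as the decimal separator
```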

I'll keep poking at this...

root-11 commented 2 years ago

After much research I've found several cases where the problem is intractable. I surrender ;-)

jreback commented 2 years ago

thanks @root-11

an analyzer that can solve some cases is still of course useful