Closed: root-11 closed this issue 2 years ago.
this would be a huge additional burden on an already overloaded pandas core team, if this is targeted towards pandas mainline.
IF a well-tested and thoroughly documented package / extension existed, we could consider hooking it into pandas.
so that's what I would do: create an external package that is separately maintained. sure, there might be occasional hiccups with pandas proper, but well-tested code should catch this.
we already allow external entry points for plotting, SQL engines and some other hooks - this could be straightforward to add
@jreback Okay. I will create the package (next week or so), then let you evaluate it and determine if you'd like to hook into it. Deal?
sure can have a look
Minor update: I've done some research on the topic, and a lot of work has been done since 2016. I will publish a summary tomorrow, 22/7.
Minor update: 2+ weeks into the work, I have a model that passes most tests for single-table csv. Multi-table csv is not solved.
A major issue is that the layout of a csv with no header cannot be determined using frequency analysis. For example:
new price | old price |
---|---|
1,300.00 | 2,100.00 |
stored as `new price;old price\n1,300.00;2,100.00\n`
will be identifiable in conjunction with the header, as line 0 has one common punctuation mark (`;`) whilst line 1 uses three punctuation marks (`,`, `.`, `;`).
without the header, the table may as well read as
new price | old price | missing |
---|---|---|
1 | 300.00;2 | 100.00 |
or
new price | old price | missing |
---|---|---|
1,300 | 00;2 | 100.00 |
I'll keep poking at this...
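The frequency-analysis idea can be sketched as follows: a delimiter candidate must occur the same non-zero number of times on every line, and it is exactly the header line that disambiguates the data row. This is a minimal illustration of the heuristic, not the actual model described above:

```python
from collections import Counter

CANDIDATES = ",;|\t"

def consistent_delimiters(lines):
    """Return the candidate delimiters that occur the same non-zero
    number of times on every line -- a simple frequency heuristic."""
    counts = [Counter(ch for ch in line if ch in CANDIDATES) for line in lines]
    return [d for d in CANDIDATES
            if counts[0][d] > 0 and all(c[d] == counts[0][d] for c in counts)]

# With the header, ';' is the only consistent candidate:
print(consistent_delimiters(["new price;old price", "1,300.00;2,100.00"]))  # [';']
# Without it, ',' and ';' are both plausible -- the ambiguity shown above:
print(consistent_delimiters(["1,300.00;2,100.00"]))  # [',', ';']
```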
After much research I've found several cases where the problem is intractable. I surrender ;-)
thanks @root-11
an analyzer that can solve some cases is still of course useful
### Is your feature request related to a problem?
The csv reader (`pd.read_csv`) has many features, but getting the settings right is quite a pain.
I look after a corporate data analysis platform and found that the largest waste of time, over the course of 76,000 engineering hours, came from figuring out how to import the data.
To back this with some numbers, I find from 1.5 TB of zipped csv data (billions of rows):
Reading CSV files has been treated as an encoding problem. It evidently isn't. It's a pattern recognition problem.
Existing csv-readers and file sniffers fail because they:

- evaluate too little context: `dateutil`'s parser, for example, only evaluates a single value and will always say that `1.2.2012` is `mm.dd.yyyy`, even though the next value might be `29.2.2012` (I've got plenty of tests with that problem).

Here's an ugly example:
`Birthdate, (Family\nnames), Father, Mother, Child, known for\n1879-4-14, Einstein, Hermann, Pauline, Albert,"\nGeneral relativity,\nSpecial relativity,\nPhotoelectric effect"`
That should be parsed (after handling the text escapes) as:

Birthdate | (Family names) | Father | Mother | Child | known for |
---|---|---|---|---|---|
1879-4-14 | Einstein | Hermann | Pauline | Albert | General relativity, Special relativity, Photoelectric effect |
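The `dateutil` ambiguity described above is really a column-level problem: a single value cannot pin down the format, but more values can rule formats out. A stdlib-only sketch of column-wide format elimination (the two format strings are illustrative, not an exhaustive set):

```python
from datetime import datetime

def infer_date_format(values, candidates=("%m.%d.%Y", "%d.%m.%Y")):
    """Keep only the formats that parse *every* value in the column.

    '1.2.2012' alone is ambiguous, but '29.2.2012' later in the same
    column rules out mm.dd.yyyy -- so inference must be column-wide,
    not per-value."""
    surviving = []
    for fmt in candidates:
        try:
            for v in values:
                datetime.strptime(v, fmt)
        except ValueError:
            continue
        surviving.append(fmt)
    return surviving

print(infer_date_format(["1.2.2012"]))               # both formats survive: ambiguous
print(infer_date_format(["1.2.2012", "29.2.2012"]))  # only '%d.%m.%Y' survives
```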
### Describe the solution you'd like
I would like to add a non-breaking package to pandas, so that pandas' `read_csv` can receive the keyword `analyse` and ignore the default keywords.
When `pandas.read_csv(somepath, analyse=True)` is called, the keyword will trigger a function at the top of the file reader that will determine the appropriate parser settings.

If the function is external to pandas, it could be used as:
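A sketch of what that external hook could look like. `csv_analyze` is hypothetical; this toy version only sniffs the delimiter with the stdlib `csv.Sniffer`, whereas the proposed analyzer would cover all the cases listed under "Additional context":

```python
import csv

def csv_analyze(path, sample_bytes=64_000):
    """Hypothetical analyzer: inspect a sample of the file and return a
    dict of pandas.read_csv keyword arguments. This toy version only
    sniffs the delimiter; a real analyzer would also detect encoding,
    headers, date formats, decimal separators, etc."""
    with open(path, newline="") as f:
        dialect = csv.Sniffer().sniff(f.read(sample_bytes))
    return {"sep": dialect.delimiter}

# usage, keeping the analyzer as a separately maintained package:
#   import pandas as pd
#   df = pd.read_csv(somepath, **csv_analyze(somepath))
```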
I can imagine additional output that pandas could react to, such as more detailed information from the analysis:
### API breaking implications
An additional keyword will probably not break much; `read_csv` already has 52 keywords.

### Describe alternatives you've considered
Create an external package and give pandas the option to hook into it:
Keeping the package external will be a pain point if there is a disconnect between the `pd.read_csv` kwargs and the dict that `csv_analyze` returns.

### Additional context
The strings I've encountered in csv data are:
SEPARATORS

- `,###.#####` : the last non-digit character indicates the decimal separator; the preceding, different characters are thousand separators.
- `.###,` : the reverse convention.
Examples:
(full treatment at: https://en.wikipedia.org/wiki/Decimal_separator)
However, notice that Hindi uses a 2-digit grouping, except for the 3-digit grouping for denoting hundreds: 12,34,56,789.00
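The "last non-digit character" rule above can be sketched as follows. It also copes with the Hindi 2-digit grouping, since all earlier separators are simply treated as grouping; the caveat (noted in the code) is that a token with a single separator remains ambiguous:

```python
import re

def split_number(token):
    """Last non-digit character = decimal separator; any earlier
    non-digits = grouping separators. Returns (integer_digits, fraction).
    Caveat: a token with a single separator ('1,300') stays ambiguous."""
    nondigits = [m.start() for m in re.finditer(r"\D", token)]
    if not nondigits:
        return token, ""
    last = nondigits[-1]
    return re.sub(r"\D", "", token[:last]), token[last + 1:]

print(split_number("1,300.00"))         # ('1300', '00')
print(split_number("1.300,00"))         # ('1300', '00')
print(split_number("12,34,56,789.00"))  # ('123456789', '00') -- Hindi grouping
```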
SCIENTIFIC NOTATION

- `###E###` / `###e###` : integer before and after the E
- `.####E###` / `.####e###` : floating point before, integer after the E
- `.###N` : a float of <=3 digits, followed by N belonging to (K, M, G/B, T, E, P) for kilo, mega, giga/billion, tera, exa, peta, ...
POSITIVE/NEGATIVE

Negative numbers can have a trailing minus (e.g. `100-`).
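A minimal sketch of handling the trailing minus: move it to the front so that ordinary numeric parsing applies.

```python
def normalize_sign(token):
    """Rewrite a trailing minus ('100-') as a leading one ('-100')."""
    token = token.strip()
    return "-" + token[:-1] if token.endswith("-") else token

print(normalize_sign("100-"))             # '-100'
print(float(normalize_sign("1300.00-")))  # -1300.0
```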
ADDITIONAL SIGNS

The same applies for percentages: `98%`, `98 %`, `98 pct`, `%98`. And for currencies: `$100.00`, `kr100,00`.
NON-LATIN NUMBERS

NUMBER FORMATTING

Script | Digits used |
---|---|
Latin | 0 1 2 3 4 5 6 7 8 9 |
Arabic | ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩ |
Chinese / Japanese | 〇 一 二 三 四 五 六 七 八 九 十… |
Hebrew | א ב ג ד ה ו ז ח ט… |
Korean (Sino-Korean) | 일 이 삼 사 오 육 칠 팔 구… |
Korean (native) | 하나 둘 셋 넷 다섯 여섯 일곱 여덟 아홉… |
Bengali | ০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯ |
Devanagari (used to write Hindi, Marathi, and other languages) | ० १ २ ३ ४ ५ ६ ७ ८ ९ |
Gujarati | ૦ ૧ ૨ ૩ ૪ ૫ ૬ ૭ ૮ ૯ |
Gurmukhi (one of the scripts used to write Punjabi) | ੦ ੧ ੨ ੩ ੪ ੫ ੬ ੭ ੮ ੯ |
Kannada | ೦ ೧ ೨ ೩ ೪ ೫ ೬ ೭ ೮ ೯ |
Malayalam | ൦ ൧ ൨ ൩ ൪ ൫ ൬ ൭ ൮ ൯ |
Odia | ୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯ |
Tamil | ௦ ௧ ௨ ௩ ௪ ௫ ௬ ௭ ௮ ௯ |
Telugu | ౦ ౧ ౨ ౩ ౪ ౫ ౬ ౭ ౮ ౯ |
Thai | ๐ ๑ ๒ ๓ ๔ ๕ ๖ ๗ ๘ ๙ |
Tibetan | ༠ ༡ ༢ ༣ ༤ ༥ ༦ ༧ ༨ ༩ |

Korean regularly uses both a Sino-Korean system and a native Korean system; everything that can be counted will use one of the two systems, but seldom both.
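Python's `unicodedata` can map all of the positional decimal scripts in the table (Unicode category Nd) onto ASCII digits, which suggests one normalization pass before numeric parsing. A sketch:

```python
import unicodedata

def to_ascii_digits(s):
    """Replace every Unicode decimal digit (category Nd) with its ASCII
    equivalent; separators, signs and other text pass through unchanged.
    (Does not cover non-positional systems such as Chinese 十, 百, ...)"""
    return "".join(
        str(unicodedata.digit(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in s
    )

print(to_ascii_digits("١٢٣.٤٥"))  # Arabic-Indic -> '123.45'
print(to_ascii_digits("๑๒๓"))     # Thai -> '123'
```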
FLOATING POINT

The floating-point precision issue (which in particular haunts anyone who reads long barcodes that aren't imported as integers):
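The barcode problem can be demonstrated directly: every IEEE double at or above 2**53 is an even integer, so any odd 16-digit code silently changes value when parsed as a float. The barcode below is an illustrative value, not from the source data:

```python
barcode = "9306954242543203"  # 16 digits, odd, > 2**53 (illustrative value)

as_float = float(barcode)
# doubles this large are spaced 2 apart, so the odd value cannot survive:
print(int(as_float) == int(barcode))  # False
```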
NOT A NUMBER
Similar variation appears in datetime locale:
And so for time:
TEXT ESCAPE

Finally, we also see newline characters in headers, which the csv reader cannot deal with. To detect the correct format of the example below, multiple lines have to be read and the interplay between newlines and separators resolved. These will also have to be text- and bracket-escaped.

`Birthdate, (Family\nnames), Father, Mother, Child, known for\n1879-4-14, Einstein, Hermann, Pauline, Albert,"\nGeneral relativity,\nSpecial relativity,\nPhotoelectric effect"`

(See the "ugly example" table in the introduction.)
LINEBREAK

Very old files can have line breaks `\r`, `\r\n`, or `\n`.
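For the line-break variants, Python's `str.splitlines` (and the universal-newlines mode of `open`) already treats `\r`, `\r\n` and `\n` uniformly, so an analyzer only has to avoid hard-coding a single convention. A minimal illustration:

```python
raw = "a;b\r1;2\r\n3;4\n"  # all three historical line-break styles mixed

# splitlines() handles \r, \r\n and \n uniformly:
print(raw.splitlines())  # ['a;b', '1;2', '3;4']
```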