pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

REF: consolidate in CSV parser module #39345

Open arw2019 opened 3 years ago

arw2019 commented 3 years ago

Follow-up to #38930, mentioned in #39217 and #38370

We'd like to consolidate the parser classes (C, Python and pyarrow). One way to go would be to move the logic common to all parsers into ParserBase and delegate parser-specific logic to methods on the subclasses that ParserBase calls.
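To make the idea concrete, here is a minimal sketch of that layout. All class and method names below (`ParserBaseSketch`, `_read_rows`, `_split_header`, and the two toy engines) are hypothetical and chosen for illustration only; this is not the actual pandas code, just the general shape of a base class that owns the shared flow and calls engine-specific hooks.

```python
from abc import ABC, abstractmethod
import csv
import io


class ParserBaseSketch(ABC):
    """Hypothetical base class holding the logic shared by every engine."""

    def __init__(self, text, names=None):
        self.text = text
        self.names = names

    def read(self):
        # Shared driver: the base class owns the overall flow and
        # delegates only the engine-specific step to the subclass.
        rows = self._read_rows()
        header, body = self._split_header(rows)
        return {name: [row[i] for row in body] for i, name in enumerate(header)}

    def _split_header(self, rows):
        # Common logic that would otherwise be duplicated per engine.
        if self.names is not None:
            return list(self.names), rows
        return rows[0], rows[1:]

    @abstractmethod
    def _read_rows(self):
        """Engine-specific tokenizing; must return a list of row lists."""


class CsvModuleEngineSketch(ParserBaseSketch):
    """Toy engine backed by the standard library csv module."""

    def _read_rows(self):
        return list(csv.reader(io.StringIO(self.text)))


class SplitEngineSketch(ParserBaseSketch):
    """Toy engine that tokenizes with str.split."""

    def _read_rows(self):
        return [line.split(",") for line in self.text.splitlines() if line]
```

Both toy engines share `read` and `_split_header` and differ only in `_read_rows`, which is the kind of split between common and parser-specific logic described above.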

IMO this is best done after the initial pyarrow engine PR (#38370) is merged (so as to kill all the birds with one stone) but ofc completely up to whoever is doing the work. Creating this issue to track the discussion(s) in either case.

cc @jreback @phofl

phofl commented 3 years ago

Thanks for opening. I wanted to start with a few smaller steps, working on single functions. If this has a big impact on the pyarrow case, we could wait, of course.

jreback commented 3 years ago

we should do this now before the pyarrow change (which certainly can inspire)

phofl commented 3 years ago

@arw2019 I came back to this today and looked into it a bit more. Since most of the C-stuff lives in the cdef class TextReader, we cannot directly share it with PythonParser. Have you already put any thought into how to reorganize things to share more of the duplicated code? For example, building an equivalent TextReader class for the Python parser that handles most of the work for PythonParser, the way TextReader does on the C side?
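A rough sketch of what I mean, assuming both low-level readers could expose the same small interface. Everything here is hypothetical (`PyTextReaderSketch`, `ParserWrapperSketch`, and the `read(nrows)` contract are made up for illustration and are not the real TextReader API): the pure-Python reader does the tokenizing, and the wrapper on top only depends on the shared interface.

```python
class PyTextReaderSketch:
    """Hypothetical pure-Python counterpart to a compiled low-level reader."""

    def __init__(self, text, delimiter=","):
        self._lines = text.splitlines()
        self._delimiter = delimiter
        self._pos = 0  # index of the next unread line

    def read(self, nrows=None):
        # Return up to nrows tokenized rows, or all remaining rows if None.
        if nrows is None:
            end = len(self._lines)
        else:
            end = min(self._pos + nrows, len(self._lines))
        chunk = [line.split(self._delimiter) for line in self._lines[self._pos:end]]
        self._pos = end
        return chunk


class ParserWrapperSketch:
    """Engine-agnostic wrapper: works with any reader exposing read(nrows)."""

    def __init__(self, reader):
        self._reader = reader
        header = reader.read(1)  # the first row is treated as the header
        self.columns = header[0] if header else []

    def read(self, nrows=None):
        rows = self._reader.read(nrows)
        return {name: [row[i] for row in rows] for i, name in enumerate(self.columns)}
```

With this shape, the wrapper code would not need to know which reader it was given, so the duplicated logic could live in one place as long as both readers honor the same contract.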