ickc / pantable

CSV Tables in Markdown: Pandoc Filter for CSV Tables
https://ickc.github.io/pantable/
BSD 3-Clause "New" or "Revised" License

Use another CSV parser? #21

Closed ickc closed 7 years ago

ickc commented 7 years ago

It is known that unicode support in CSV in Python 2 is tricky (see the documentation of the csv module, and #14). This was originally written in Python 3 (back then panflute only supported Python 3, before I ported it to be compatible with Python 2 as well), and it uses the same trick applied in panflute to support Python 2. The original thought was that partial Python 2 support (without unicode in CSV) is better than no support at all. But then, unavoidably, people do use this with Python 2 and unicode.
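
For context, the usual workaround from the Python 2 csv documentation is to encode to UTF-8 before parsing and decode each field afterwards. A minimal sketch, not necessarily pantable's exact trick:

```python
# Python 2 sketch: csv.reader handles byte strings only, so encode the
# unicode input to UTF-8 and decode every parsed field back to unicode.
import csv


def unicode_csv_reader(unicode_lines, **kwargs):
    utf8_lines = (line.encode('utf-8') for line in unicode_lines)
    for row in csv.reader(utf8_lines, **kwargs):
        yield [cell.decode('utf-8') for cell in row]
```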

#14 proposed a fix that could solve the unicode problem. However, for various reasons, an alternative CSV parser is being considered:

  1. It seems that the csv module in Python 2 and 3 behaves slightly differently. The last thing I want is Python 2 and 3 users seeing different behavior, making this package less maintainable (frankly, I don't want to deal with differences between Python 2 and 3...).

  2. As in #16 and #17, people would like to extend pantable's functionality to filter subcells from the CSV input. This feature might make the efficiency of the CSV parser more critical: if there's no filtering capability, we can reasonably assume the table size is small (for LaTeX, constrained by pages; for others like HTML, at least not too big to be rendered by a browser efficiently). But with filtering, the source CSV can be arbitrarily large, with only a small subset of the table cells selected by the filter.

  3. When using a CSV parser not from the standard library, we need to either deal with that extra dependency or use a conditional import, making end users make the choice (and do the installing); a sketch of the latter follows this list.
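
For concreteness, the conditional import mentioned in point 3 would look something like this (fastcsv is just one of the candidates listed below, and I'm assuming, for illustration, a csv-compatible API, which would need checking):

```python
# Conditional-import sketch: prefer an optional third-party parser and
# fall back to the stdlib; assumes the third-party module is API-compatible.
try:
    import fastcsv as csv  # optional, user-installed dependency
except ImportError:
    import csv  # standard-library fallback
```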

Criteria, based on the above reasons and other concerns:

  1. uniform Python 2/3 behavior
  2. unicode support
  3. high efficiency
  4. try to avoid conditional imports so that I don't need to deal with different behaviors from different CSV parsers (the Python 2 & 3 csv modules, plus the conditionally imported one)
  5. try to keep the dependency small and easy to install

Potential choices are:

  1. unicodecsv
  2. fastcsv
  3. numpy csv parser
  4. pandas csv parser

I like the pandas CSV parser since it is well known to be very fast, and I want some of pandas' capability to generate plots from tables. But it needs to be compiled, and alternative CPU architectures might or might not be supported (at least there are no pre-built binaries).

reenberg commented 7 years ago

1) Didn't backports.csv completely remove this difference? At least as far as I can see from my implementation of it.

I can easily follow you on the point that you don't want to deal with special imports for py2.

2) Is performance actually that bad? I know I wrote that in the initial comment about backports.csv, but that was purely from a standpoint where the author of that package claimed poor performance.

3) Well, since this is a package dealing with CSV, I would say that it is only fair that you depend on a third-party library to do the CSV stuff, in an efficient way.

Is the built-in csv module for py3 also entirely written in Python? aka slow?

As far as I can see, you need quite a few extra packages in order to get speed out of pandas? -- defeating 1), and not to mention the whopping 22 MB for the pandas package. NumPy is not a small thing either (16 MB -- I don't know if you can get just the CSV parser by itself?).

ickc commented 7 years ago

Note: pandas' CSV parser will complain about irregular row sizes. NumPy's only likes floats.
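
A minimal illustration of both limitations (the sample data and error classes are my assumptions, not pantable code):

```python
# Sketch of the limitations noted above (sample data is made up).
import io

import numpy as np
import pandas as pd

ragged = u"a,b\n1,2\n3,4,5\n"  # second data row has an extra field
try:
    pd.read_csv(io.StringIO(ragged))
except pd.errors.ParserError as e:  # error class as of pandas >= 0.20
    print(e)  # pandas rejects rows whose field count differs

try:
    np.loadtxt(io.StringIO(u"x,hello\n1,world\n"), delimiter=",")
except ValueError as e:
    print(e)  # loadtxt assumes homogeneous numeric data by default
```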

ickc commented 7 years ago

> 1) Didn't backports.csv completely remove this difference? At least as far as I can see from my implementation of it.
>
> I can easily follow you on the point that you don't want to deal with special imports for py2.

To remind myself: see https://github.com/ickc/pantable/issues/14#issuecomment-319521689.

When I wrote this, I completely forgot about backports.csv. Seems promising. Especially if you can get setup.py to depend on it conditionally, only on py2, that'd be great. Otherwise, I think it's still OK.
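
For reference, one way to express a py2-only dependency is a PEP 508 environment marker in setup.py (a sketch; it needs a reasonably recent setuptools/pip, and may not be the exact form adopted here):

```python
# setup.py excerpt -- a sketch, not necessarily what pantable adopted
from setuptools import setup

setup(
    name='pantable',
    # ...
    install_requires=[
        'panflute',
        'backports.csv; python_version < "3"',  # installed on py2 only
    ],
)
```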

> 2) Is performance actually that bad? I know I wrote that in the initial comment about backports.csv, but that was purely from a standpoint where the author of that package claimed poor performance.
>
> 3) Well, since this is a package dealing with CSV, I would say that it is only fair that you depend on a third-party library to do the CSV stuff, in an efficient way.
>
> Is the built-in csv module for py3 also entirely written in Python? aka slow?
>
> As far as I can see, you need quite a few extra packages in order to get speed out of pandas? -- defeating 1), and not to mention the whopping 22 MB for the pandas package. NumPy is not a small thing either (16 MB -- I don't know if you can get just the CSV parser by itself?).

From the source code (cpython/csv.py at master · python/cpython), it is pure Python.

But on second thought, maybe we should only think about the performance issue when it really becomes a big problem. E.g., currently pantable2csv is very slow, because for each cell pandoc is called to convert the AST back to markdown, and there's no obvious way to speed this up (apart from multiprocessing), because the overhead is in starting a pandoc subprocess. (Changing this situation would be highly non-trivial, I suppose. From what I think I know, all existing Python libraries for calling pandoc actually invoke pandoc as a CLI tool, not at the "library level". There are existing C bindings for pandoc, and presumably a proper "py-pandoc" should use those together with the Python C API.)
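
To spell out the bottleneck, here is a sketch under the assumption that cells are converted via panflute's convert_text; the multiprocessing variant is only an illustration, not pantable's actual code:

```python
# Each call to panflute's convert_text shells out to the pandoc CLI, so
# converting a table cell-by-cell pays one subprocess startup per cell.
from multiprocessing import Pool

from panflute import convert_text


def cell_to_markdown(cell_blocks):
    # one pandoc subprocess per call -- this startup cost dominates
    return convert_text(cell_blocks,
                        input_format='panflute',
                        output_format='markdown')


def cells_to_markdown(cells, processes=4):
    # multiprocessing only runs the subprocesses in parallel; it cannot
    # remove the per-cell overhead (and assumes the cells pickle cleanly)
    pool = Pool(processes)
    try:
        return pool.map(cell_to_markdown, cells)
    finally:
        pool.close()
        pool.join()
```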

After all, pandoc is not known for performance, and personally I don't much care about the time it takes to generate a document, for various reasons. (For one, I need PDF from LaTeX output, and LaTeX speed becomes the ultimate bottleneck, much slower than pandoc, panflute, or pantable.)

By the way, when you mention the size, is that the installed size or the source-code size? I knew that the pandoc community seems sensitive to dependency size, but personally I often have big packages, so anything on the order of ~10 MB seems very small to me. I worry more about installation problems, though, since compiling means certain requirements on the environment (e.g. OS, CPU architecture, presence of compilers, etc.).

ickc commented 7 years ago

@reenberg, I'm thinking we should give backports.csv a go, if it works well for you and passes all tests. It'd be great if you can also add a test case that has unicode in it (I haven't, because I knew the Python 2 build would fail it. I should have mentioned this limitation in the documentation. And I thought I did...)

sergiocorreia commented 7 years ago

> Is the built-in csv module for py3 also entirely written in Python? aka slow?

Just to chime in: I've imported GBs of data with the csv module and AFAIK it's anything but slow. Waiting for IO is the bottleneck, and anyways I doubt that anyone is writing tables that large.
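
For a rough sanity check of that claim, one can time the stdlib reader on synthetic in-memory data (the data and any resulting numbers here are assumptions, machine-dependent):

```python
# Quick throughput check for the stdlib csv reader on in-memory data
# (no disk IO involved, so this isolates the parsing cost).
import csv
import io
import time

buf = io.StringIO(u"spam,egg,ham\n" * 10**6)  # ~13 MB, a million rows
start = time.time()
n_rows = sum(1 for _ in csv.reader(buf))
print(n_rows, 'rows in', time.time() - start, 'seconds')
```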

ickc commented 7 years ago

@sergiocorreia, the original concern is that once filtering is allowed, there's no upper limit on the size of the CSV, and I know there can be very big CSVs in data science. In that case, it is well known that pandas does it much faster.

But then I discovered other limitations of pandas' CSV parser that won't be ideal for pantable. Also, taking into account that performance is not too critical here, we should only worry about it when it becomes a real problem.

Also note that the more important performance criterion in this case is actually memory use, not CPU time. I haven't profiled it yet, but I guess Python's implementation will waste a lot of memory, because low memory usage is very hard to achieve.

But working on something related to pandoc (and LaTeX) means these are not very important, because pandoc (and LaTeX) themselves are already a bottleneck in some way. (pandoc's AST model, perhaps also because of its design and use of Haskell, makes high memory use unavoidable.)

reenberg commented 7 years ago

@ickc

> By the way, when you mention the size, is that the installed size or the source-code size? I knew that the pandoc community seems sensitive to dependency size, but personally I often have big packages, so anything on the order of ~10 MB seems very small to me. I worry more about installation problems, though, since compiling means certain requirements on the environment (e.g. OS, CPU architecture, presence of compilers, etc.).

I was referring to the download size of the wheel. Checking the actual size on disk is way worse:

```
$ du -hs /tmp/venv2/lib/python2.7/site-packages/pandas
97M     /tmp/venv2/lib/python2.7/site-packages/pandas
```

I just gathered from your general comments that you want to try to keep a low footprint if at all possible.

reenberg commented 7 years ago

I would leave it as is (with unicode support) until someone reports performance as being a real problem.

ickc commented 7 years ago

From my earlier comments, pandas' and SciPy/NumPy's CSV parsers will not be suitable here due to their more restrictive requirements.

Other lightweight parsers might be OK, but for now backports.csv seems to do the job of unifying CSV parser behavior between py2 and py3, which is one of the primary concerns here.

Regarding performance, I think memory concerns are more important than speed. (If the CSV is in the markdown source, it won't be too big. If it is an external file, it will likely be IO bound. But no bound on CSV file size means there's no bound on memory consumption, and Python probably makes this worse.) And I have another method to avoid this issue when filtering is needed (see #17).
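
To make the memory point concrete: the stdlib reader is already a streaming iterator, so a filter can run in memory proportional to the rows kept, not the input size (a sketch; the actual filtering design is discussed in #17):

```python
# Streaming sketch: iterate the CSV row by row; memory use scales with
# the rows kept by the filter, not with the size of the input file.
# (py3 semantics; on py2 the unicode file object is where backports.csv helps)
import csv
import io


def filter_rows(path, predicate):
    with io.open(path, encoding='utf-8', newline='') as f:
        return [row for row in csv.reader(f) if predicate(row)]
```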

ickc commented 7 years ago

From some of the notes above, a few CSV parsers were eliminated because of their extra restrictions.

@reenberg suggested a package, and that package suggested using backports.csv, which is a backport of py3's csv module to py2. @reenberg added the use of backports.csv in the py2 build only (i.e. no extra dependency for py3 users), and it seems to run fine. Hopefully that solves the problems caused by the differences between py2's and py3's csv modules, as seen in #11, #13, #14, and #21.