Filtering Subcells of CSV

ickc commented 7 years ago

See #16

I've seen similar feature requests for other pandoc filters that deal with CSV, but I haven't spent enough time thinking about a good syntax for it (and since I don't personally need this feature yet, I haven't given it priority). And in the example given, the syntax is not very intuitive.

Although pantable is a third-party pandoc filter rather than native pandoc syntax (because @jgm doesn't think the CSV format is markdown-ish enough, i.e. readable as plain text), I still want it to be as markdown-ish as possible.

ickc commented 7 years ago

@reenberg, a few more thoughts:

If your project is large and already uses a build system, you could take advantage of external CSV files (those specified with a path in the markdown) and write a separate Python script that creates the subcell CSV files, which are then included in the markdown through the path. The advantage is that pantable can remain simple and leave the unlimited configurability to an independent script (i.e. each tool does one thing and does it well).
Alternatively, there will always be some advanced functionality that could be supported but whose syntax might be too complicated. We could hide such functionality under a key so that normal users are not confused by the syntax. Furthermore, these advanced features could be marked as experimental to allow faster prototyping without committing to them.
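
A minimal sketch of the first suggestion (a separate script run by the build system), assuming the source CSV has a header row. The function name `write_subtable` and its parameters are hypothetical and not part of pantable; it only uses the standard-library `csv` module.

```python
import csv


def write_subtable(src, dst, columns, row_filter=None):
    """Copy the named columns (and optionally only some rows) of src into dst.

    `columns` is a list of header names to keep, in output order.
    `row_filter` is an optional callable that takes a row dict and
    returns True for rows that should be kept.
    """
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        # extrasaction="ignore" silently drops the columns we did not ask for
        writer = csv.DictWriter(fout, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            if row_filter is None or row_filter(row):
                writer.writerow(row)
```

A build rule could call, say, `write_subtable("full.csv", "sub.csv", ["name", "price"])` before pandoc runs, and the markdown would then reference `sub.csv` by path.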

I have another feature in mind (generating plots from CSV) that shares the same problem as this one (i.e. the syntax seems too complicated and more like programming than markdown-ish).

ickc commented 7 years ago

@reenberg

Another reason this should be treated specially is that pantable and pantable2csv are designed to be "invertible" (or, strictly speaking, idempotent insofar as pandoc is idempotent). If you have a subcells filter, then by definition it is not invertible.

I now have a better idea of how to group this: under an "extension" key, with a module named extension in which all such functions reside. These functions would then manipulate the CSV in a certain way (i.e. these functions are filters of the CSV). In this way, the regex filter and the "column-selection" filter would be two separate extension/filter functions.
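
As a rough sketch of what two such separate filter functions could look like, assuming the table is held as a list of rows (each a list of strings) with the header in the first row. All names here (`select_columns`, `match_rows`, `apply_extensions`) are hypothetical and do not exist in pantable.

```python
import re


def select_columns(rows, indices):
    """Keep only the columns at the given zero-based indices."""
    return [[row[i] for i in indices] for row in rows]


def match_rows(rows, pattern, column=0, header=True):
    """Keep body rows whose cell in `column` matches the regex `pattern`."""
    regex = re.compile(pattern)
    # The header row, if any, is always passed through unfiltered.
    head, body = (rows[:1], rows[1:]) if header else ([], rows)
    return head + [row for row in body if regex.search(row[column])]


def apply_extensions(rows, extensions):
    """Apply a sequence of (function, keyword-arguments) pairs in order."""
    for func, kwargs in extensions:
        rows = func(rows, **kwargs)
    return rows
```

Keeping each filter a plain function of `rows` means the list under the "extension" key could map directly onto the `extensions` argument, one entry per filter.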

Refactoring the original code is necessary, though. It was originally designed as a single-file filter, but it has become sophisticated enough that a better organization is necessary. See #9

reenberg commented 7 years ago

@ickc, I see your point about the reversibility of pantable and pantable2csv. I just thought that the reverse was a pure gimmick, or at least I didn't see any practical use cases for it as a filter.

It produces a .csv file as a side effect of running pandoc -F pantable2csv input.md on a markdown file containing a grid table?

It is true that I could have some part of my build system generate the desired files as intermediates. However, I was not aware of any tool that did what I needed, and adding this feature to the filter myself seemed simpler than writing the boilerplate code for a separate script or something else.
And I especially liked the idea that I could see, in the yaml block, which filtering I was doing instead of having this hidden away in some Makefile or some other custom script for each generated file. This gives me the overview when writing/editing my document. That feels markdown-ish to me.

I like the idea of extensions, however it seems to be a lot of work to implement (different hooking points, etc).

ickc commented 7 years ago

> I just thought that the reverse was a pure gimmick or at least I didn't see any practical use cases for it as a filter.

As a matter of fact without this feature pantable would be useless for me ;) I might have explained why this is useful in the README / pandoc-discuss.

I will probably explore the idea of extensions further here. Ideally an extension could be specified without being committed into pantable (just like how pandoc filters work independently of pandoc).

ickc commented 7 years ago

Also see #21.

ickc commented 7 years ago

The reason I referenced #21 is that one of the points there addresses the fact that, once this feature is added, performance will inevitably become a big problem, because there's no practical limit on the size of the CSV.

reenberg commented 7 years ago

I think the idea of extensions is a great idea, and I completely see the point of trying to keep the code base as clean and simple as possible.

However it seems like a big and messy thing to introduce, for just a few extensions?

Anyway, perhaps something like this could be of interest? https://github.com/tarekziade/extensions It is old and has apparently been moved to GitHub recently. I guess hooks could be created by creating "groups", under which extensions could register their callback functions (e.g., 'pantable.csv_read' or 'pantable.read_data', which could be called just before returning the raw_table_list in read_data). That actually seems to have very little impact on the pantable code base.
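
To illustrate the "groups" idea in plain Python, here is a minimal sketch. It is not the API of the linked extensions package; `register`, `run_hooks`, and `drop_empty_rows` are hypothetical names. Only the group name 'pantable.read_data' comes from the suggestion above.

```python
from collections import defaultdict

# Maps a group name to the callbacks registered under it (hypothetical registry).
_HOOKS = defaultdict(list)


def register(group):
    """Decorator that registers a callback under a named group."""
    def decorator(func):
        _HOOKS[group].append(func)
        return func
    return decorator


def run_hooks(group, table):
    """Pass the table through every callback registered under `group`."""
    for func in _HOOKS.get(group, ()):
        table = func(table)
    return table


@register("pantable.read_data")
def drop_empty_rows(table):
    """Example extension: remove rows whose cells are all blank."""
    return [row for row in table if any(cell.strip() for cell in row)]
```

Under this assumption, the only change inside read_data would be a single call such as `raw_table_list = run_hooks("pantable.read_data", raw_table_list)` just before returning.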

ickc commented 7 years ago

I will think about that a bit more.

To explain why extensions are a good idea, albeit seemingly overkill for now: I have a lot of things in mind that I want to do with pantable. pantable and pantable2csv are designed to make the CSV table "almost" a first-class table syntax in pandoc (pandoc offers 4 syntaxes for tables; there were discussions about adding CSV syntax as an official pandoc table extension, but it doesn't pass the "markdown-ish" criterion for @jgm, i.e. being readable as plain text). This design is nice, and it has been very useful as an intermediate format when I need to batch convert a bunch of docx files to md. Furthermore, its usefulness seems to have attracted quite a few users. (Just to add a few words: I designed it so that the CSV table syntax captures all possibilities offered by the pandoc AST, which, at the time it was written, all 4 pandoc table extensions fell short of. But after I spoke with @jgm about the matter, he quickly added syntax so that the grid table supports everything the AST supports.)

Anyway, pantable and pantable2csv are designed with these in mind, so more careful thought is given to the syntax and its behavior in order to mimic the pandoc experience (e.g. you can almost never trigger a Python exception, except perhaps with malformed YAML syntax; warning messages are given and it will try its best to proceed).

But there are other things, such as this, that should be given less "careful thought" in order to allow for innovation. If you think about it, this is how pandoc behaves. Official features are almost always well thought out but take a lot of time to be implemented. By allowing extensions/filters, a less rigorous process can be allowed while keeping the upstream clean.

ickc commented 7 years ago

In #21, I mentioned a solution to the memory consumption of arbitrarily large input CSV files.

The solution is simple: it should be lazily evaluated (using iterators rather than turning everything into lists).

But the downside is that either the code needs to be rewritten more or less completely (although it probably won't be too different?), or functionality like this cannot be (or at least would be very difficult to be) an external filter, because it has to happen while reading from the CSV.
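
A small sketch of what lazy evaluation could look like here, assuming the filtering is done with generators while the CSV is being read, so that only one row is held in memory at a time. The names `read_rows` and `match_rows` are hypothetical; this is not pantable's current code.

```python
import csv
import re


def read_rows(stream):
    """Yield CSV rows one at a time instead of building a list."""
    yield from csv.reader(stream)


def match_rows(rows, pattern, column=0):
    """Lazily yield the header, then only the rows matching `pattern`."""
    regex = re.compile(pattern)
    rows = iter(rows)
    header = next(rows, None)
    if header is None:  # empty input: nothing to yield
        return
    yield header
    for row in rows:
        if regex.search(row[column]):
            yield row
```

Because `match_rows` consumes an iterator and produces one, it can be chained directly onto `read_rows(open_file)`; rows that do not match are discarded as soon as they are read.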

ickc / pantable

Filtering Subcells of CSV #17