Added column-filter functionality

reenberg commented 7 years ago

<tl;dr> This adds the feature to optionally filter out entire columns and rows based on column data in a .csv file, if you only wan't to create a small table based on a large .csv file.

If the column_filter is not specified or is an empty list, then the table is not modified. Else the raw_table_list is filtered based on the values in the column_filter (i.e., column indexes not specified in the filter is removed).

Each element in the column_filter list must be an integer or a dictionary with at least the key 'col'.

Specifying an integer in the column_filter list makes sure that column index is kept (first column is index 0 -- python list indexing).

Specifying a dictionary, gives the optional possibility of specifying the following keys in the dictionary (note: the keys are mutually exclusive and specifying more than one has undefined behaviour).

filter: filters out (removes) the row, if the content inside this column doesn't match (exact string matching of the value of this key and the content of the cell). The value may be a list of strings to be matched.
regex: filters out (removes) the row, if the content inside this column doesn't match. The value of this key is placed directly into re.match(pattern, string) as the pattern and the cell value as the string. Note: Currently we assume that a small amount of regex's is used, such that we don't have to deal with compiling of regex's, but rely on the built in caching to handle it for us.

Example: This example won't filter out any column, but it demonstrates the three different ways that you may specify a column-filter. Just try and make changes to either one of them, and see how either columns or rows will be filtered from the resulting table.

``` {.table}
---
caption: "*Bar* table"
markdown: yes
column-filter:
    - 0
    - col: 1
      regex: ".*B|[\\d]"
    - col: 2
      filter: ['C', '3']
---
A,B,C
1,2,3

ickc commented 7 years ago

I think we need more discussion on this for the syntax of this. You might try to ask people in Markdown, tables and CSV - Google Groups to see if there's any suggestions there, and/or open an issue here (I'll open one soon). For now I'll put this pull request on hold.

ickc commented 7 years ago

And remember to include tests in pull requests. There's 2 kinds of tests here, one is Python unit test that calls the functions and compare the results. Another is to run pandoc directly and see if the output native AST is the same as a predefined one (usually generated automatically and just eyeballing to verify it's doing what it's supposed to do).

sergiocorreia commented 7 years ago

I think we need more discussion on this for the syntax of this

I also agree. Ideally, you want a solution that is both general and simple to implement. For instance, allowing lambdas that will be eval()uated at runtime

ickc commented 7 years ago

@reenberg, can you briefly describe what you want to accomplished exactly? i.e. let's forget about syntax and how to do it for the moment, but gives some small, before & after example on what you want to do. In particular, how you would want the regex to behave.

e.g. the simplest kind of filter will be 1,2,3,..... filtered to 1,2 only, extracting only the first 2 columns.

reenberg commented 7 years ago

My current issue is that I'm writing a document, where i have a spreadsheet of events.

This actually started out as a .csv file that i edited with a spreadsheet editor, but it has now evolved such that i found the need for using formulas (time calculations, column concatenation) and conditional formatting (to easily show groups of rows, etc) and thus it is now a .ods document, that I export to .csv.

The .csv file describes all the event data, such as type, start, end, various kinds of descriptions, work loads, etc. I use this information to generate various pieces of information in my main document. One example is a table of specific event types and some of their descriptions.

Thus my need is specifically to be able to filter only some of the columns (e.g., 1,3,4,6,7) and then I also need to filter the rows, such that I only get the rows concerning the specific event types.

This has previously been delt with by some nasty LaTeX macros, that I just couldn't bother maintainer any more.

My initial implementation with the 'filter' and 'regex' properties, was just what came to my mind when coding it. However specifically I'm using the regex right now to easily filter out 'event', 'event2' and 'event3'. I use suffix numbering of the event type to have the events in different colours when generating some of the other overviews (think something like graphs)

alerque commented 5 years ago

I'm accomplishing something similar using CSVKit, specifically csvcut to get just the columns I want in a preprocessing step before dumping the results into the markdown. There are quite a few tools with similar filtering capabilities including Python based ones. In general I think this workflow is better, I would be skeptical of putting a bunch of active code in the content of my data and would be skeptical of Pantable if it was trying to be a full fleged data manipulation tool rather than just a format conversion tool.

@reenberg Why do you think this should be implemented in Pantable itself?

ickc commented 5 years ago

The “pandoc way” to accomplish a task like this, without over bloating a filter, is to have another filter processing the filtering of the csv before pantable (ie piping a filter before pantable.) But inevitably this other filter before pantable has to be designed for pantable (eg which class to use.) So it is not strictly composable (ie not entirely independent of pantable.) So this hypothetical other filter is more like a pantable plugin, and hence may be why people want to put them together.

I think the solution you mention has to go through the shell (eg to me I’d use a makefile with an intermediate file chaining them together.) A solution like this is not universal. Also, a build like this makes the document less reproducible (in the sense that more details in how the document is built is needed.)

reenberg commented 5 years ago

Its always nice that someone cares, even if its just shy of 2 years since I left a reply to your comments @ickc.

To be honest, I don't think that I knew about CSVkit back then. And I guess I just fell victim of the classical "everything looks like a nail, when you have a hammer". I Don't think that my proposed changes does anything more than what can be achieved with a good combination of csvcut and csvgrep. So with that in mind this can ble discarded.

However I remember thinking that it was cleaner not having to setup an "elaborate build pipeline"/ Makefile to carve out all the intermediate files that I needed back then. It felt more smooth having the people writing the document and invoking pantable being able to just specify what data they needed from the csv file inside markdown. Not everyone is comfortable piping unix tools.

ickc / pantable

Added column-filter functionality #16