scrapinghub / exporters

Exporters is an extensible export pipeline library that supports filters, transforms, and several sources and destinations
BSD 3-Clause "New" or "Revised" License

[WIP] Multiple filters support #312

Closed bbotella closed 8 years ago

bbotella commented 8 years ago

We still lack composition. With this approach, we define some filters, and we can add a composition string to the options that would go like:

(filter_name1 AND filter_name2) OR filter_name3

We may need a custom interpreter to parse this composition string. Thoughts @eliasdorneles?
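For reference, a custom interpreter for a string like the one above does not have to be large. The following is only a sketch, assuming filter names are plain identifiers and that AND binds tighter than OR; the function names (`tokenize`, `parse`, `evaluate`) are hypothetical and not part of exporters.

```python
import re

# A token is a parenthesis or an identifier (filter name, AND, OR).
TOKEN_RE = re.compile(r"\s*(\(|\)|[A-Za-z_][A-Za-z0-9_]*)")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if not match:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text):
    """Turn a composition string into a nested tuple tree."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_or():
        node = parse_and()
        while peek() is not None and peek().upper() == "OR":
            advance()
            node = ("or", node, parse_and())
        return node

    def parse_and():
        node = parse_atom()
        while peek() is not None and peek().upper() == "AND":
            advance()
            node = ("and", node, parse_atom())
        return node

    def parse_atom():
        token = advance()
        if token == "(":
            node = parse_or()
            if advance() != ")":
                raise ValueError("missing closing parenthesis")
            return node
        if token is None or token == ")" or token.upper() in ("AND", "OR"):
            raise ValueError("expected a filter name, got %r" % (token,))
        return ("name", token)

    tree = parse_or()
    if peek() is not None:
        raise ValueError("unexpected token %r" % peek())
    return tree


def evaluate(tree, results):
    """Evaluate a parsed tree given a {filter_name: bool} mapping."""
    kind = tree[0]
    if kind == "name":
        return results[tree[1]]
    if kind == "and":
        return evaluate(tree[1], results) and evaluate(tree[2], results)
    return evaluate(tree[1], results) or evaluate(tree[2], results)
```

Parsing happens once; `evaluate` can then be called per record with the individual filter results.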

eliasdorneles commented 8 years ago

Hm, perhaps it would be nice to do some designing on the user interface first. :) I mean, how are we expecting people to use this?

It would be nice to arrive at something more or less usable (and preferably something that allowed people to copy'n paste filters).

So, here is one idea, using lists to represent the operations:

"filter": {
    "name": "exporters.filters.MultipleFilter",
    "options": {
        "filters": ["or",
            {"name": "exporters.filters.PythonexpFilter", "options": {...}},
            {"name": "exporters.filters.KeyValueFilter", "options": {...}},
            ["and",
                {"name": "exporters.filters.PythonexpFilter", "options": {...}},
                {"name": "exporters.filters.KeyValueFilter", "options": {...}}
            ]
        ]
    }
}

Here is another, using dicts:

"filter": {
    "name": "exporters.filters.MultipleFilter",
    "options": {
        "filters": {
          "or": [
            {"name": "exporters.filters.PythonexpFilter", "options": {...}},
            {"name": "exporters.filters.KeyValueFilter", "options": {...}},
            {"and": [
                {"name": "exporters.filters.PythonexpFilter", "options": {...}},
                {"name": "exporters.filters.KeyValueFilter", "options": {...}}
            ]},
            {"name": "exporters.filters.KeyValueRegexFilter", "options": {...}}
        ]}
    }
}

I prefer the syntax of this last one, but it will require more validation than the list version (because it can't allow a dict representing a filter to have both "and" and "or" in the same object).

The validation for lists is a bit simpler (if it's a list, the first element has to be a string). So, it's a bit more like "worse is better" (simpler implementation, less friendly syntax).
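To make the difference in validation effort concrete, here is a rough sketch of both checks. The function names are hypothetical, and the rules are only my reading of the two layouts above (a leaf is any dict with a "name"; an operator object may carry nothing but its single operator key).

```python
OPERATORS = ("and", "or")


def validate_list_form(node):
    """List layout: ["or", <child>, <child>, ...] with dict leaves."""
    if isinstance(node, dict):
        if "name" not in node:
            raise ValueError("a filter needs a 'name'")
        return
    if not isinstance(node, list) or not node:
        raise ValueError("expected a filter dict or a non-empty list")
    if node[0] not in OPERATORS:
        raise ValueError("the first element of a list must be 'and' or 'or'")
    if len(node) < 2:
        raise ValueError("an operator needs at least one operand")
    for child in node[1:]:
        validate_list_form(child)


def validate_dict_form(node):
    """Dict layout: {"or": [<child>, ...]} with dict leaves."""
    if not isinstance(node, dict):
        raise ValueError("expected a dict")
    operators = [key for key in node if key in OPERATORS]
    if not operators:
        if "name" not in node:
            raise ValueError("a filter needs a 'name'")
        return
    # Extra rule the list layout does not need: no mixing of operators,
    # and no operator sitting next to filter keys in the same object.
    if len(node) != 1:
        raise ValueError("an operator object must have exactly one key")
    children = node[operators[0]]
    if not isinstance(children, list) or not children:
        raise ValueError("an operator needs a non-empty list of operands")
    for child in children:
        validate_dict_form(child)
```

The dict version is not much longer, but it has one more failure mode to explain to users.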

eliasdorneles commented 8 years ago

I see I forgot to ask... what do you think? :)

bbotella commented 8 years ago

My idea was to support something like:

{
    'filters': {
        'filter1': {
            'name': 'exporters.filters.KeyValueFilter',
            'options': {
                'keys': [{'name': 'country_code', 'value': 'es', 'operator': 'contains'}]
            }
        },
        'filter2': {
            'name': 'exporters.filters.KeyValueFilter',
            'options': {
                'keys': [{'name': 'name', 'value': 'item1', 'operator': 'contains'}]
            }
        },
        'filter3': {
            'name': 'exporters.filters.KeyValueFilter',
            'options': {
                'keys': [{'name': 'name', 'value': 'item3', 'operator': 'contains'}]
            }
        }
    },
    'composition': '(filter1 and filter2) or filter3'
}

This would allow us to write clear, user-friendly compositions: for each record, we replace every filter name in the composition string with that filter's result and run the resulting string through the Python interpreter (it would be something like '(True and False) or True').
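A minimal sketch of that idea, under some assumptions: each named filter is represented by a plain callable (the real filter classes are not used here), `evaluate_composition` is a hypothetical helper, and the substituted string is checked against a whitelist before it reaches `eval`.

```python
import re

# After substitution the expression may contain only these pieces.
ALLOWED = re.compile(r"(?:\s|\(|\)|True|False|and|or|not)*")
NAME = re.compile(r"[A-Za-z_]\w*")
KEYWORDS = {"and", "or", "not"}


def evaluate_composition(composition, predicates, record):
    """Replace each filter name with its result for `record`, then eval."""

    def substitute(match):
        word = match.group(0)
        if word in KEYWORDS:
            return word
        # An unknown filter name raises KeyError here.
        return str(bool(predicates[word](record)))

    expression = NAME.sub(substitute, composition)
    if not ALLOWED.fullmatch(expression):
        raise ValueError("unsupported composition: %r" % composition)
    # The expression now looks like '(True and False) or True'.
    return eval(expression, {"__builtins__": {}}, {})


# Stand-ins for the three 'contains' filters in the example config.
predicates = {
    "filter1": lambda item: "es" in item["country_code"],
    "filter2": lambda item: "item1" in item["name"],
    "filter3": lambda item: "item3" in item["name"],
}
composition = "(filter1 and filter2) or filter3"
```

Note that the substitution and the `eval` both run once per record.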

eliasdorneles commented 8 years ago

Hmm, as a user, I would not like all that indirection (having to come up with names for filters and having to type them twice), and it doesn't look like this would make the implementation simpler.

I also don't see doing a string replacement and an evaluation for every record that passes through the filter as an advantage of that approach.

With any of the two options I mentioned, the filter would be "compiled" into a function in the constructor (Python code for the "and" and "or") and then calling the filter for each record would be simply that function call (no need for eval).
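Roughly what I have in mind, sketched for the dict layout. All names here are hypothetical: `build_leaf` stands for whatever turns a single filter config into a callable, and `MultipleFilterSketch` is not the real class.

```python
def compile_filter(spec, build_leaf):
    """Turn a nested and/or spec into one callable: record -> bool."""
    for operator, combine in (("and", all), ("or", any)):
        if operator in spec:
            children = [compile_filter(child, build_leaf)
                        for child in spec[operator]]
            # Bind children and combine now so each closure keeps its own.
            return lambda record, children=children, combine=combine: combine(
                child(record) for child in children)
    return build_leaf(spec)


class MultipleFilterSketch:
    def __init__(self, options, build_leaf):
        # All the structural work happens once, here.
        self._predicate = compile_filter(options["filters"], build_leaf)

    def matches(self, record):
        # Per record it is just a function call, no parsing and no eval.
        return self._predicate(record)


def build_contains_leaf(config):
    """Toy leaf builder, assumed for this example only."""
    key = config["options"]["key"]
    value = config["options"]["value"]
    return lambda record: value in record.get(key, "")


options = {
    "filters": {
        "or": [
            {"and": [
                {"name": "contains", "options": {"key": "country_code", "value": "es"}},
                {"name": "contains", "options": {"key": "name", "value": "item1"}},
            ]},
            {"name": "contains", "options": {"key": "name", "value": "item3"}},
        ]
    }
}
multiple_filter = MultipleFilterSketch(options, build_contains_leaf)
```

Since `all` and `any` consume a generator, the compiled function also short-circuits like the plain Python operators would.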

eliasdorneles commented 8 years ago

Btw, a good source of inspiration is MongoDB query filters. :)

tsrdatatech commented 8 years ago

@bbotella @eliasdorneles we have this PR https://github.com/scrapinghub/exporters/pull/325 that is almost done, do you think it makes sense to keep this one or close it?

eliasdorneles commented 8 years ago

We can close it, yes, this was another PR done as an experiment and discussion starter. =) Thanks @bbotella ! <3