scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License
51.16k stars 10.35k forks source link

[MRG+1] Added JmesSelect #1016

Closed SudShekhar closed 9 years ago

SudShekhar commented 9 years ago

Created the first implementation of Json processor, any comments/edits are welcome.

I am currently using the jmespath (https://github.com/boto/jmespath) module to search for paths in the given list of values. I have added some sample test cases too. The processor will return the list of values unchanged in case the jmespath module isn't installed. I wasn't sure how to show a warning message in such a case and thus, have used the python print statement for now.

SudShekhar commented 9 years ago

Regarding the storage of ast module as self.astModule in the JsonProcessor class, I looked up the following document : https://wiki.python.org/moin/PythonSpeed/PerformanceTips (Import statement overhead topic). It seems that the current method should be acceptable. Kindly let me know if there are any other issues to be handled.

nramirezuy commented 9 years ago

We are also missing documentation.

nramirezuy commented 9 years ago

@kmike Can you review this? I'm looking for py3 and documentation feedback.

dangra commented 9 years ago

+1 LGTM. It needs to be rebased into a single commit before merging.

kmike commented 9 years ago

What do you think about supporting Python dicts/lists as an input?

The problem with the current implementation is that if there are several JmesProcessor extracting data from the same string they'll parse it again and again. I think it makes sense to allow users to pass already parsed data to the processor in order to avoid this extra work.

There are some edge cases though: e.g. if user pre-parses JSON string, and it doesn't have a top-level object (the result of decoding is a string), then JmesProcessor will try to parse the string again as JSON, which is wrong.

SudShekhar commented 9 years ago

IMO we can add a boolean param to the call function to decide whether to parse the string or not. In the case you have mentioned, the result would be None (which can be left for the user to handle). Does this sound okay?

nramirezuy commented 9 years ago

@kmike but sending non compatible "json strings", doesn't make too much sense either. Also it isn't wrong just nobody implemented it: https://docs.python.org/2.7/library/json.html#top-level-non-object-non-array-values

If you manage to get a "non-json string" as input for your Processor is because the chaining is wrong and deserve the ValueError.

SudShekhar commented 9 years ago

@nramirezuy: If we choose to load all strings as json objects (and return an error if this isn't possible), what should be the behaviour in case of empty strings? Should we return None or raise a ValueError?

kmike commented 9 years ago

@nramirezuy check this example:

# website.py
txt = "foo" 
txt_encoded = json.dumps(txt)  # encode a string to JSON and send it as a response

# myspider.py
data = json.loads(txt_encoded)  # we don't know what is the data, it is from a website
ld.add_json("foo.bar", data)  # assuming we implement add_json

Correct behaviour would be to return None, but with special-casing strings it'll try to decode JSON one more time and raise an exception.

nramirezuy commented 9 years ago

@kmike I kinda disagree on that because because if you are applying a path is because you are expecting something different than a string.

@SudShekhar I think empty strings have to return raise ValueError because aren't valid.

kmike commented 9 years ago

@nramirezuy still, the input was a valid JSON data which happen not to contain the value we requested. It is similar to asking for foo.bar when there is no 'foo' key in a JSON object - you expect 'foo' to be a dictionary with 'bar' key, when it is not the case. But the first query will raise a JSON parsing error, and the second will return None. And if a decoded string happen to contain valid JSON then the answer in first case would be just wrong.

Actually, I'm fine with leaving it as-is if this gotcha is documented. I agree that it is not a common case, and I don't see a way to handle it without making the API worse. Maybe we can have an option in JmesProcessor __init__ (not in __call__ method) to turn off string auto-decoding.

kmike commented 9 years ago

Or we can have 2 Jmes processors, one loading json and the another which doesn't load it

nramirezuy commented 9 years ago
class JmesProcessor(object):
    def __init__(self, path, process=None):
        self.process = process or self._process

    def _process(self, data):
         if isinstance(data, basestring):
              return json.loads(data)
         return data

But I guess doing the json.loads outside is not that hard.

# Load JSON and process
Compose(json.loads, JmesProcessor('foo'))
# Process several JSON objects: '[{"foo":"bar"}, {"baz":"tar"}]'
Compose(json.loads, MapCompose(JmesProcessor('foo')))
# Just process
JmesProcessor('foo')
# Chained processors with load
Compose(json.loads, JmesProcessor('foo'), JmesProcessor('bar'))

I like more the second option. @kmike @SudShekhar What do you think?

EDIT: We should also add examples to the doc.

kmike commented 9 years ago

I like not doing json.loads in the JmesProcessor, and the Compose / MapCompose examples are good.

@nramirezuy you should really create a few nice images/diagrams which show how Compose and MapCompose work :) What are the inputs, what are the outputs, how they can be used together. And maybe we should rename the processors to verbs - instead of JmesProcessor write SelectJmes - Compose(json.loads, SelectJmes('foo')) reads very well.

kmike commented 9 years ago

@nramirezuy a realted question: is it possible to use the processors without the item loaders?

get_foo = Compose(json.loads, SelectJmes('foo'))
# ...
value = get_foo(response.body_as_unicode())

is rather nice. If it works then I think it worths documenting. An maybe splitting this micro-framework from the item loaders, if it is not only about item loaders.

nramirezuy commented 9 years ago

Well every Processor works outside ItemLoaders; and that is completely valid. I'm fine with the processor rename there are no processors named Processor :tongue:

SudShekhar commented 9 years ago

@nramirezuy : The second options looks much better and it gives users more control. So, should I remove the json.loads from inside the processor? Regarding the documentation, the examples you have given look pretty comprehensive to me :smile: .I will add them and some others to the list. +1 to the renaming too.

EDIT: In the documentation, I didn't add the chaining example because I felt that it fit better in the Compose examples list (same can be said for the processor handling a list of json strings I guess). Do let me know your views on this.

nramirezuy commented 9 years ago

There is something with the markup but I don't know what it is.

/cc @kmike

SudShekhar commented 9 years ago

I created the documentation locally but was unable to figure out the error. Can you please point out the issue?

Thanks.

nramirezuy commented 9 years ago

I think you are missing :: in Working with Json: http://pasteboard.co/1mkMnIj3.png

SudShekhar commented 9 years ago

Hi, Is there anything else that needs to be changed in this commit? Thanks for all your feedback. /cc @nramirezuy @kmike

kmike commented 9 years ago

I think this PR is fine, +1 to merge it.

SudShekhar commented 9 years ago

Thanks :smile:

pablohoffman commented 9 years ago

Sorry I couldn't chime in before merge but, if we're gonna add jmespath as requirement (to requirements.txt) we should import it at module-level, not in the class constructor, don't you think?

kmike commented 9 years ago

@pablohoffman why does it matter? jmespath is not added to install_requires, it is optional.

Imports are very fast after the first successful import (a lookup in a dict) - by moving import to module level we won't get any speed benefits, but the exception may become less clear is jmespath is absent.

nramirezuy commented 9 years ago

@pablohoffman @kmike jmespath was added to tests/requirements.txt, not Scrapy ones.