scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License
51.16k stars 10.35k forks source link

add support for a nested loaders #1467

Closed dacjames closed 8 years ago

dacjames commented 8 years ago

Adds a nested_loader() method to ItemLoader. This is useful when you're selecting several related fields from a sub section of the document. Because the nested loader shares the item and values with it's parent, everything works as expected.

For example, here is a snippet of scraping code using the new functionality:

loader = MyLoader(item=MyItem(), response=response)
loader.add_xpath('images', "//div[@class = 'event_title_image']/img/@src")
header_loader = loader.nested_loader(xpath="//div[@id = 'event_header']")
header_loader.add_xpath('event_name', "//span[@class = 'summary']/text()")
header_loader.add_xpath('start_date', "//meta[@itemprop = 'startDate']/@content")
header_loader.add_xpath('end_date', "//meta[@itemprop = 'endDate']/@content")
loader.load_item()

You can achieve something similar by externally creating the nested loader, but you have to wire it up manually and remember to call load_item() on the nested loader.

loader = MyLoader(item=MyItem(), response=response)
loader.add_xpath('images', "//div[@class = 'event_title_image']/img/@src")

# boilerplate that must be updated if MyLoader class is changed
header_loader = MyLoader(
    item=loader.item, 
    response=loader.response, 
    selector=loader.selector.xpath("//div[@id = 'event_header']"),
)
header_loader.add_xpath('event_name', "//span[@class = 'summary']/text()")
header_loader.add_xpath('start_date', "//meta[@itemprop = 'startDate']/@content")
header_loader.add_xpath('end_date', "//meta[@itemprop = 'endDate']/@content")
header_loader.load_item()  # easy to forget
loader.load_item()

Tests have been added to ensure everything works correctly, including the ordering of values and that replace_* calls work correctly.

codecov-io commented 8 years ago

Current coverage is 82.38%

Merging #1467 into master will increase coverage by +0.05% as of 46ad0cb

Powered by Codecov. Updated on successful CI builds.

kmike commented 8 years ago

I like this feature, +1 to add it. Implementation looks fine. Could you please add some docs?

dacjames commented 8 years ago

How about that?

kmike commented 8 years ago

@dacjames looks good! I don't use contexts either, sorry.

dangra commented 8 years ago

@nramirezuy you are the all-mighty ItemLoaders reviewer and evangelist, what do you think?

nramirezuy commented 8 years ago

@dangra Well there are two things I can think of; I don't like right now:

For some reason I would like having indentation; kinda makes the blocks more clear and define a start and end to that child.

I would also like to remove response and selector from the context, and provide a way to move the loader in the meta. Sometimes makes it hard to implement processors; when you have the data in two different pages.

EDIT: Also why creating a new ItemLoader instance when we are just looking to substitute the Selector instance inside it.

dacjames commented 8 years ago

@nramirezuy I split the function into two calls, hopefully eliminating concern number 2. I agree on that front, this is better.

I'm very confused by the objection to sharing the item. That's kind of the entire purpose of this feature, to be able to populate the same item with a loader specialized for a particular subset of the page.

Creating a new ItemLoader instance is desirable because modifying the existing ItemLoader leads to unexpected behavior and I like being able to use both loaders at the same time.

... will fix the coverage issue shortly.

dacjames commented 8 years ago

Rebased on scrapy master to remove the coverage error.

kmike commented 8 years ago

@nramirezuy do you propose to change syntax to this?

loader = MyLoader(item=MyItem(), response=response)
loader.add_xpath('images', "//div[@class = 'event_title_image']/img/@src")
with loader.nested_xpath("//div[@id = 'event_header']"):
    loader.add_xpath('event_name', "//span[@class = 'summary']/text()")
    loader.add_xpath('start_date', "//meta[@itemprop = 'startDate']/@content")
    loader.add_xpath('end_date', "//meta[@itemprop = 'endDate']/@content")
loader.load_item()
dacjames commented 8 years ago

@kmike That syntax would be worse for my use case. I create two nested loaders, one for each section of the page that have partially redundant information, then go through each data point of interest and try to pull from both locations (and from the original loader in one case).

With the proposed syntax, there does not appear to be any way to reference the top-level, un-nested loader.

kmike commented 8 years ago

@dacjames could you show a code example?

dacjames commented 8 years ago

Sure, I need to filter out some private info, but I'll post an example later this evening.

On Fri, Sep 4, 2015 at 3:09 PM, Mikhail Korobov notifications@github.com wrote:

@dacjames https://github.com/dacjames could you show a code example?

— Reply to this email directly or view it on GitHub https://github.com/scrapy/scrapy/pull/1467#issuecomment-137865400.

dacjames commented 8 years ago

My code looks basically like this:

loader = CustomItemLoader(item=CustomItem(), response=response)
header_loader = loader.nested_xpath("//div[@id = 'event_header']")
summary_loader = loader.nested_xpath("//div[@id = 'summary']")

loader.add_value('src_url', response.url)

header_loader.add_xpath('item_name', "//span[@class = 'summary']/text()")
summary_loader.add_xpath('item_name', "//a[re:match(@id, 'name_')]/text()")
loader.add_xpath('item_name', "//title/text()")

header_loader.add_xpath('start_date', "//meta[@itemprop = 'startDate']/@content")
summary_loader.add_xpath('start_date', "//a[re:match(@id, 'startDate_')]/text()")

header_loader.add_xpath('end_date', "//meta[@itemprop = 'endDate']/@content")
summary_loader.add_xpath('end_date', "//a[re:match(@id, 'endDate_')]/text()")
# ...

It could be converted to context manager format, something like:

loader = CustomItemLoader(item=CustomItem(), response=response)

loader.add_value('src_url', response.url)
loader.add_xpath('item_name', "//title/text()")

with loader.nested_xpath("//div[@id = 'event_header']"):
    loader.add_xpath('item_name', "//span[@class = 'summary']/text()")
    loader.add_xpath('start_date', "//meta[@itemprop = 'startDate']/@content")
    loader.add_xpath('end_date', "//meta[@itemprop = 'endDate']/@content")

with loader.nested_xpath("//div[@id = 'summary']"):
    loader.add_xpath('item_name', "//a[re:match(@id, 'name_')]/text()")
    loader.add_xpath('start_date', "//a[re:match(@id, 'startDate_')]/text()")
    loader.add_xpath('end_date', "//a[re:match(@id, 'endDate_')]/text()")

The behavior of this code is less obvious to me because the similar fields are not grouped together. More troubling, if you use the with ... as ... form, you can have an easy to miss bug (I think, haven't tested this code).

with loader.nested_xpath("//div[@id = 'summary']") as summary_loader:
    loader.add_xpath('item_name', "//title/text()")  # BUG: evaluated in nested scope
    summary_loader.add_xpath('item_name', "//a[re:match(@id, 'name_')]/text()")
    summary_loader.add_xpath('start_date', "//a[re:match(@id, 'startDate_')]/text()")
    summary_loader.add_xpath('end_date', "//a[re:match(@id, 'endDate_')]/text()")
nramirezuy commented 8 years ago

@dacjames having multiple instances of the same loader is an issue; first of all you gotta call load_item for each loader. Then the order matters, because it will override the field on the item.

kmike commented 8 years ago

@nramirezuy in @dacjames's implementation there is no need to call load_item() for each loader (it is a point of this PR), and an order matters just like it matters for loader.add_xpath calls with a single loader. Or did I get your comment wrong?

dacjames commented 8 years ago

@nramirezuy a nested loader is a distinct instance of ItemLoader that shares some state with its parent. Specificaly, the _values and items properties are shared. I landed on this implementation because it naturally preserves the expected ordering of add_* calls.

nramirezuy commented 8 years ago

@kmike @dacjames Sorry I got that wrong.

But I still think this is the way to go:

il = Itemloader(response=response)
il.add_xpath('item_name', "//title/text()")
with il.xpath("//div[@id = 'summary']") as selector:
    il.add_xpath('item_name', "//a[re:match(@id, 'name_')]/text()", selector=selector)
    il.add_xpath('start_date', "//a[re:match(@id, 'startDate_')]/text()", selector=selector)
    il.add_xpath('end_date', "//a[re:match(@id, 'endDate_')]/text()", selector=selector)

If you want you can:

il = Itemloader(response=response)
summary_selector = il.xpath("//div[@id = 'summary']")
il.add_xpath('item_name', "//title/text()")
il.add_xpath('item_name', "//a[re:match(@id, 'name_')]/text()", selector=summary_selector)
il.add_xpath('start_date', "//a[re:match(@id, 'startDate_')]/text()", selector=summary_selector)
il.add_xpath('end_date', "//a[re:match(@id, 'endDate_')]/text()", selector=summary_selector)
kmike commented 8 years ago

@nramirezuy what's an adavntage of writing selector=selector in all these .add_xpath calls, why do you prefer it?

nramirezuy commented 8 years ago

Because is what you wanna do in the ItemLoader, use a different selector for instance. Also gives you the possibility to give a selector from a different source. I think ideally ItemLoader shouldn't be attached to a Selector instance. This will allow us to move the ItemLoader in request.meta.

Something like this:

il = ItemLoader(item=item or {})
il.xpath('//title/text()', selector=response.selector)
with il(response.selector or response):
    il.xpath('//title/text()')
dacjames commented 8 years ago

@nramirezuy this PR doesn't preclude setting the selector per your example. It just provides a nice shortcut to avoid having to write the all the duplicate selector=loader.selector.xpath('/some/subquery') arguments.

I personally don't like the context manager approach because using it incorrectly is too easy.

il = ItemLoader(item=item)
# ...
with il(response.selector) as il2:
    assert il is il2
    # surprising! mutating objects as their own context managers is quite unusual
nramirezuy commented 8 years ago

@dacjames what am I missing?

>>> class A(object):
...     selector = None
...     @contextmanager
...     def __call__(self, selector):
...         self.selector = selector
...         yield self
...         self.selector = None
... 
>>> a = A()
>>> with a('a') as x:
...     assert a is x
...     a.selector
...     x.selector
... 
'a'
'a'
>>> a.selector
>>> x.selector
dacjames commented 8 years ago

I just don't like the idea of context managers that mutate themselves.

with a('a') as x:
    # I expect to be able to access a here and NOT have it mutated
    assert a.selector is None  # False!  Very surprising to me
    assert x.selector == 'a'  # True! expected.

Since I bound the context manager to x, I would expect the changes to only affect x, when in fact they mutate a for the duration of the with block.

In this instance, I would think I could use the outer loader within the with block if I bind with as, but in fact there is no way to access the unnested loader within the with block.

nramirezuy commented 8 years ago

I can't see the issue.

>>> f = open('tmp_file', 'w')
>>> with f as x:
...     assert f is x # True

The only difference with the loader is that we never close it, so we can keep using it.

dacjames commented 8 years ago

In that example, unlike your suggestion, with f does not mutate f at all. f inside the with block is exactly the same state as f outside the block.

I just don't see what advantage the context manager solution provides. It's the same amount of code and enforces a certain structure of usage. With nested loaders, I can group code how I want (see comment above), pass the nested loader to a utility function, and it's all around less magical.

nramirezuy commented 8 years ago

I think changing state inside a context is completely normal.

>>> f = open('tmp_file', 'w')
>>> f.closed
False
>>> with f:
...     f.closed
... 
False
>>> f.closed
True
>>> import threading
>>> lock = threading.Lock()
>>> lock.locked()
False
>>> with lock:
...     lock.locked()
... 
True
>>> lock.locked()
False

Well thats why you have a less magical way of doing things:

il = ItemLoader()
summary_selector = il.xpath("//div[@id = 'summary']")
il.add_xpath('item_name', "//title/text()")
il.add_xpath('item_name', ".//a[re:match(@id, 'name_')]/text()", selector=summary_selector)
il.add_xpath('start_date', ".//a[re:match(@id, 'startDate_')]/text()", selector=summary_selector)
il.add_xpath('end_date', ".//a[re:match(@id, 'endDate_')]/text()", selector=summary_selector)

or "Magical"

il = ItemLoader()
il.add_xpath('item_name', "//title/text()", selector=response.selector)
with il(response.xpath("//div[@id = 'summary']")):
    il.add_xpath('item_name', ".//a[re:match(@id, 'name_')]/text()")
    il.add_xpath('start_date', ".//a[re:match(@id, 'startDate_')]/text()")
    il.add_xpath('end_date', ".//a[re:match(@id, 'endDate_')]/text()")
dacjames commented 8 years ago

threading.Lock() is an interesting counter example, though still quite different from your usage. The lock object itself functions as the context manager, so it being mutated is expected. In your case, the loader is functioning both as a regular object and as the context manager, which is bad, IMO. To make the difference more clear this code would be analogous to threading.Lock():

loader = ItemLoader()
with loader:
    # do something

This code that you are suggesting is not:

loader = ItemLoader()
subloader = loader('//dev/foobar')
with subloader:
    # do something 

I think the nested loader provides for more readable code (mutation is always bad if it can be avoided), but you seem incorrigible so I don't see value in debating the topic further.

dangra commented 8 years ago

The functionality and code looks good to me, even if we eventually settle on a syntax sugar around contextmanagers it will be based on the changes included in this PR.

Code is clean, tested and covered; docs needs to be updated because of the split to nested_xpath/nested_css methods.

+1 to merge after fixing docs.

dacjames commented 8 years ago

Updated the docs for the method split.