Closed dacjames closed 8 years ago
82.38%
Merging #1467 into master will increase coverage by +0.05% as of
46ad0cb
Powered by Codecov. Updated on successful CI builds.
I like this feature, +1 to add it. Implementation looks fine. Could you please add some docs?
How about that?
@dacjames looks good! I don't use contexts either, sorry.
@nramirezuy you are the all-mighty ItemLoaders reviewer and evangelist, what do you think?
@dangra Well there are two things I can think of; I don't like right now:
For some reason I would like having indentation; kinda makes the blocks more clear and define a start and end to that child.
I would also like to remove response and selector from the context, and provide a way to move the loader in the meta. Sometimes makes it hard to implement processors; when you have the data in two different pages.
EDIT: Also why creating a new ItemLoader instance when we are just looking to substitute the Selector instance inside it.
@nramirezuy I split the function into two calls, hopefully eliminating concern number 2. I agree on that front, this is better.
I'm very confused by the objection to sharing the item. That's kind of the entire purpose of this feature, to be able to populate the same item with a loader specialized for a particular subset of the page.
Creating a new ItemLoader instance is desirable because modifying the existing ItemLoader leads to unexpected behavior and I like being able to use both loaders at the same time.
... will fix the coverage issue shortly.
Rebased on scrapy master to remove the coverage error.
@nramirezuy do you propose to change syntax to this?
loader = MyLoader(item=MyItem(), response=response)
loader.add_xpath('images', "//div[@class = 'event_title_image']/img/@src")
with loader.nested_xpath("//div[@id = 'event_header']"):
loader.add_xpath('event_name', "//span[@class = 'summary']/text()")
loader.add_xpath('start_date', "//meta[@itemprop = 'startDate']/@content")
loader.add_xpath('end_date', "//meta[@itemprop = 'endDate']/@content")
loader.load_item()
@kmike That syntax would be worse for my use case. I create two nested loaders, one for each section of the page that have partially redundant information, then go through each data point of interest and try to pull from both locations (and from the original loader in one case).
With the proposed syntax, there does not appear to be any way to reference the top-level, un-nested loader.
@dacjames could you show a code example?
Sure, I need to filter out some private info, but I'll post an example later this evening.
On Fri, Sep 4, 2015 at 3:09 PM, Mikhail Korobov notifications@github.com wrote:
@dacjames https://github.com/dacjames could you show a code example?
— Reply to this email directly or view it on GitHub https://github.com/scrapy/scrapy/pull/1467#issuecomment-137865400.
My code looks basically like this:
loader = CustomItemLoader(item=CustomItem(), response=response)
header_loader = loader.nested_xpath("//div[@id = 'event_header']")
summary_loader = loader.nested_xpath("//div[@id = 'summary']")
loader.add_value('src_url', response.url)
header_loader.add_xpath('item_name', "//span[@class = 'summary']/text()")
summary_loader.add_xpath('item_name', "//a[re:match(@id, 'name_')]/text()")
loader.add_xpath('item_name', "//title/text()")
header_loader.add_xpath('start_date', "//meta[@itemprop = 'startDate']/@content")
summary_loader.add_xpath('start_date', "//a[re:match(@id, 'startDate_')]/text()")
header_loader.add_xpath('end_date', "//meta[@itemprop = 'endDate']/@content")
summary_loader.add_xpath('end_date', "//a[re:match(@id, 'endDate_')]/text()")
# ...
It could be converted to context manager format, something like:
loader = CustomItemLoader(item=CustomItem(), response=response)
loader.add_value('src_url', response.url)
loader.add_xpath('item_name', "//title/text()")
with loader.nested_xpath("//div[@id = 'event_header']"):
loader.add_xpath('item_name', "//span[@class = 'summary']/text()")
loader.add_xpath('start_date', "//meta[@itemprop = 'startDate']/@content")
loader.add_xpath('end_date', "//meta[@itemprop = 'endDate']/@content")
with loader.nested_xpath("//div[@id = 'summary']"):
loader.add_xpath('item_name', "//a[re:match(@id, 'name_')]/text()")
loader.add_xpath('start_date', "//a[re:match(@id, 'startDate_')]/text()")
loader.add_xpath('end_date', "//a[re:match(@id, 'endDate_')]/text()")
The behavior of this code is less obvious to me because the similar fields are not grouped together. More troubling, if you use the with ... as ...
form, you can have an easy to miss bug (I think, haven't tested this code).
with loader.nested_xpath("//div[@id = 'summary']") as summary_loader:
loader.add_xpath('item_name', "//title/text()") # BUG: evaluated in nested scope
summary_loader.add_xpath('item_name', "//a[re:match(@id, 'name_')]/text()")
summary_loader.add_xpath('start_date', "//a[re:match(@id, 'startDate_')]/text()")
summary_loader.add_xpath('end_date', "//a[re:match(@id, 'endDate_')]/text()")
@dacjames having multiple instances of the same loader is an issue; first of all you gotta call load_item
for each loader. Then the order matters, because it will override the field on the item.
@nramirezuy in @dacjames's implementation there is no need to call load_item()
for each loader (it is a point of this PR), and an order matters just like it matters for loader.add_xpath
calls with a single loader. Or did I get your comment wrong?
@kmike @dacjames Sorry I got that wrong.
But I still think this is the way to go:
il = Itemloader(response=response)
il.add_xpath('item_name', "//title/text()")
with il.xpath("//div[@id = 'summary']") as selector:
il.add_xpath('item_name', "//a[re:match(@id, 'name_')]/text()", selector=selector)
il.add_xpath('start_date', "//a[re:match(@id, 'startDate_')]/text()", selector=selector)
il.add_xpath('end_date', "//a[re:match(@id, 'endDate_')]/text()", selector=selector)
If you want you can:
il = Itemloader(response=response)
summary_selector = il.xpath("//div[@id = 'summary']")
il.add_xpath('item_name', "//title/text()")
il.add_xpath('item_name', "//a[re:match(@id, 'name_')]/text()", selector=summary_selector)
il.add_xpath('start_date', "//a[re:match(@id, 'startDate_')]/text()", selector=summary_selector)
il.add_xpath('end_date', "//a[re:match(@id, 'endDate_')]/text()", selector=summary_selector)
@nramirezuy what's an adavntage of writing selector=selector
in all these .add_xpath
calls, why do you prefer it?
Because is what you wanna do in the ItemLoader
, use a different selector
for instance. Also gives you the possibility to give a selector from a different source.
I think ideally ItemLoader
shouldn't be attached to a Selector
instance. This will allow us to move the ItemLoader
in request.meta
.
Something like this:
il = ItemLoader(item=item or {})
il.xpath('//title/text()', selector=response.selector)
with il(response.selector or response):
il.xpath('//title/text()')
@nramirezuy this PR doesn't preclude setting the selector per your example. It just provides a nice shortcut to avoid having to write the all the duplicate selector=loader.selector.xpath('/some/subquery')
arguments.
I personally don't like the context manager approach because using it incorrectly is too easy.
il = ItemLoader(item=item)
# ...
with il(response.selector) as il2:
assert il is il2
# surprising! mutating objects as their own context managers is quite unusual
@dacjames what am I missing?
>>> class A(object):
... selector = None
... @contextmanager
... def __call__(self, selector):
... self.selector = selector
... yield self
... self.selector = None
...
>>> a = A()
>>> with a('a') as x:
... assert a is x
... a.selector
... x.selector
...
'a'
'a'
>>> a.selector
>>> x.selector
I just don't like the idea of context managers that mutate themselves.
with a('a') as x:
# I expect to be able to access a here and NOT have it mutated
assert a.selector is None # False! Very surprising to me
assert x.selector == 'a' # True! expected.
Since I bound the context manager to x
, I would expect the changes to only affect x
, when in fact they mutate a
for the duration of the with
block.
In this instance, I would think I could use the outer loader within the with
block if I bind with as
, but in fact there is no way to access the unnested loader within the with
block.
I can't see the issue.
>>> f = open('tmp_file', 'w')
>>> with f as x:
... assert f is x # True
The only difference with the loader is that we never close it, so we can keep using it.
In that example, unlike your suggestion, with f
does not mutate f
at all. f
inside the with
block is exactly the same state as f
outside the block.
I just don't see what advantage the context manager solution provides. It's the same amount of code and enforces a certain structure of usage. With nested loaders, I can group code how I want (see comment above), pass the nested loader to a utility function, and it's all around less magical.
I think changing state inside a context is completely normal.
>>> f = open('tmp_file', 'w')
>>> f.closed
False
>>> with f:
... f.closed
...
False
>>> f.closed
True
>>> import threading
>>> lock = threading.Lock()
>>> lock.locked()
False
>>> with lock:
... lock.locked()
...
True
>>> lock.locked()
False
Well thats why you have a less magical way of doing things:
il = ItemLoader()
summary_selector = il.xpath("//div[@id = 'summary']")
il.add_xpath('item_name', "//title/text()")
il.add_xpath('item_name', ".//a[re:match(@id, 'name_')]/text()", selector=summary_selector)
il.add_xpath('start_date', ".//a[re:match(@id, 'startDate_')]/text()", selector=summary_selector)
il.add_xpath('end_date', ".//a[re:match(@id, 'endDate_')]/text()", selector=summary_selector)
or "Magical"
il = ItemLoader()
il.add_xpath('item_name', "//title/text()", selector=response.selector)
with il(response.xpath("//div[@id = 'summary']")):
il.add_xpath('item_name', ".//a[re:match(@id, 'name_')]/text()")
il.add_xpath('start_date', ".//a[re:match(@id, 'startDate_')]/text()")
il.add_xpath('end_date', ".//a[re:match(@id, 'endDate_')]/text()")
threading.Lock()
is an interesting counter example, though still quite different from your usage. The lock object itself functions as the context manager, so it being mutated is expected. In your case, the loader is functioning both as a regular object and as the context manager, which is bad, IMO. To make the difference more clear this code would be analogous to threading.Lock():
loader = ItemLoader()
with loader:
# do something
This code that you are suggesting is not:
loader = ItemLoader()
subloader = loader('//dev/foobar')
with subloader:
# do something
I think the nested loader provides for more readable code (mutation is always bad if it can be avoided), but you seem incorrigible so I don't see value in debating the topic further.
The functionality and code looks good to me, even if we eventually settle on a syntax sugar around contextmanagers it will be based on the changes included in this PR.
Code is clean, tested and covered; docs needs to be updated because of the split to nested_xpath/nested_css methods.
+1 to merge after fixing docs.
Updated the docs for the method split.
Adds a nested_loader() method to ItemLoader. This is useful when you're selecting several related fields from a sub section of the document. Because the nested loader shares the item and values with it's parent, everything works as expected.
For example, here is a snippet of scraping code using the new functionality:
You can achieve something similar by externally creating the nested loader, but you have to wire it up manually and remember to call
load_item()
on the nested loader.Tests have been added to ensure everything works correctly, including the ordering of values and that
replace_*
calls work correctly.