Closed: shirk3y closed this issue 9 years ago
I've made a pull request (#569) which proposes a solution to this issue.
I think the problem is not in the selecting methods .xpath() and .css() but in .extract(); we can add a parameter to get the first result.
@darkrho: in #569 you propose .get(), but it doesn't extract; instead it returns the first element of a SelectorList, which needs a call to .extract() to get the desired result for this issue:
>>> sel.css('span').get(0).extract()
u'<span>foo</span>'
I think the use case is almost always getting the first element; indexing can still be addressed with the xpath or css methods:
>>> sel.css('span').extract(first=True)
u'<span>foo</span>'
It could also be like this:
sel.css('span').extract_first()
Maybe it would prevent constructions that aren't very clear, e.g.:
sel.css('span').extract(True)
I agree with @shirk3y; instead of having an ambiguous method, it would be much better to have a new method. Another name could be .first().
I like sel.xpath('//p/text()').extract(index) more: if index is None we get the list of results by default; if there is no element at that index, return None instead of an empty list.
As @dangra said, you can always select the n-th element using xpath, so does it make sense to write something like this?
sel.xpath('//p/text()').extract(2)
There is a difference in the data type returned:
>>> sel = Selector(text='<div id="breadcrumb"><a href="./index">Home</a><a href="./Cat-1">Cat-1</a><a href="./Sub-1">Sub-1</a></div>', type='html')
>>> sel.xpath('//div[@id="breadcrumb"]/a/text()').extract()
[u'Home', u'Cat-1', u'Sub-1']
>>> sel.xpath('//div[@id="breadcrumb"]/a/text()').extract(1)
u'Cat-1'
>>> sel.xpath('(//div[@id="breadcrumb"]/a)[2]/text()').extract()
[u'Cat-1']
>>>
>>> sel.xpath('//div[@id="breadcrumb"]/a/text()').extract(6)
>>> sel.xpath('(//div[@id="breadcrumb"]/a)[7]/text()').extract()
[]
I like .extract_first() more, because it is not good style to change the return type based on method arguments. See also Guido's opinion: https://mail.python.org/pipermail/python-dev/2005-August/055907.html.
It looks like sel.xpath('//p/text()').extract(1) would be less efficient than sel.xpath('//p[2]/text()').extract_first(), unless we employ some trick.
sel.xpath('//p/text()').first(): I'd expect it to return a selector, not a raw string value; extract_first wins again :)
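For reference, the "trick" could simply be stopping early. A minimal sketch (not Scrapy's actual implementation; SelectorList and FakeSelector below are illustrative stand-ins):

```python
# Sketch only: extract() serializes every matched element, while
# extract_first() can return after serializing just the first one.
class SelectorList(list):
    def extract(self):
        # pay the extraction cost for every match
        return [sel.extract() for sel in self]

    def extract_first(self):
        # pay the extraction cost for the first match only
        for sel in self:
            return sel.extract()
        return None  # nothing matched


class FakeSelector:
    """Hypothetical stand-in for a real selector, counting extractions."""
    calls = 0

    def __init__(self, value):
        self.value = value

    def extract(self):
        FakeSelector.calls += 1
        return self.value
```

With this, `SelectorList([FakeSelector('a'), FakeSelector('b')]).extract_first()` touches only one element, which is the whole efficiency argument.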
Oh, but even Guido mentions there can be exceptions! ;)
I have to admit the parameter thing isn't a good idea, +1 to whatever name you guys choose for the new method.
Please see: #572
/cc our naming tsar @pablohoffman
I agree that extract_first is a better name than just first.
We've missed the .re('\d+').get(0) case implemented in #569. And my comment about efficiency applies to .extract_first() as well.
Maybe there should be re_first() as well?
I like extract1() / re1(), mainly because they're shorter, which is a big advantage for a method that's gonna appear very frequently in spider code.
I like @nramirezuy's .extract(index), and why not .re('\d+', index)?
I was also leaning towards .extract(index) and .re('\d+', index), as that seems more versatile. (It's similar to what I'm currently using.)
But after thinking this through, it's correct that xpath or css selectors can always be tuned to get the correct single item.
So +1 to @pablohoffman for .extract1() / .re1(). (How about an alias .x() / .x1() / .r()?)
And for returning a single index (or range) as a list, there is still list slicing syntax: .extract()[-1:].
(edit)
I would also add that extract1 / x1 reads better, as it may be understood as "extract first" or "extract one"; whether the returned value is a first/last/middle value is in the eye of the selector ;) But we know it'll always be a single value.
There is a deprecated method .x(). This is probably because it's too short and meaningless. +1 for .extract1() / .re1()
right, this isn't javascript (but sad).
I'm not a big fan of extract1() and re1(): they are more ambiguous than extract_first() and re_first(), because "1" tells us these methods return a single element, but it doesn't tell us which element they return exactly. Also, re1 and rel look quite similar.
I wonder if we can mimic dict.get(key, default) to retrieve a default value instead of None when no element is found.
It would be like this:
sel.xpath('//p/text()').extract_first(default='N/A')
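A minimal sketch of those semantics, operating on an already-extracted list (the standalone helper and its name are hypothetical, just to show the dict.get-style contract):

```python
def extract_first(results, default=None):
    # Mimic dict.get(key, default): first element if anything matched,
    # otherwise the caller-supplied default instead of an IndexError.
    return results[0] if results else default

extract_first([u'foo', u'bar'])   # -> u'foo'
extract_first([], default='N/A')  # -> 'N/A'
```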
@kmike,
I would argue the opposite. If you name it extract_first(), people might go looking for extract_last() and miss the point of using selectors to achieve that goal. (Happened to me.)
Certainly, if the selector returns multiple elements, it'll pick the first value of that list. But more often I have a selector tailored to return only a single element in the first place, say xyz[last()].
And then, from that perspective (of the selector), extract_first() might even look strange, since I'm trying for the last element.
I think extract1() would just as easily be remembered as both "first" and "single".
(The 1 vs l issue I would always argue is a 'using the wrong font' problem :)
@shirk3y, interesting - but then IMO it'd feel more natural again to name it .get(), where that behaviour is more expected, or to pass it as a second argument to extract(), as in extract(1, 'fail').
Maybe .extract1() or 'fail' is enough? Though I'm not opposed in general.
@shirk3y We should start asking why those projects don't use ItemLoaders. Is there a real reason not to use them?
In my opinion, extract1() could be useful if we want to create a spider quickly, even without defining Items. Maybe it is not good practice not to use ItemLoaders, but they are not necessary in some cases.
@nramirezuy I think there are several reasons for not using ItemLoaders. Let's try best practices:
# === items.py or loaders.py ===
from scrapy.contrib.loader import ItemLoader  # BOILERPLATE
from scrapy.contrib.loader.processor import TakeFirst  # BOILERPLATE

# ...

class MyItemLoader(ItemLoader):  # BOILERPLATE
    default_item_class = MyItem  # BOILERPLATE
    my_field_out = TakeFirst()  # BOILERPLATE

# === myspider.py ===
from myproject.items import MyItemLoader  # BOILERPLATE

# ...

def parse(self, response):
    ld = MyItemLoader(response)  # BOILERPLATE
    # ...
    return ld.load_item()
That's a lot of new concepts and 7 extra lines of code in 2 files, while the user just wanted to do .extract()[0] and avoid exceptions when nothing is extracted. Item loaders can be useful, and this boilerplate is justified if MyItemLoader is going to be used several times, or if it is possible to write some base item loader for many items, but having one item per spider is quite common.
Item loaders are powerful, but not simple. There are input processors, output processors (if you ask me, I still can't tell the difference without re-reading the docs), Compose and MapCompose (the same), import locations to remember (scrapy.contrib.loader), magic class attributes (default_item_class) and conventions for magic class attributes (_out) to remember, and no docstrings to look all of that up in an IDE. I still can't remember what arguments a processor should take (a value? an iterable of values? maybe something different for input and output processors?) and what it should return.
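For what it's worth, a TakeFirst-style output processor boils down to very little code. This sketch assumes the contract is "receives the iterable of collected values, returns the final value"; treat that reading as my assumption, not the documented API:

```python
class TakeFirst:
    """Sketch of a TakeFirst-like output processor (not Scrapy's code):
    return the first collected value that is neither None nor ''."""

    def __call__(self, values):
        for value in values:
            if value is not None and value != '':
                return value
        # implicit None when nothing qualifies
```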
@kmike You can use ItemLoaders straightforwardly without defining a custom one. TakeFirst is not the same as list[0]:
>>> from scrapy.http import HtmlResponse
>>> from scrapy.contrib.loader import ItemLoader
>>> from scrapy.contrib.loader.processor import TakeFirst
>>> response = HtmlResponse('http://scrapinghub.com', body='<html><body><div><a href="">Home</a><a href="./index">Index</a></div><div><a href="./faq">FAQ</a></div></body></html>')
>>> loader = ItemLoader(response=response, item={})
>>> loader.add_xpath('url', '//a/@href')
>>> loader.load_item()
{'url': [u'', u'./index', u'./faq']}
>>> loader = ItemLoader(response=response, item={})
>>> loader.add_xpath('url', '//a/@href', lambda x: x[0])
>>> loader.load_item()
{'url': [u'']}
>>> loader = ItemLoader(response=response, item={})
>>> loader.add_xpath('url', '//a/@href', TakeFirst())
>>> loader.load_item()
{'url': [u'./index']}
There is another use case. For example, if you want to extract the link to the next page but don't need to store it in an item, you'll probably use a plain Selector, and taking only the first element becomes handy.
@shirk3y That is covered too. The only thing missing is an easier import.
>>> loader.get_xpath('//@href', TakeFirst())
u'./index'
This is becoming a discussion on using item loaders or not :)
My 2c on that: I agree with @kmike that item loaders can be powerful but are not easy. AFAIK, item loaders do not help when you traverse the document, select different portions, and within each select a few data bits with short selectors. More often than not, I find myself working with selector-extracted data and then using add_value(). That may not be best practice, but it works for me.
And as for the name, numbers in method names, especially 1, make me frown. And fonts? Well, sometimes you look at code on mobile phones, on Stack Overflow, in Google Groups, etc.; you don't always choose the font ;)
By the way, an interesting comparison is the ElementTree and BeautifulSoup APIs, which do it the other way around: .find() gets you the first element or None, and .findall() (or .find_all() for BS) gets you the list.
And I must admit I want the first element much more often than a list.
@redapple, +1 to deprecate .extract() and add .find() and .findall() (or something along those lines). The .re() can be a parameter: .find(re="\d+"), just like the loader, which allows an optional re parameter.
@darkrho, as the searching part of "find" is done in .xpath() or .css(), maybe the .get() idea above is what fits best (and .getall()).
@redapple I'm just trying to point out that this feature is not needed; Loaders are the way to go. I think the examples I posted here are not complicated at all.
I would like to hear something from the Elders (@dangra, @pablohoffman) about this.
>>> from scrapy.contrib.loader import ItemLoader
>>> from scrapy.contrib.loader.processor import Join
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse('http://scrapinghub.com', body='<html><body><div><a href="">Home</a><a href="./index">Index</a></div><div><a href="./faq">FAQ</a></div></body></html>')
>>> loader = ItemLoader(response=response, item={})
>>> loader.add_xpath('url', '(//a/text())[1]')
>>> loader.add_xpath('url', '(//a/text())[2]')
>>> loader.load_item()
{'url': [u'Home', u'Index']}
>>> loader.url_out = Join(', ')
>>> loader.load_item()
{'url': u'Home, Index'}
I don't see ItemLoaders as an essential part of scrapy, but as a useful extension. IMHO they shouldn't be a requirement, to keep the core functionality clean and simple. I like that they could be stripped out into an extension now, and that there is no dependency on them.
And take a (banking) site which randomizes everything down to form names and form input field names:
form = sel.xpath(XPATH)
username = form.xpath('.//input[@type="text"][1]').xpath('@name|@id').extract1()
password = form.xpath('.//input[@type="password"][1]').xpath('@name|@id').extract1()
return FormRequest.from_response(response,
    formxpath=XPATH,
    formdata={
        username: 'some',
        password: 'login',
        more: data,
    }
)
Should I need ItemLoaders for something like this?
You don't need it, but can use it.
>>> response = HtmlResponse('http://scrapinghub.com', body='<html><body><form><input type="text" name="username" id="username"><input type="password" name="passwd" id="passwd"></form></body></html>')
>>> loader = ItemLoader(response=response)
>>> loader.get_xpath('//input[@type="text"]/@*[name()="name" or name()="id"]', TakeFirst())
u'username'
>>> loader.get_xpath('//input[@type="password"]/@*[name()="name" or name()="id"]', TakeFirst())
u'passwd'
The use of loader.get_xpath is very compelling, because many times I want to extract a value, apply some cleaning, and use it for a url or something. get_xpath provides a clean API for this use case.
Nevertheless, I can't wrap my head around the idea of having to use an item loader when I don't want to load an item.
This might be solved by making selector.xpath and selector.css behave like loader.get_xpath. But this implies changing the API and doesn't address the fact that selector.xpath and selector.css return a SelectorList.
@darkrho If you want the SelectorList, you are not going to use this change. Use it for a url: https://github.com/scrapy/scrapy/issues/578, https://github.com/scrapy/scrapy/pull/346
I feel that if we approve this change, people will start using Selector to populate items, because it is "easy", which it is not in the long term.
we definitely need some wisdom from the "elders" @pablohoffman @dangra :)
I feel if we approve this change people will start using Selector to populate the items [...]
This is already the case for most simple projects and, as the more intuitive way, it should be the goal going forward IMHO.
In fact, would there be any downside to moving ItemLoader's input processors over to replace Selector's static extract()?
That looks like it could fix the problems on both sides: ItemLoaders that don't take SelectorLists, and Selectors that don't allow cleaning.
ItemLoaders take much time to learn; I think no newbie in IRC ever had ItemLoaders in their code.
They add code complexity with no payoff in a simple project, adding to the learning curve quite a bit: ItemLoaders, processors, how they work, how they chain.
This is made more profound by all the ways you can put them together: define them in the item or in the spider, put processors in loader subclasses or as arguments to l.add_*() methods, then another layer of depth using contexts... and so on. I was certainly chewing on them for more than a week initially, writing and rewriting, testing the best conventions for writing them myself.
I don't see why we should purposely steepen the learning curve by forcing them on people like this.
Besides, in all the code of others I have seen to date, nobody came up with a solution as in https://github.com/scrapy/scrapy/issues/568#issuecomment-34233291; they usually write a wrapper function as a workaround. So this is not intuitive.
And there is the unnecessary instantiation of an item, which is never used, when using an ItemLoader as in the example earlier. In summary, ItemLoaders look too "fat", clumsy, and like a hack for this use case.
Let's compare one of Nicolas' examples (very compelling, by the way) with how it could be written without item loaders:
>>> loader = ItemLoader(response=response, item={})
>>> loader.add_xpath('url', '(//a/text())[1]')
>>> loader.add_xpath('url', '(//a/text())[2]')
>>> loader.load_item()
{'url': [u'Home', u'Index']}
>>> loader.url_out = Join(', ')
>>> loader.load_item()
{'url': u'Home, Index'}
This is how it might look without item loaders:
>>> sel = Selector(response)
>>> urls = sel.xpath('(//a/text())[1]').extract()
>>> urls.extend(sel.xpath('(//a/text())[2]').extract())
>>> dict(url=urls)
{'url': [u'Home', u'Index']}
>>> dict(url=', '.join(urls))
{'url': u'Home, Index'}
Pro item loader:
- fewer .extract() boilerplate calls;
- no need to write url=... again (and do it for each key) if we compute values using local variables.
Neutral:
- take_first(sel.xpath('...')) vs ld.get_xpath('...', TakeFirst()) is a question of style; I think both are fine. The first is easier to grasp because it closely follows Python semantics - it is easier to understand just by looking at the source code. The second is more declarative. There are packages like https://github.com/kachayev/fn.py that allow combining functions into pipelines without too much (((())))) nesting, if that is an issue.
Cons item loader:
For me the main issue is that they are stateful. When you call sel.xpath(..) or .extract() you get a new value; when you call .add_xpath you mutate some variable in a non-obvious way. I don't like loader.url_out = Join(', ') either.
It is named "item loader", but when you work with it you should think of it as of an item with some extra voodoo, not as of data transformation pipeline.
Another issue is #578 - as Nicolas said, there are item loaders, selectors and link extractors, and they can all do basically the same thing. ld.get_xpath looks awfully like selectors. It is not good that we have so many slightly different ways to do a basic task.
The third issue is complexity - I agree with @nyov here.
It is good that this discussion started before we merged the extract_first() PR or promoted item loaders from contrib :)
take_first(sel.xpath('...')) vs ld.get_xpath('...', TakeFirst()) is a question of style [...]
Not exactly, which is my other grief. The first looks ugly but would allow arbitrary xpaths to be passed; the second's xpath 'base' is locked in at ItemLoader instantiation time in the loader context and can't be re-set in its lifetime, e.g. for traversing loops, or for partitioning a page into sections by parent tables or such.
And unless you need l.load_item(), and only use l.get_xpath() or such, you're just pulling along an instantiated but never-populated item for the loader's lifetime.
Another thought experiment...
- An Extractor class which can use processors (deprecating ItemLoader's input processing, and maybe all the XpathItemLoader functionality since its unification?)
- Selector's extract() instead exposes an Extractor.process() or some such, which can take processors. (extract() would equal extract(Identity()), maybe)
- LinkExtractors become processors for Extractor; or subclasses?
This would give us a separation of concerns here:
- Selector handles Selector(List)s
- Extractor handles extraction with processors
- LinkExtractors are special Extractor cases (processors)
- ItemLoaders get much simpler without the selector baggage, possibly reduced to a single 'output processor' chain?
Haven't quite thought this through, but it looks neat in concept ;)
The selector can be changed at any time; we just need a fancy method.
>>> for sel in loader.selector.xpath('//a'):
... loader.selector = sel
... loader.get_xpath('.//text()')
...
[u'Home']
[u'Index']
[u'FAQ']
The item instantiation is not needed for data extraction; I used a dict in some cases.
You can use multiple xpaths in the same add_xpath call.
>>> loader.add_xpath('url', ('(//a/text())[1]', '(//a/text())[2]'))
>>> loader.load_item()
{'url': [u'Home', u'Index']}
I like having multiple processor chains on my ItemLoaders; it is handy in some cases. I don't think you need to remove them, mainly because you can just use one and simplify.
I learned a lot about using ItemLoaders from reading this thread.
I guess it could be mentioned in scrapy/templates/spiders/crawl.tmpl to expose more users to it.
And @pablohoffman, please, extract1 is an awful name. :-1:
@barraponto so what was your suggestion?
Either extract_one or the pair get and get_all. I mean, adding numbers to method names to make them shorter feels weird.
Using only built-in Python features, I came up with an approach for a similar scenario: an idiom for all lists where I expect exactly one member. Suppose I want to:
res = sel.xpath("//*[@name='Exactly Once']").extract()
assert 1 == len(res), IndexError('Expected exactly 1 match, not %d' % len(res))
res = res[0]
I abbreviate this as:
res, = sel.xpath("//*[@name='Exactly Once']").extract()
Unpack or raise an exception. Of course this doesn't apply everywhere, but it covers many variations of this issue in one line. Unfortunately it doesn't solve this functionally, but if it works, I stop here. More idiomatic workarounds follow:
Sometimes I can just star the extracted list when passing it to a function that expects exactly one argument.
text_size = len(*sel.xpath("//*[@name='Exactly Once']").extract())
Functional assertions can be abbreviated too (but most people don't like the following):
# Because int() optionally accepts a second argument
num = int((lambda i: i)(*sel.xpath("//*[@name='Exactly Once']").extract()))
Next subject: a get() method with a default value.
maybe_node0 = (sel.xpath("//*[@name='Exactly Once']").extract() or [None])[0]
maybe_node1 = (sel.xpath("//*[@name='Exactly Once']").extract()[1:] or [None])[0]
Speaking for myself, in cases where I just try to avoid multi-line try/except and if/else constructs, I don't feel like reinventing the wheel and I just use slices. Do such idioms satisfy anyone else?
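They do cover a lot, though if the idiom recurs in every callback, it can be captured once in a small helper (hypothetical, not part of any API) instead of repeating the slicing inline:

```python
def take_one(results, index=0, default=None):
    # The slicing idiom from above, spelled as EAFP: any out-of-range
    # index (positive or negative) yields `default` instead of raising.
    try:
        return results[index]
    except IndexError:
        return default

take_one([u'Home', u'Cat-1'])            # -> u'Home'
take_one([u'Home', u'Cat-1'], index=-1)  # -> u'Cat-1'
take_one([], default='N/A')              # -> 'N/A'
```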
@Digenis I like your approach; it is even better than extract_one() because you can select the element you want from the list.
>>> a, = [][:1] or [None]
>>> a
>>> a, = [1, 2][:1] or [None]
>>> a
1
>>> a, = [1, 2, 3][1:2] or [None]
>>> a
2
>>> a, = []
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: need more than 0 values to unpack
Ok, so we could write
res, = sel.xpath("//*[@name='DoesNotExist']").extract() or [None]
But then we have a ValueError
again for multiple list items. So we need at least
res, = sel.xpath("//*[@name='MultipleItems']").extract()[:1] or [None]
Or the other one
maybe = (sel.xpath("//*[@name='DoesNotExist']").extract() or [None])[0]
This isn't so bad, but .extract()[:1] or [None] vs. .extract1() is still a difference, besides having to remember to do the unpacking on the left-hand side.
Do such idioms satisfy anyone else?
As @pablohoffman said, this is a method which is used very frequently in spider code, and I'd still prefer a shorthand instead of appending [:1] or [None] all the time.
Now if extract()/re() already returned [None] instead of an empty list, this could be shortened to extract()[0] or extract()[-1:][0]... which I might see as the best solution yet, but maybe there'd be other issues with doing that.
Has anyone looked at pull request #624? It passes all tests and builds upon the work from #572.
@wsdookadr yes, #624 looks good, but we haven't decided yet if the feature itself is good or not.
I'd like to suggest an improvement to the scrapy Selector. I've seen this construction in many projects:
And what about the if result: and else:, or try: and except:, which should always be there? When we don't want ItemLoaders, the most common use of a selector is retrieving only a single element. Maybe there should be a method xpath1 or xpath_one or xpath_first that returns the first matched element or None?
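For illustration, such an xpath_first/xpath_one could be a thin wrapper over the existing method. The Selector below is a faked stand-in (canned results instead of real XPath evaluation), just to show the proposed return contract:

```python
class Selector:
    """Fake stand-in for scrapy's Selector: .xpath() returns canned
    results instead of evaluating the query against a document."""

    def __init__(self, matches):
        self._matches = matches

    def xpath(self, query):
        return list(self._matches)

    def xpath_first(self, query, default=None):
        # The proposed shorthand: first matched element, or `default`
        # (None by default) when nothing matches - never an IndexError.
        results = self.xpath(query)
        return results[0] if results else default
```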