scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

New selector method: extract_first() #568

Closed. shirk3y closed this issue 9 years ago.

shirk3y commented 10 years ago

I've been thinking about a suggestion to improve the Scrapy Selector. I've seen this construction in many projects:

result = sel.xpath('//div/text()').extract()[0]

And what about the if result:/else: or try:/except: blocks that should always surround it? When we don't want ItemLoaders, the most common use of a selector is retrieving a single element. Maybe there should be a method xpath1, xpath_one or xpath_first that returns the first matched element, or None?
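
Spelled out, the defensive variants this is meant to replace look like this (a sketch; the XPath is just the example above):

result = sel.xpath('//div/text()').extract()
if result:
    result = result[0]
else:
    result = None

# or:
try:
    result = sel.xpath('//div/text()').extract()[0]
except IndexError:
    result = None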

rmax commented 10 years ago

I've made a pull request (#569) which proposes a solution to this issue.

dangra commented 10 years ago

I think the problem is not in the selecting methods .xpath() and .css() but in .extract(); we can add a parameter to get the first result.

@darkrho: in #569 you propose .get(), but it doesn't extract; instead it returns the first element of a SelectorList, which still needs an .extract() call to get the result wanted in this issue:

>>> sel.css('span').get(0).extract()
u'<span>foo</span>'

I think the common case is always to get the first element; indexing can still be addressed with the xpath or css methods:

>>> sel.css('span').extract(first=True)
u'<span>foo</span>'

shirk3y commented 10 years ago

It also could be like this:

sel.css('span').extract_first()

Maybe it would prevent constructions that aren't very clear, e.g.:

sel.css('span').extract(True)


rmax commented 10 years ago

I agree with @shirk3y: instead of having an ambiguous method, it would be much better to have a new one. Another possible name is .first().

nramirezuy commented 10 years ago

I prefer sel.xpath('//p/text()').extract(index): if index is None we get the list of results by default, and if there is no element at that index it returns None instead of an empty list.

shirk3y commented 10 years ago

As @dangra said, you can always select the n-th element using XPath, so does it make sense to write something like this? sel.xpath('//p/text()').extract(2)

nramirezuy commented 10 years ago

There is a difference in the data type returned:

>>> sel = Selector(text='<div id="breadcrumb"><a href="./index">Home</a><a href="./Cat-1">Cat-1</a><a href="./Sub-1">Sub-1</a></div>', type='html')
>>> sel.xpath('//div[@id="breadcrumb"]/a/text()').extract()
[u'Home', u'Cat-1', u'Sub-1']
>>> sel.xpath('//div[@id="breadcrumb"]/a/text()').extract(1)
u'Cat-1'
>>> sel.xpath('(//div[@id="breadcrumb"]/a)[2]/text()').extract()
[u'Cat-1']
>>> 
>>> sel.xpath('//div[@id="breadcrumb"]/a/text()').extract(6)
>>> sel.xpath('(//div[@id="breadcrumb"]/a)[7]/text()').extract()
[]

kmike commented 10 years ago

I like .extract_first() more, because it is not a good style to change the return type based on method arguments. See also Guido's opinion: https://mail.python.org/pipermail/python-dev/2005-August/055907.html.

It looks like sel.xpath('//p/text()').extract(1) would be less efficient than sel.xpath('//p[2]/text()').extract_first() unless we employ some trick.

sel.xpath('//p/text()').first(): I'd expect it to return a selector, not a raw string value; extract_first wins again :)

dangra commented 10 years ago

Oh, but even Guido mentions there can be exceptions! ;)

I have to admit the parameter thing isn't a good idea, +1 to whatever name you guys choose for the new method.

shirk3y commented 10 years ago

Please see: #572

kmike commented 10 years ago

/cc our naming tsar @pablohoffman

rmax commented 10 years ago

I agree that extract_first is a better name than just first.

kmike commented 10 years ago

We've missed the .re('\d+').get(0) case implemented in #569. And my comment about efficiency applies to .extract_first() as well.

shirk3y commented 10 years ago

Maybe there should be re_first() as well?

pablohoffman commented 10 years ago

I like extract1() / re1(), mainly because they're shorter, which is a big advantage for a method that's gonna appear very frequently in spider code.

umrashrf commented 10 years ago

I like @nramirezuy's .extract(index), and why not also .re('\d+', index)?

nyov commented 10 years ago

I was also leaning towards .extract(index), and .re('\d+', index), as that seems more versatile. (It's similar to what I'm currently using.) But after thinking this through, it's correct that xpath or css selectors can always be tuned to get the correct single item.

So +1 to @pablohoffman for .extract1() / .re1(). (How about an alias .x() / .x1() / .r() ?)

And for returning a single index (or range) as a list, there is still list slicing syntax .extract()[-1:].

(edit) I would also add that extract1 / x1 looks better, as it may be referred to as "extract first" or "extract one"; because whether the return is a first/last/middle value is in the eye of the selector ;) But we know it'll always be a single value.

shirk3y commented 10 years ago

There is a deprecated method .x() already; it was probably deprecated because it's too short and meaningless. +1 for .extract1() / .re1()

nyov commented 10 years ago

right, this isn't javascript (but sad).

kmike commented 10 years ago

I'm not a big fan of extract1() and re1(): they are more ambiguous than extract_first() and re_first(), because the "1" tells us these methods return a single element, but it doesn't tell us which element exactly they return. Also, re1 and rel look quite similar.

shirk3y commented 10 years ago

I wonder if we can mimic dict.get(key, default) to retrieve a default value instead of None when no element was found.

It would be like this: sel.xpath('//p/text()').extract_first(default='N/A')
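
A minimal sketch of how that might look on SelectorList (hypothetical, not an actual implementation; extract_first and its default argument are the proposal above):

class SelectorList(list):
    def extract_first(self, default=None):
        # Extract only the first matched element; mimic dict.get() by
        # returning `default` instead of raising when nothing matched.
        for sel in self:
            return sel.extract()
        return default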

nyov commented 10 years ago

@kmike, I would argue the opposite. If you name it extract_first(), people might go looking for extract_last() and miss the point of using selectors to achieve that goal. (Happened to me.)

Certainly if the selector returns multiple elements, it'll pick the first value of that list. But more often I would have a selector tailored to only return a single element in the first place, say xyz[last()]. And then from that perspective (of the selector) extract_first() might even look strange, since I'm trying for the last element. I think extract1() would just as easily be remembered for both "first" and "single". (The 1 vs l issue I would always argue as a 'using the wrong font' problem :)

@shirk3y, interesting - but then IMO it'd feel more natural to name it .get(), where that behaviour is more expected, or to pass the default as a second argument to extract(), as in extract(1, 'fail'). Or maybe .extract1() or 'fail' would be enough? Though I'm not opposed in general.

nramirezuy commented 10 years ago

@shirk3y We should start by asking: why don't those projects use ItemLoaders? Is there a real reason not to use them?

shirk3y commented 10 years ago

In my opinion, extract1() could be useful when we want to create a spider quickly, even without defining Items. Maybe not using ItemLoaders is bad practice, but they are not necessary in some cases.

kmike commented 10 years ago

@nramirezuy I think there are several reasons for not using ItemLoaders. Let's try best practices:

# === items.py or loaders.py ===

from scrapy.contrib.loader import ItemLoader  # BOILERPLATE
from scrapy.contrib.loader.processor import TakeFirst  # BOILERPLATE

# ...

class MyItemLoader(ItemLoader):  # BOILERPLATE
    default_item_class = MyItem  # BOILERPLATE
    my_field_out = TakeFirst()  # BOILERPLATE

# === myspider.py ===
from myproject.items import MyItemLoader  # BOILERPLATE
# ...

    def parse(self, response):
        ld = MyItemLoader(response)  # BOILERPLATE
        # ...
        return ld.load_item()

That's a lot of new concepts and 7 extra lines of code in 2 files, while the user just wanted to do .extract()[0] and avoid exceptions when nothing is extracted. Item loaders can be useful, and this boilerplate is justified if MyItemLoader is going to be used several times, or if it is possible to write some base item loader for many items, but having one item per spider is quite common.

Item loaders are powerful, but not simple. There are input processors and output processors (if you ask me, I still can't tell the difference without re-reading the docs), Compose and MapCompose (the same), import locations to remember (from scrapy.contrib.loader), magic class attributes (default_item_class) and conventions for magic class attributes (_out) to remember, and no docstrings to look all that up in an IDE. I still can't remember what arguments a processor should take (a value? an iterable of values? maybe something different for input and output processors?) and what it should return.

nramirezuy commented 10 years ago

@kmike You can use ItemLoaders straightforwardly without defining a custom one. Note that TakeFirst is not the same as list[0]:

>>> from scrapy.http import HtmlResponse
>>> from scrapy.contrib.loader import ItemLoader
>>> from scrapy.contrib.loader.processor import TakeFirst
>>> response = HtmlResponse('http://scrapinghub.com', body='<html><body><div><a href="">Home</a><a href="./index">Index</a></div><div><a href="./faq">FAQ</a></div></body></html>')
>>> loader = ItemLoader(response=response, item={})
>>> loader.add_xpath('url', '//a/@href')
>>> loader.load_item()
{'url': [u'', u'./index', u'./faq']}
>>> loader = ItemLoader(response=response, item={})
>>> loader.add_xpath('url', '//a/@href', lambda x: x[0])
>>> loader.load_item()
{'url': [u'']}
>>> loader = ItemLoader(response=response, item={})
>>> loader.add_xpath('url', '//a/@href', TakeFirst())
>>> loader.load_item()
{'url': [u'./index']}

shirk3y commented 10 years ago

There is another use case. For example, if you want to extract link to the next page, but you don't need to store it in item, you'll probably use plain Selector, and taking only first element becomes handy.
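
For example, pagination could then be handled with a plain selector (a sketch using the proposed method; the XPath and the Request wiring are only illustrative):

import urlparse
from scrapy.http import Request
from scrapy.selector import Selector

def parse(self, response):
    sel = Selector(response)
    # ... extract and yield items here ...
    next_url = sel.xpath('//a[@rel="next"]/@href').extract_first()
    if next_url is not None:
        yield Request(urlparse.urljoin(response.url, next_url))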

nramirezuy commented 10 years ago

@shirk3y That is covered too. The only thing missing is an easier import.

>>> loader.get_xpath('//@href', TakeFirst())
u'./index'
redapple commented 10 years ago

This is becoming a discussion on using item loaders or not :) My 2c on that: I agree with @kmike that item loaders can be powerful but are not easy. AFAIK, item loaders do not help when you traverse a document, select different portions, and within each select a few data bits with short selectors. More often than not, I find myself working with selector-extracted data and then using add_value(). That may not be best practice, but it works for me.

And as for the name, numbers in method names, especially 1, make me frown. And fonts? Well, sometimes you look at code from mobile phones, on Stack Overflow, in Google Groups etc.; you don't always choose the font ;)

redapple commented 10 years ago

By the way, an interesting comparison is with the ElementTree and BeautifulSoup APIs, which do it the other way around: .find() gets you the first element or None, and .findall() (or .find_all() for BS) gets you the list.

And I must admit I want the first element much more often than a list.
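
For reference, the stdlib ElementTree behaviour:

>>> import xml.etree.ElementTree as ET
>>> root = ET.fromstring('<div><span>foo</span><span>bar</span></div>')
>>> root.find('span').text  # first match
'foo'
>>> [el.text for el in root.findall('span')]  # all matches
['foo', 'bar']
>>> root.find('missing') is None  # no match: None, not an exception
True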

rmax commented 10 years ago

@redapple, +1 to deprecate .extract() and add .find() and .findall() (or something along those lines). The .re() can become a parameter: .find(re="\d+"), just like the loader, which allows an optional re parameter.

redapple commented 10 years ago

@darkrho, as the searching part of "find" is done in .xpath() or .css(), maybe the .get() idea above (together with .getall()) is what fits best.

nramirezuy commented 10 years ago

@redapple I'm just trying to point out that this feature is not needed; loaders are the way to go. I think the examples I posted here are not complicated at all.

I would like to hear something from the Elders (@dangra, @pablohoffman), about this.

>>> from scrapy.contrib.loader import ItemLoader
>>> from scrapy.contrib.loader.processor import Join
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse('http://scrapinghub.com', body='<html><body><div><a href="">Home</a><a href="./index">Index</a></div><div><a href="./faq">FAQ</a></div></body></html>')
>>> loader = ItemLoader(response=response, item={})
>>> loader.add_xpath('url', '(//a/text())[1]')
>>> loader.add_xpath('url', '(//a/text())[2]')
>>> loader.load_item()
{'url': [u'Home', u'Index']}
>>> loader.url_out = Join(', ')
>>> loader.load_item()
{'url': u'Home, Index'}
nyov commented 10 years ago

I don't see ItemLoaders as an essential part of Scrapy, but as a useful extension. IMHO they shouldn't be a requirement, to keep the core functionality clean and simple. I like that they could be stripped out into an extension now, and that there is no dependency on them.

And take a (banking) site, which randomizes everything down to form names and form input field names:

form = sel.xpath(XPATH)
username = form.xpath('.//input[@type="text"][1]').xpath('@name|@id').extract1()
password = form.xpath('.//input[@type="password"][1]').xpath('@name|@id').extract1()
return FormRequest.from_response(response,
    formxpath=XPATH,
    formdata={
        username: 'some',
        password: 'login',
        more: data,
    }
)

Should I need ItemLoaders for something like this?

nramirezuy commented 10 years ago

You don't need it, but can use it.

>>> response = HtmlResponse('http://scrapinghub.com', body='<html><body><form><input type="text" name="username" id="username"><input type="password" name="passwd" id="passwd"></form></body></html>')
>>> loader = ItemLoader(response=response)
>>> loader.get_xpath('//input[@type="text"]/@*[name()="name" or name()="id"]', TakeFirst())
u'username'
>>> loader.get_xpath('//input[@type="password"]/@*[name()="name" or name()="id"]', TakeFirst())
u'passwd'

rmax commented 10 years ago

The use of loader.get_xpath is very compelling, because many times I want to extract a value, apply some cleaning and use it for a URL or something. get_xpath provides a clean API for this use case.

Nevertheless, I can't wrap my head around the idea of having to use an item loader when I don't want to load an item.

This might be solved by making selector.xpath and selector.css behave like loader.get_xpath. But this implies changing the API, and it doesn't address the fact that selector.xpath and selector.css return a SelectorList.


nramirezuy commented 10 years ago

@darkrho If you want the SelectorList, you are not going to use this change. Using it for a URL: https://github.com/scrapy/scrapy/issues/578, https://github.com/scrapy/scrapy/pull/346

I feel that if we approve this change, people will start using Selector to populate the items, because it is "easy", which it is not in the long term.

redapple commented 10 years ago

we definitely need some wisdom from the "elders" @pablohoffman @dangra :)

nyov commented 10 years ago

I feel if we approve this change people will start using Selector to populate the items [...]

This is already the case for most simple projects, and, as the more intuitive way, should be the goal going forward IMHO. In fact, would there be any downside to moving ItemLoader's input processors to replace Selectors' static extract()? That looks like it could fix the problems on both sides - ItemLoaders that don't take SelectorLists and Selectors that don't allow cleaning.

ItemLoaders take a lot of time to learn; I think no newbie in IRC has ever had ItemLoaders in their code.

They add code complexity with no payoff in a simple project, adding quite a bit to the learning curve: ItemLoaders, Processors, how they work, how they chain.

This is made more profound by all the ways you can put them together. Define them in the item or in the spider, put processors in loader subclasses or pass them as arguments to the l.add_*() methods, then another layer of depth using contexts... and so on. I was certainly chewing on them for more than a week initially, writing and rewriting, testing the best conventions for writing them for myself.

I don't see why we should purposely steepen the learning curve by forcing them on people like this.

Besides, in all the code of others I have seen to date, nobody came up with a solution like the one in https://github.com/scrapy/scrapy/issues/568#issuecomment-34233291; they usually write a wrapper function as a workaround instead. So this is not intuitive.

And there is the unnecessary instantiation of an item, which is never used, when using an ItemLoader as in the example earlier. In summary, ItemLoaders look too "fat" and clumsy, like a hack for this use case.

kmike commented 10 years ago

Let's compare one of Nicolás' examples (very compelling, by the way) with how it could be written without item loaders:

>>> loader = ItemLoader(response=response, item={})
>>> loader.add_xpath('url', '(//a/text())[1]')
>>> loader.add_xpath('url', '(//a/text())[2]')
>>> loader.load_item()
{'url': [u'Home', u'Index']}
>>> loader.url_out = Join(', ')
>>> loader.load_item()
{'url': u'Home, Index'}

This is how it may look like without item loaders:

>>> sel = Selector(response)
>>> urls = sel.xpath('(//a/text())[1]').extract()
>>> urls.extend(sel.xpath('(//a/text())[2]').extract())
>>> dict(url=urls)
{'url': [u'Home', u'Index']}
>>> dict(url=', '.join(urls)) 
{'url': u'Home, Index'}

Pro item loader:

Neutral:

Cons item loader:

For me the main issue is that they are stateful. When you call sel.xpath(..) or .extract() you get a new value; when you call .add_xpath you mutate some variable in a non-obvious way. I don't like loader.url_out = Join(', ') either.

It is named "item loader", but when you work with it you should think of it as an item with some extra voodoo, not as a data transformation pipeline.

Another issue is #578 - as Nicolás said, there are item loaders, selectors and link extractors, and they can do basically the same things. ld.get_xpath looks an awful lot like selectors. It is not good that we have so many slightly different ways to do a basic task.

The third issue is complexity - I agree with @nyov here.

It is good that this discussion started before we merged an extract_first() PR or promoted item loaders out of contrib :)

nyov commented 10 years ago

take_first(sel.xpath('...')) vs ld.get_xpath('...', TakeFirst()) is a question of style [...]

Not exactly, which is my other grief. The first looks ugly but allows arbitrary XPaths to be passed; the second's XPath 'base' is locked in at ItemLoader instantiation time via the loader context and can't be re-set during its lifetime, e.g. for traversal loops, or for partitioning a page into sections by parent tables and such.

And unless you need l.load_item(), and only use l.get_xpath() or such, you're just pulling along an instantiated but never-populated item for the loader's lifetime.

Another thought experiment...

  1. Create a new Extractor class which can use Processors (deprecating ItemLoader's input processing, and maybe all the XPathItemLoader functionality since its unification?)
  2. Selector's extract() instead exposes an Extractor.process() or something similar, which can take Processors. (extract() would equal extract(Identity()), maybe.)
  3. LinkExtractors become processors for Extractor; or subclasses?

This would give us a separation of concerns here: Selector handles Selector(List)s; Extractor handles extraction with processors; LinkExtractors become special Extractor cases (processors); ItemLoaders get much simpler without the selector baggage, possibly reduced to a single 'output processor' chain? Haven't quite thought this through, but it looks neat in concept ;)
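
A very rough sketch of what that could look like (entirely hypothetical; the Extractor class and its process() method are invented here for illustration):

class Extractor(object):
    def __init__(self, *processors):
        # Processors are callables applied in turn to the extracted
        # values, in the spirit of the existing loader processors.
        self.processors = processors

    def process(self, selector_list):
        values = [sel.extract() for sel in selector_list]
        for proc in self.processors:
            values = proc(values)
        return values

# usage sketch: Extractor(TakeFirst()).process(sel.css('span'))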

nramirezuy commented 10 years ago

The selector can be changed at any time; we just need a fancy method for it.

>>> for sel in loader.selector.xpath('//a'):
...     loader.selector = sel
...     loader.get_xpath('.//text()')
... 
[u'Home']
[u'Index']
[u'FAQ']

The item instantiation is not needed for data extraction; I used a dict in some cases.

You can use multiple XPath expressions in the same add_xpath call.

>>> loader.add_xpath('url', ('(//a/text())[1]', '(//a/text())[2]'))
>>> loader.load_item()
{'url': [u'Home', u'Index']}

I like having multiple processor chains on my ItemLoaders; it is handy in some cases. I don't think you need to remove them, mainly because you can just use one and keep things simple.

barraponto commented 10 years ago

I learned a lot about using ItemLoaders from reading this thread. I guess they could be mentioned in a comment in scrapy/templates/spiders/crawl.tmpl to expose more users to them.

and @pablohoffman, please, extract1 is an awful name. :-1:

pablohoffman commented 10 years ago

@barraponto so what was your suggestion?

barraponto commented 10 years ago

Either extract_one or the pair get and get_all. I mean, adding numbers to method names to make them shorter feels weird.

Digenis commented 10 years ago

Using only built-in Python features, I came up with an approach for a similar scenario: an idiom for any list where I expect exactly one member. Suppose I want to:

res = sel.xpath("//*[@name='Exactly Once']").extract()
assert 1 == len(res), IndexError('Expected exactly 1 match, not %d' % len(res))
res = res[0]

I abbreviate this as:

res, = sel.xpath("//*[@name='Exactly Once']").extract()

Unpack a single element, or raise an exception. Of course this doesn't apply everywhere, but it covers many variations of this issue in one line. Unfortunately it doesn't solve this functionally, but when it works I stop here. More idiomatic workarounds follow:

Sometimes I can just star the extracted list when passing it to a function that expects exactly one argument.

text_size = len(*sel.xpath("//*[@name='Exactly Once']").extract())

Functional assertions can be abbreviated too (but most don't like the following):

# Because int() optionally accepts a second argument
num = int((lambda i: i)(*sel.xpath("//*[@name='Exactly Once']").extract()))

Next subject: a get()-style method with a default value.

maybe_node0 = (sel.xpath("//*[@name='Exactly Once']").extract() or [None])[0]
maybe_node1 = (sel.xpath("//*[@name='Exactly Once']").extract()[1:] or [None])[0]

Speaking for myself, in cases where I just want to avoid multi-line try/except and if/else constructs, I don't feel like reinventing the wheel and I just use slices. Do such idioms satisfy anyone else?

nramirezuy commented 10 years ago

@Digenis I like your approach; it is even better than extract_one() because you can select the element you want from the list:

>>> a, = [][:1] or [None]
>>> a
>>> a, = [1, 2][:1] or [None]
>>> a
1
>>> a, = [1, 2, 3][1:2] or [None]
>>> a
2

nyov commented 10 years ago

>>> a, = []
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: need more than 0 values to unpack

Ok, so we could write

res, = sel.xpath("//*[@name='DoesNotExist']").extract() or [None]

But then we have a ValueError again for multiple list items. So we need at least

res, = sel.xpath("//*[@name='MultipleItems']").extract()[:1] or [None]

Or the other one

maybe = (sel.xpath("//*[@name='DoesNotExist']").extract() or [None])[0]

This isn't so bad, but .extract()[:1] or [None] vs. .extract1() is still quite a difference, besides having to remember to do the unpacking on the left-hand side.

Do such idioms satisfy anyone else?

As @pablohoffman said, this is a method which is used very frequently in spider code, and I'd still prefer a shorthand instead of appending [:1] or [None] all the time.

Now if extract()/re() already returned [None] instead of an empty list, this could be shortened to extract()[0] or extract()[-1:][0]... which I might consider the best solution yet, but maybe there'd be other issues with doing that.

wsdookadr commented 10 years ago

Has anyone looked at pull request #624? It passes all tests and builds upon the work from #572.

kmike commented 10 years ago

@wsdookadr yes, #624 looks good, but we haven't decided yet if the feature itself is good or not.