scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

[MRG+1] Migrating selectors to use parsel #1409

Closed eliasdorneles closed 8 years ago

eliasdorneles commented 8 years ago

Hey, folks!

So, here is my first stab at porting selectors to use Parsel.

~~I'm not very happy about the initialization of the self._parser attribute which is used to build the LxmlDocument instance for the response, that forced me to import _ctgroup from parsel. I'd love to hear any suggestions about how to handle that.~~

What do you think? Thank you!

kmike commented 8 years ago

Without LxmlDocument cache some parts of Scrapy can become less efficient. For example, LxmlLinkExtractor instantiates Selector(response), so for users which use both link extractors and selectors response will be parsed twice. Maybe there are other cases like this (maybe in user's components), I'm not sure. If we proceed with cache removal I think we should at least fix Scrapy built-in components.

dangra commented 8 years ago

If we proceed with cache removal I think we should at least fix Scrapy built-in components.

do you mean replacing calls like `html = Selector(response)` with `html = response.selector`? seems quite easy.

kmike commented 8 years ago

do you mean replacing calls like `html = Selector(response)` with `html = response.selector`? seems quite easy.

yep, that's it
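The caching concern behind this swap can be sketched with stand-in classes (hypothetical, not Scrapy's actual code): `response.selector` parses lazily and caches the result, so components that share the response never parse the body twice, whereas each `Selector(response)` call re-parses.

```python
class FakeSelector:
    """Stand-in for a parsel/scrapy Selector; parsing happens in __init__."""
    parse_count = 0

    def __init__(self, text):
        # pretend this is an expensive lxml parse of the response body
        FakeSelector.parse_count += 1
        self.text = text


class FakeResponse:
    """Stand-in for scrapy's TextResponse with its cached .selector property."""
    def __init__(self, body):
        self.body = body
        self._cached_selector = None

    @property
    def selector(self):
        # parse lazily, exactly once per response
        if self._cached_selector is None:
            self._cached_selector = FakeSelector(self.body)
        return self._cached_selector


resp = FakeResponse("<html><p>hi</p></html>")
s1 = resp.selector   # first access parses
s2 = resp.selector   # second access reuses the cached selector
assert s1 is s2
assert FakeSelector.parse_count == 1
```

This mirrors why dropping the `LxmlDocument` cache is safe only if built-in components stop constructing their own selectors.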

eliasdorneles commented 8 years ago

should I make that part of this PR?

dangra commented 8 years ago

For the code that forces the selector type, we can just let it be.

```
~/src/scrapy$ ack 'Selector\(' scrapy/
scrapy/http/response/text.py:106:            self._cached_selector = Selector(self)
scrapy/selector/unified.py:50:class Selector(object_ref):
scrapy/spiders/feed.py:72:            selector = Selector(response, type='xml')
scrapy/spiders/feed.py:76:            selector = Selector(response, type='html')
scrapy/linkextractors/sgml.py:130:            sel = Selector(response)
scrapy/linkextractors/lxmlhtml.py:68:        html = Selector(response)
scrapy/linkextractors/lxmlhtml.py:98:        html = Selector(response)
scrapy/utils/iterators.py:40:        yield Selector(text=nodetext, type='xml').xpath('//' + nodename)[0]
scrapy/utils/iterators.py:52:        xs = Selector(text=nodetext, type='xml')
```
dangra commented 8 years ago

@eliasdorneles yes, I think so, if it passes the tests with that simple change.

eliasdorneles commented 8 years ago

@dangra talking about tests, any idea why I'm getting failures on test_squeues.py there? :point_down:

dangra commented 8 years ago

talking about tests, any idea why I'm getting failures on test_squeues.py there?

no, but it started happening on a previous build, on a commit that is very unlikely to be the cause: https://travis-ci.org/scrapy/scrapy/jobs/74296461

dangra commented 8 years ago

if it is passing for "precise" then it is probably a bug introduced by some dependency upgrade.

Twisted 15.3.0 was released yesterday for example

dangra commented 8 years ago

It doesn't make sense to me

```
>>> import sys, pickle, cPickle
>>> sys.version
'2.7.10 (default, May 26 2015, 04:16:29) \n[GCC 5.1.0]'
>>> cPickle.dumps((lambda x: x), protocol=2)
---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
<ipython-input-10-e076fdfef0cc> in <module>()
----> 1 cPickle.dumps((lambda x: x), protocol=2)

PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
>>> pickle.dumps((lambda x: x), protocol=2)
---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
<ipython-input-11-fb0cc15a50f8> in <module>()
----> 1 pickle.dumps((lambda x: x), protocol=2)

/usr/lib64/python2.7/pickle.pyc in dumps(obj, protocol)
   1372 def dumps(obj, protocol=None):
   1373     file = StringIO()
-> 1374     Pickler(file, protocol).dump(obj)
   1375     return file.getvalue()
   1376 

/usr/lib64/python2.7/pickle.pyc in dump(self, obj)
    222         if self.proto >= 2:
    223             self.write(PROTO + chr(self.proto))
--> 224         self.save(obj)
    225         self.write(STOP)
    226 

/usr/lib64/python2.7/pickle.pyc in save(self, obj)
    284         f = self.dispatch.get(t)
    285         if f:
--> 286             f(self, obj) # Call unbound method with explicit self
    287             return
    288 

/usr/lib64/python2.7/pickle.pyc in save_global(self, obj, name, pack)
    746             raise PicklingError(
    747                 "Can't pickle %r: it's not found as %s.%s" %
--> 748                 (obj, module, name))
    749         else:
    750             if klass is not obj:

PicklingError: Can't pickle <function <lambda> at 0x7f48660f6848>: it's not found as __main__.<lambda>
>>> 
```
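The behaviour in the session above is expected on its own: pickle serializes functions by reference (module plus name), and a lambda has no importable name, so pickling it always fails. A minimal stdlib-only sketch of that rule, runnable on Python 3 as well:

```python
import pickle


def can_pickle(obj):
    """Return True if obj survives a pickle round-trip, else False."""
    try:
        pickle.loads(pickle.dumps(obj, protocol=2))
        return True
    except Exception:
        # lambdas and other unnameable objects raise PicklingError here
        return False


assert can_pickle([1, 2, 3]) is True
assert can_pickle(lambda x: x) is False  # no importable name -> not picklable
```

The puzzle in this thread is not that the lambda fails to pickle, but that the *type* of failure reported by the test suite changed.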
eliasdorneles commented 8 years ago

@dangra it's definitely weird, but I tested locally fixing Twisted version to 15.2.1 and it passed. no idea why, though.

kmike commented 8 years ago

On Travis it is failing with AssertionError, but we're expecting ValueError in the test. I have no idea how it can depend on installed requirements. For me this test passes locally with Twisted 15.3.0.

dangra commented 8 years ago

@kmike it fails if you run the whole test suite but passes if you run just that test file

eliasdorneles commented 8 years ago

okay, this is weird.

if I run (with the latest Twisted) just that test with `tox -r -e py27 -- tests/test_squeues.py`, it passes.

but if I run the whole suite with `tox -r -e py27`, it fails.

ideas?

dangra commented 8 years ago

running all tests with twisted 15.2.0 passes for me

eliasdorneles commented 8 years ago

yeah, the same with 15.2.1 for me.

dangra commented 8 years ago

It fails for 15.3.0 using `$ tox -e py27 tests/test_crawl.py tests/test_squeues.py`

kmike commented 8 years ago

I think it can be related to https://github.com/twisted/twisted/commit/a8d8a0c89dc530554c768739e83d7a852cb17f22 - see this line.

dangra commented 8 years ago

@kmike that's for sure.
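A plausible mechanism for the order-dependence (an assumption based on the linked commit, not stated in the thread): if a library registers a pickle reducer via `copyreg`, that registration lands in a process-global dispatch table, so merely importing it in an earlier test file changes how pickling fails for everything afterwards. A stdlib sketch of that global side effect:

```python
import copyreg

# copyreg.dispatch_table is process-global: a reducer registered by any
# imported module changes pickling behaviour for the whole process, which
# is why a failure can depend on which test files ran first.


class Marker:
    """Hypothetical type used only to demonstrate the global registration."""
    pass


assert Marker not in copyreg.dispatch_table

# Registering a reducer mutates global state visible to all later pickling.
copyreg.pickle(Marker, lambda m: (Marker, ()))

assert Marker in copyreg.dispatch_table
```

Under that assumption, running `tests/test_crawl.py` (which imports Twisted) before `tests/test_squeues.py` would explain the AssertionError-instead-of-ValueError mismatch.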

eliasdorneles commented 8 years ago

@dangra @kmike so, I've made the changes to use response.selector in link extractors.

The only place still using LxmlDocument is here: https://github.com/scrapy/scrapy/blob/master/scrapy/http/request/form.py#L60

I get a bunch of test failures if I try to replace that with response.selector, so I suppose it's not safe to assume all responses there will have a selector -- should we keep it there?

dangra commented 8 years ago

The only place still using LxmlDocument is here: https://github.com/scrapy/scrapy/blob/master/scrapy/http/request/form.py#L60

@elias replace it with the corresponding direct call to the lxml parser, following the re-encoding best practices from parsel (i.e.: body_as_unicode().encode('utf8'))
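A minimal sketch of that re-encoding pattern (the HTML body here is a made-up stand-in for `response.body_as_unicode()`): decode the body with the response's declared encoding first, then hand lxml UTF-8 bytes together with an explicitly UTF-8 parser, so lxml never mis-detects the charset on its own.

```python
import lxml.html

# stand-in for the unicode text returned by response.body_as_unicode()
body = u'<html><body><form action="/post"></form></body></html>'

# re-encode to UTF-8 and tell the parser so, as recommended for parsel
root = lxml.html.fromstring(
    body.encode('utf8'),
    parser=lxml.html.HTMLParser(encoding='utf8'),
)

forms = root.xpath('//form')
assert forms[0].get('action') == '/post'
```

This is the same idea FormRequest needs: parse the form markup directly with lxml instead of going through a cached `LxmlDocument`.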

eliasdorneles commented 8 years ago

@dangra okay, finally got rid of it. :) :tada: should I rebase and/or squash the commits?

eliasdorneles commented 8 years ago

Alright, updated parsel and the PR, I think I'm almost done now. What do you think?

eliasdorneles commented 8 years ago

Updated addressing @kmike's comments and rebased on top of current master.

dangra commented 8 years ago

tests/test_selector_csstranslator.py can be removed or replaced by tests that assert it warns.

test/selector.py can be simplified since most tests moved to parsel; we can keep flavor detection and any other response-based specific test cases.

after that I think it will just work on Python 3 too, and tests/py3-ignores.txt can be updated.
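A "test that asserts it warns" can be done with just the stdlib; the sketch below uses a hypothetical `deprecated_api` stand-in for importing the old module path, not Scrapy's actual deprecation shim:

```python
import warnings


def deprecated_api():
    # stand-in for hitting the deprecated scrapy.selector.csstranslator path
    warnings.warn("use parsel.csstranslator instead", DeprecationWarning)
    return "ok"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")   # make sure the warning is not suppressed
    result = deprecated_api()

assert result == "ok"
assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```

The same shape works with `pytest.warns(DeprecationWarning)` if the suite prefers pytest idioms.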

eliasdorneles commented 8 years ago

@dangra done! I also left the first simple test for selector to serve as base case.

eliasdorneles commented 8 years ago

@dangra updated and rebased on top of current master :)

dangra commented 8 years ago

great work @eliasdorneles !

dangra commented 8 years ago

@kmike feel free to merge when you are happy

kmike commented 8 years ago

Great job @eliasdorneles! Also, thanks @umrashrf for getting the ball rolling.