Closed eliasdorneles closed 8 years ago
Without the LxmlDocument cache some parts of Scrapy can become less efficient. For example, LxmlLinkExtractor instantiates Selector(response), so for users who use both link extractors and selectors the response will be parsed twice. There may be other cases like this (maybe in users' components), I'm not sure. If we proceed with the cache removal I think we should at least fix Scrapy's built-in components.
do you mean replacing calls like html = Selector(response) by html = response.selector? seems quite easy.
yep, that's it
should I make that part of this PR?
for the code that forces the selector type, we just leave it as is.
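The caching that response.selector relies on can be sketched like this. This is a toy illustration of the pattern, not Scrapy's actual classes: FakeSelector, CachingResponse, and parse_count are made-up names, standing in for parsel's Selector and TextResponse._cached_selector.

```python
class FakeSelector:
    """Stand-in for Scrapy/parsel's Selector; real parsing would happen here."""
    def __init__(self, body):
        self.body = body


class CachingResponse:
    """Toy sketch of the response.selector caching pattern.

    The idea: the response keeps the parsed document in _cached_selector,
    so every component that uses response.selector shares one parse,
    while each direct Selector(response) call parses the body again.
    """

    def __init__(self, body):
        self.body = body
        self.parse_count = 0  # illustration-only counter

    @property
    def selector(self):
        # parse lazily, at most once; later accesses reuse the cached object
        if not hasattr(self, "_cached_selector"):
            self.parse_count += 1
            self._cached_selector = FakeSelector(self.body)
        return self._cached_selector
```

With this pattern a link extractor and a spider callback that both touch response.selector trigger a single parse, which is why swapping Selector(response) for response.selector in the built-in components keeps the efficiency the cache used to provide.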
~/src/scrapy$ ack 'Selector\(' scrapy/
scrapy/http/response/text.py:106: self._cached_selector = Selector(self)
scrapy/selector/unified.py:50:class Selector(object_ref):
scrapy/spiders/feed.py:72: selector = Selector(response, type='xml')
scrapy/spiders/feed.py:76: selector = Selector(response, type='html')
scrapy/linkextractors/sgml.py:130: sel = Selector(response)
scrapy/linkextractors/lxmlhtml.py:68: html = Selector(response)
scrapy/linkextractors/lxmlhtml.py:98: html = Selector(response)
scrapy/utils/iterators.py:40: yield Selector(text=nodetext, type='xml').xpath('//' + nodename)[0]
scrapy/utils/iterators.py:52: xs = Selector(text=nodetext, type='xml')
@eliasdorneles yes, I think so, if it passes the tests with that simple change.
@dangra talking about tests, any idea why I'm getting failures on test_squeues.py there? :point_down:
no, but it started happening on a previous build, for a commit that is very unlikely to be the cause: https://travis-ci.org/scrapy/scrapy/jobs/74296461
if it is passing for "precise" then it is probably a bug introduced by some dependency upgrade
Twisted 15.3.0 was released yesterday for example
It doesn't make sense to me
>>> import sys, pickle, cPickle
>>> sys.version
'2.7.10 (default, May 26 2015, 04:16:29) \n[GCC 5.1.0]'
>>> cPickle.dumps((lambda x: x), protocol=2)
---------------------------------------------------------------------------
PicklingError Traceback (most recent call last)
<ipython-input-10-e076fdfef0cc> in <module>()
----> 1 cPickle.dumps((lambda x: x), protocol=2)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
>>> pickle.dumps((lambda x: x), protocol=2)
---------------------------------------------------------------------------
PicklingError Traceback (most recent call last)
<ipython-input-11-fb0cc15a50f8> in <module>()
----> 1 pickle.dumps((lambda x: x), protocol=2)
/usr/lib64/python2.7/pickle.pyc in dumps(obj, protocol)
1372 def dumps(obj, protocol=None):
1373 file = StringIO()
-> 1374 Pickler(file, protocol).dump(obj)
1375 return file.getvalue()
1376
/usr/lib64/python2.7/pickle.pyc in dump(self, obj)
222 if self.proto >= 2:
223 self.write(PROTO + chr(self.proto))
--> 224 self.save(obj)
225 self.write(STOP)
226
/usr/lib64/python2.7/pickle.pyc in save(self, obj)
284 f = self.dispatch.get(t)
285 if f:
--> 286 f(self, obj) # Call unbound method with explicit self
287 return
288
/usr/lib64/python2.7/pickle.pyc in save_global(self, obj, name, pack)
746 raise PicklingError(
747 "Can't pickle %r: it's not found as %s.%s" %
--> 748 (obj, module, name))
749 else:
750 if klass is not obj:
PicklingError: Can't pickle <function <lambda> at 0x7f48660f6848>: it's not found as __main__.<lambda>
>>>
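The traceback above comes down to how pickle serializes functions: by reference, as a module name plus attribute name to look up on unpickling. A lambda has no importable name, so the lookup fails. A minimal illustration (Python 3 syntax, same behavior as the cPickle session above; can_pickle is a helper made up for this sketch):

```python
import pickle


def named_function(x):
    # a module-level function pickles fine: it is stored as a reference
    # (module + qualified name) that can be looked up again on load
    return x


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj, protocol=2)
        return True
    except Exception:
        return False
```

`can_pickle(named_function)` is True while `can_pickle(lambda x: x)` is False, which is presumably what the disk-queue tests run into when a test fixture tries to serialize something that is not picklable.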
@dangra it's definitely weird, but I tested locally pinning Twisted to 15.2.1 and it passed. no idea why, though.
On Travis it is failing with AssertionError, but we're expecting ValueError in the test. I have no idea how it can depend on installed requirements. For me this test passes locally with Twisted 15.3.0.
@kmike it fails if you run the whole test suite but passes if you run just that test file
okay, this is weird.
if I run (with latest Twisted) just that test with tox -r -e py27 -- tests/test_squeues.py, it passes.
but if I run the whole suite with tox -r -e py27, it fails.
ideas?
running all tests with twisted 15.2.0 passes for me
yeah, the same with 15.2.1 for me.
It fails for 15.3.0 using $ tox -e py27 tests/test_crawl.py tests/test_squeues.py
I think it can be related to https://github.com/twisted/twisted/commit/a8d8a0c89dc530554c768739e83d7a852cb17f22 - see this line.
@kmike that's for sure.
@dangra @kmike so, I've made the changes to use response.selector in the link extractors.
The only place still using LxmlDocument is here: https://github.com/scrapy/scrapy/blob/master/scrapy/http/request/form.py#L60
I get a bunch of test failures if I try to replace that with response.selector, so I suppose it's not safe to assume all responses there will have a selector -- should we keep it there?
@elias replace it with the corresponding direct call to the lxml parser, following the re-encoding best practices from parsel (i.e.: body_as_unicode().encode('utf8'))
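A rough sketch of that approach, assuming lxml is available (the function name and signature are made up for illustration): the caller decodes the body to unicode first (e.g. via body_as_unicode() in Scrapy), then we re-encode to UTF-8 and tell the parser the encoding explicitly, so any conflicting charset declaration inside the document cannot mislead lxml.

```python
import lxml.html


def parse_html_utf8(unicode_text):
    """Parse already-decoded HTML text directly with lxml.

    Mirrors the parsel re-encoding practice: hand lxml UTF-8 bytes
    together with an explicit encoding, so the bytes and the declared
    encoding always agree regardless of what the document claims.
    """
    parser = lxml.html.HTMLParser(encoding="utf-8")
    return lxml.html.fromstring(unicode_text.encode("utf-8"), parser=parser)
```

This removes the need for LxmlDocument in form.py: the form-handling code can walk the returned lxml tree (e.g. root.xpath('//form')) without going through a cached selector at all.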
@dangra okay, finally got rid of it. :) :tada: should I rebase and/or squash the commits?
Alright, updated parsel and the PR, I think I'm almost done now. What do you think?
Updated addressing @kmike's comments and rebased on top of current master.
tests/test_selector_csstranslator.py can be removed or replaced by tests that assert it warns.
test/selector.py can be simplified since most tests moved to parsel; we can keep flavor detection and any other response-specific test cases.
after that I think it will just work on Python 3 too, with tests/py3-ignores.txt updated.
@dangra done! I also left the first simple test for selector to serve as base case.
@dangra updated and rebased on top of current master :)
great work @eliasdorneles !
@kmike feel free to merge when you are happy
Great job @eliasdorneles! Also, thanks @umrashrf for getting the ball rolling.
Hey, folks!
So, here is my first stab at porting selectors to use Parsel.
~~I'm not very happy about the initialization of the self._parser attribute, which is used to build the LxmlDocument instance for the response; that forced me to import _ctgroup from parsel. I'd love to hear any suggestions about how to handle that.~~

What do you think? Thank you!