scrapy / parsel

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors

HTML code extraction from node is not working #228

Closed (osjerick closed 2 years ago)

osjerick commented 3 years ago

I recently installed Scrapy into a new environment, and now, when I try to get the HTML source of a node, the selector returns the node plus all the markup that follows it in the page source.

Note: I installed parsel with Scrapy into conda environments using the conda-forge channel.

Current behavior:

scrapy shell https://picsart.com/design-templates
In [1]: response.css('h1').get()
Out[1]: '<h1 class="banner__title___2T69O" data-test="primary-title">Anything-But-Basic Design Templates</h1><div class="banner__note___2usDR" data-test="primary-description"><p>Picsart’s free templates are easy to use and fitting for any special occasion.</p></div></div><div class="banner__buttonsContainer___1rJTU"><div class="actionButton__blueAction___2HAym actionButton__actionContainer___26zdu"><a class="actionButton__action___1-L0i root-0-2-30 primary-0-2-31 responsive-0-2-33" data-test="primary-button" href="/create/editor">Try Templates</a></div></div></div></div><div class="banner__imageBlock___3OBoa"><div class="banner__imageHolder___2QGdJ"><picture class="root-0-2-42 banner__image___LKEfe"><source type="image/webp" media="(min-width: 1365px)" srcset="https://cdn130.picsart.com/46322149886354257770.png?type=webp&amp;to=min&amp;r=1365"></source><source type="image/webp" media="(min-width: 1023px)" srcset="https://cdn130.picsart.com/46322149886354257770.png?type=webp&amp;to=min&amp;r=1023"></source>
... I CROPPED THE OUTPUT ...'

This env consists of:

Previous behavior:

scrapy shell https://picsart.com/design-templates
In [1]: response.css('h1').get()
Out[1]: '<h1 class="banner__title___2T69O" data-test="primary-title">Anything-But-Basic Design Templates</h1>'

This env consists of:

borealbusiness commented 3 years ago

Same problem here with Python 3.9.2

elacuesta commented 3 years ago

This is strange. I cannot reproduce it in my local Linux Mint environment:

$ scrapy version -v
Scrapy       : 2.5.0
lxml         : 4.6.3.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 21.7.0
Python       : 3.9.6 (default, Jul  5 2021, 11:47:27) - [GCC 7.5.0]
pyOpenSSL    : 20.0.1 (OpenSSL 1.1.1l  24 Aug 2021)
cryptography : 3.4.8
Platform     : Linux-4.15.0-147-generic-x86_64-with-glibc2.27

nor in a Docker container with Python 3.8.10:

# scrapy version -v
Scrapy       : 2.5.0
lxml         : 4.6.3.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 21.7.0
Python       : 3.8.10 (default, Jun 23 2021, 15:19:53) - [GCC 8.3.0]
pyOpenSSL    : 20.0.1 (OpenSSL 1.1.1l  24 Aug 2021)
cryptography : 3.4.8
Platform     : Linux-4.15.0-147-generic-x86_64-with-glibc2.2.5

nor in my MBP:

$ scrapy version -v
Scrapy       : 2.5.0
lxml         : 4.6.3.0
libxml2      : 2.9.10
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 21.2.0
Python       : 3.8.6 (v3.8.6:db455296be, Sep 23 2020, 13:31:39) - [Clang 6.0 (clang-600.0.57)]
pyOpenSSL    : 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021)
cryptography : 3.4.7
Platform     : macOS-10.15.7-x86_64-i386-64bit

Downgrading to lxml==4.5.2 does not change the output.

Could you provide more information to reproduce?
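
For anyone trying to reproduce this, a small check like the one below might help separate affected from unaffected environments without eyeballing the output. It is only a sketch: `serializes_only_node` is a hypothetical helper name, not part of parsel or Scrapy, and it assumes the selected element does not contain a nested element with the same tag name.

```python
def serializes_only_node(fragment: str, tag: str) -> bool:
    """Return True if `fragment` looks like a single serialized <tag> element.

    Hypothetical helper, not part of parsel. It assumes the element does not
    contain a nested element with the same tag name.
    """
    text = fragment.strip()
    closing = f"</{tag}>"
    # A correct serialization starts with the opening tag, ends with the
    # closing tag, and contains that closing tag exactly once.
    return (
        text.startswith(f"<{tag}")
        and text.endswith(closing)
        and text.count(closing) == 1
    )


# Outputs taken from the original report (the bad one is shortened here).
good = (
    '<h1 class="banner__title___2T69O" data-test="primary-title">'
    "Anything-But-Basic Design Templates</h1>"
)
bad = good + '<div class="banner__note___2usDR"><p>...</p></div></div>'

print(serializes_only_node(good, "h1"))  # True
print(serializes_only_node(bad, "h1"))   # False
```

In a `scrapy shell` session the same check could be run as `serializes_only_node(response.css('h1').get(), 'h1')`.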

borealbusiness commented 3 years ago

$ scrapy version -v
Scrapy       : 2.5.0
lxml         : 4.6.3.0
libxml2      : 2.9.12
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 21.7.0
Python       : 3.9.2 (default, Feb 28 2021, 17:03:44) - [GCC 10.2.1 20210110]
pyOpenSSL    : 20.0.1 (OpenSSL 1.1.1k  25 Mar 2021)
cryptography : 3.3.2
Platform     : Linux-5.10.0-8-amd64-x86_64-with-glibc2.31

I am quite sure the problem occurred after I upgraded to the new Debian stable (bullseye).
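
Comparing the `scrapy version -v` reports line by line is one way to narrow down what changed between a working and an affected environment. The sketch below parses two reports and prints only the entries that differ. The function names `parse_versions` and `diff_versions` are made up for illustration, and the sample reports are abridged from the ones posted in this thread.

```python
def parse_versions(report: str) -> dict:
    """Parse `Name : value` lines from `scrapy version -v` output into a dict."""
    versions = {}
    for line in report.splitlines():
        # Split on the first colon only, so values containing colons survive.
        name, sep, value = line.partition(":")
        if sep:
            versions[name.strip()] = value.strip()
    return versions


def diff_versions(a: dict, b: dict) -> dict:
    """Return {name: (value_in_a, value_in_b)} for entries that differ."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Abridged from the reports in this thread.
working = """\
lxml         : 4.6.3.0
libxml2      : 2.9.10
parsel       : 1.6.0
"""
affected = """\
lxml         : 4.6.3.0
libxml2      : 2.9.12
parsel       : 1.6.0
"""

print(diff_versions(parse_versions(working), parse_versions(affected)))
# {'libxml2': ('2.9.10', '2.9.12')}
```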

borealbusiness commented 3 years ago

Could be linked to this: https://gitlab.gnome.org/GNOME/libxml2/-/issues/255

Edit: definitely linked to that issue. Downgrading to libxml2 2.9.10 fixes the problem.
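
A quick way to see which libxml2 a given environment actually loads is to ask lxml directly. The following is a minimal sketch: the `REPORTED_BAD` and `REPORTED_GOOD` sets contain only the versions mentioned in this thread and are an assumption, not an authoritative list of affected releases, and `classify_libxml2` and `runtime_libxml2_version` are hypothetical helper names.

```python
from typing import Optional, Tuple

# Versions reported in this thread; an assumption based on these comments
# only, not an authoritative list of affected libxml2 releases.
REPORTED_BAD = {(2, 9, 12)}
REPORTED_GOOD = {(2, 9, 10)}


def classify_libxml2(version: Tuple[int, ...]) -> str:
    """Label a libxml2 version according to the reports in this thread."""
    if version in REPORTED_BAD:
        return "reported affected"
    if version in REPORTED_GOOD:
        return "reported working"
    return "unknown"


def runtime_libxml2_version() -> Optional[Tuple[int, ...]]:
    """Return the libxml2 version lxml runs against, or None if lxml is missing."""
    try:
        from lxml import etree
    except ImportError:
        return None
    # LIBXML_VERSION is the version loaded at runtime, as a tuple of ints.
    return tuple(etree.LIBXML_VERSION[:3])


version = runtime_libxml2_version()
if version is None:
    print("lxml is not installed")
else:
    print(version, classify_libxml2(version))
```

Any version outside the two sets is labelled "unknown" on purpose, since the thread does not say which other releases behave which way.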

nraffuse commented 3 years ago

Thanks! That fixed it for me. Conda 4.10.3 erroneously selected libxml2 v2.9.12. Locking my environment to 2.9.10 (as lxml has done) solved the issue.

You can see in the May 18, 2021 series of commits to lxml's Makefile that 2.9.12 was tested and then promptly reverted.

Buratinator commented 3 years ago

I can confirm both the issue and the fix (i.e. downgrading libxml2 to 2.9.10). For me the issue was caused either by upgrading ipykernel from 5.3.4 to 6.2.0 or by installing eli5 under conda in a WSL2 setting (the history file got corrupted, so I am not sure which).

wRAR commented 2 years ago

AFAICS this is fixed in newer libxml2 so I don't think this should stay open.