scalingexcellence / scrapybook

Scrapy Book Code
http://scrapybook.com/

Issue on chapter 3 #23

Closed by OscarDgrouch 7 years ago

OscarDgrouch commented 7 years ago

This is related to chapter 3, the book instructs me to run on Addess Item xpath => //[@itemtype="http://schema.org/Place"][1]/text(). However I'm getting this: In [27]: response.xpath('//[@itemtype="http://schema.org/Place"][1]/text()').extract() Out[27]: [u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ', u'\n ']

When I run it without the text(), I get this:

[u'\n West Hampstead, London', u'\n Angel, London', u'\n Tower Bridge, London', u'\n Canary Wharf, London', u'\n Whitechapel, London', u'\n Chelsea, London', u'\n Hackney, London', u'\n Stratford, London', u'\n Canary Wharf, London', u'\n Chiswick, London', u'\n Highbury, London', u'\n Notting Hill, London', u'\n Brixton, London', u'\n Greenwich, London', u'\n Canary Wharf, London', u'\n Battersea, London', u'\n South Kensington, London', u'\n Camden, London', u'\n Wimbledon, London', u'\n West Hampstead, London', u'\n West Hampstead, London', u'\n Elephant And Castle, London', u'\n Angel, London', u'\n Heathrow, London', u'\n Bayswater, London', u'\n Seven Sisters, London', u'\n Angel, London', u'\n Angel, London', u'\n Battersea, London', u'\n Bethnal Green, London']

I tried playing with it and came up with this:

In [32]: response.xpath('//*[@itemtype="http://schema.org/Place"][1]/span/text()').extract()
Out[32]: [u'West Hampstead, London', u'Angel, London', u'Tower Bridge, London', u'Canary Wharf, London', u'Whitechapel, London', u'Chelsea, London', u'Hackney, London', u'Stratford, London', u'Canary Wharf, London', u'Chiswick, London', u'Highbury, London', u'Notting Hill, London', u'Brixton, London', u'Greenwich, London', u'Canary Wharf, London', u'Battersea, London', u'South Kensington, London', u'Camden, London', u'Wimbledon, London', u'West Hampstead, London', u'West Hampstead, London', u'Elephant And Castle, London', u'Angel, London', u'Heathrow, London', u'Bayswater, London', u'Seven Sisters, London', u'Angel, London', u'Angel, London', u'Battersea, London', u'Bethnal Green, London']

My questions: which XPath expression is right? And why am I getting an array instead of single values?

lookfwd commented 7 years ago

Hello, I see what you mean. I can confirm that:

scrapy shell http://web:9312/properties/index_00000.html
>>> response.xpath('//*[@itemtype="http://schema.org/Place"][1]/text()').extract()
[u'\n  ', ... u'\n  ', u'\n  ']
>>> response.xpath('//*[@itemtype="http://schema.org/Place"][1]/span/text()').extract()
[u'West Hampstead, London', ... , u'Bethnal Green, London']

The only issue is that, in the context of Chapter 3, you want to be crawling individual property pages, e.g.:

scrapy shell http://web:9312/properties/property_000000.html
>>> response.xpath('//*[@itemtype="http://schema.org/Place"][1]/text()').extract()
[u'West Hampstead, London']

In Chapter 5, page 99 you can find how to crawl the index pages directly with relative XPaths (see also here).

P.S. Sorry for the typo - they are mentioned as "Relevant XPath" on that page.