scrapy / scrapely

A pure-python HTML screen-scraping library

Unable to extract some fields ? #10

Closed javadi82 closed 12 years ago

javadi82 commented 12 years ago

Is it possible to extract the address: "15, Vishal Est, Opp Bharat Party Plt,amraiwadi, Ahmedabad-380026, Ahmedabad, 380026" from http://indisearch.com/s-v-machine-tools,-rajkot/74709

I ask because the address is not wrapped in its own HTML element.

I am comfortable editing the template.json file by hand, so if this can be done that way, please let me know.

Thanks.

shaneaevans commented 12 years ago

Yes, this is possible, using an undocumented feature of scrapely. You can insert tags yourself around the HTML you want to extract, and mark them with the "generated"=true attribute in the scrapely annotation.

Here is what I did. First, make a template:

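from scrapely import Scraper

s = Scraper()
url = "http://indisearch.com/s-v-machine-tools,-rajkot/74709"
data = {'address': "15, Vishal Est, Opp Bharat Party Plt,amraiwadi, Ahmedabad-380026,"}
s.train(url, data)  # fetch the page and build a template from the example value
with open("templates.json", 'w') as f:
    s.tofile(f)  # save the template so it can be edited by hand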

Now open the file and create "ins" tags (the actual tag name doesn't matter) surrounding the area to extract, then move the existing data-scrapy-annotate for the address onto the opening tag. Also add generated="true" to tell scrapely that this tag isn't in the original page.
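If you would rather script that edit than do it by hand, something like the following is a rough sketch of the idea. It is not the template I ended up with: `wrap_with_annotation` is a hypothetical helper, and the annotation layout it writes (an `annotations` mapping plus a `generated` flag, HTML-escaped inside `data-scrapy-annotate`) is an assumption, so compare it against the annotation scrapely actually wrote in your templates.json before relying on it.

```python
import json
from html import escape


def wrap_with_annotation(body, target, field, tag="ins"):
    """Wrap the first occurrence of `target` in `body` with an annotated tag.

    Hypothetical helper: `body` is the template HTML, `target` is the raw
    HTML fragment to extract, `field` is the field name to extract it into.
    """
    start = body.find(target)
    if start == -1:
        raise ValueError("target text not found in template body")
    end = start + len(target)

    # Assumed annotation layout; check it against your own template file.
    annotation = {"annotations": {"content": field}, "generated": True}
    attr = escape(json.dumps(annotation), quote=True)

    opening = '<%s data-scrapy-annotate="%s">' % (tag, attr)
    closing = "</%s>" % tag
    return body[:start] + opening + target + closing + body[end:]
```

You would apply this to the "body" string of the template and then remove the old annotation from the enclosing element, since the address should only be annotated once.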

Here is the template I ended up with.

Let's test it:

s = Scraper.fromfile(open("templates.json"))
s.scrape(url)
[{u'address': [u'15, Vishal Est, Opp Bharat Party Plt,amraiwadi, Ahmedabad-380026, <a href="/city/ahmedabad/17" class="category-text">Ahmedabad</a>, 380026']}]

Check it works on other similar pages on that site:

s.scrape("http://indisearch.com/e-business-experts-international/85809")
[{u'address': [u'834/8, Aziz Abad Fedral, B Area Karachi,75950, Pakistan, Karachi, <a href="/city/other-than-india/119" class="category-text">Other Than India</a>, 0']}]

Of course, you can put the closing "ins" tag in front of the "a" tag too if you don't need the final linked city. You'll probably need a second template for the pages that do not have the "Is this your business?" section.
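If you do want the city but as plain text rather than a link, one option is to leave the template as it is and strip the markup from the extracted value afterwards. A minimal sketch using only the Python 3 standard library (`strip_tags` is a hypothetical helper, not part of scrapely):

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment):
    """Return `fragment` with all tags removed and entities decoded."""
    parser = _TextOnly()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)
```

Applied to the first result above, this should give the address in the form asked for in the original question.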

javadi82 commented 12 years ago

Thanks! That solved the problem.

Also, there seems to be another parameter called "variant". Can you please elaborate on its use? Thanks.