scrapinghub / portia

Visual scraping for Scrapy
BSD 3-Clause "New" or "Revised" License

multiple templates per item #105

Open tpeng opened 9 years ago

tpeng commented 9 years ago

Motivation

Quite often we need to parse multiple pages for a single item. For example, for http://www.diningcity.com/nl/amsterdam/restaurant_vondelpark3, the name and address are straightforward to parse, but to get the phone number and reviews we need to follow links, parse the data on those pages, and merge it with the data we crawled on the first page.

The idea of this proposal is to be able to annotate multiple pages in Portia and learn to crawl one item from multiple templates.

[Images: portia_1, portia_2, portia_3]

Proposals

  1. Introduce a hierarchy to the templates (i.e. HtmlPage), so Portia knows which templates are used to build an extraction tree. For instance, when annotating the phone link above, Portia will know that the template from this link is a child template of the first template.
  2. Introduce an operator for the data crawled from the child templates: it can be merged into the parent page or appended to the parent (e.g. for pagination, see https://github.com/scrapinghub/portia/issues/107).
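To make the two proposals concrete, here is a minimal sketch in plain Python. It assumes a hypothetical `Template` tree and hypothetical "merge" and "append" operators; none of these names exist in Portia or slybot, and the rule that parent fields win on a merge conflict is an arbitrary choice for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Template:
    """Hypothetical template node; not a Portia or slybot class."""
    name: str
    operator: str = "merge"  # how this template's data joins its parent
    children: List["Template"] = field(default_factory=list)


def combine(parent: Dict[str, Any], child: Dict[str, Any], operator: str) -> Dict[str, Any]:
    """Return a new item with the child's data folded into the parent's."""
    result = dict(parent)
    if operator == "merge":
        # Child fields become parent fields; existing parent fields are kept.
        for key, value in child.items():
            result.setdefault(key, value)
    elif operator == "append":
        # Child values extend the parent's values for the same field.
        for key, value in child.items():
            existing = result.get(key, [])
            if not isinstance(existing, list):
                existing = [existing]
            new = value if isinstance(value, list) else [value]
            result[key] = existing + new
    else:
        raise ValueError(f"unknown operator: {operator}")
    return result


def build_item(template: Template, extracted: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Walk the template tree, folding each child's data into its parent."""
    item = dict(extracted.get(template.name, {}))
    for child in template.children:
        item = combine(item, build_item(child, extracted), child.operator)
    return item
```

Here `extracted` maps a template name to the fields extracted with that template. A main template with a "merge" child for the contact page and an "append" child for the reviews page would then produce a single item.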

[Images: t1, t2]

Issues

  1. The Portia UI needs an update so it knows whether the annotation is finished when a link is clicked.

almeidaf commented 9 years ago

If a function could be added to the annotation tool, with a command like "follow link and merge with current item", you could then annotate all the pages (overview/photos/review/contact) as individual templates, and in the main template specify which links to follow and which template to use.

So that: main_template: http://www.diningcity.com/nl/amsterdam/restaurant_vondelpark3, with the links to the other pages annotated as "follow_link_merge"

sub_template_1: http://www.diningcity.com/nl/amsterdam/restaurant_vondelpark3/photos#main
sub_template_2: http://www.diningcity.com/nl/amsterdam/restaurant_vondelpark3/reviews#main
sub_template_3: http://www.diningcity.com/nl/amsterdam/restaurant_vondelpark3/contacting#main

The sub templates should only extract when requested from a main_template, to avoid duplicates. Alternatively, you could add "/photos" to the "exclude patterns" and add a function that allows an excluded URL when it is requested from a template.
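The rule above could be sketched roughly like this. `should_extract` and its `requested_by` argument are hypothetical names for illustration, not an existing Portia or slybot API, and the sketch assumes that exclude patterns are regular expressions matched against the URL.

```python
import re
from typing import Iterable, Optional


def should_extract(url: str,
                   exclude_patterns: Iterable[str],
                   requested_by: Optional[str] = None) -> bool:
    """Decide whether a page should be extracted.

    A page requested from a template (requested_by is set) is always
    allowed; any other page is allowed only if no exclude pattern matches.
    """
    if requested_by is not None:
        return True
    return not any(re.search(pattern, url) for pattern in exclude_patterns)
```

With "/photos" as an exclude pattern, a photos page reached by the normal crawl would be skipped, while the same page requested from main_template would still be extracted and merged.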

The follow_link_merge could be in the fields box, and it's probably not necessary to specify which template to use.

I used to have a PHP script with this sort of function, but I have no idea how this can be done with Scrapy/Portia.

tpeng commented 9 years ago

Ideally we should be able to do it in the Portia UI, but the idea is basically the same as yours. Thanks @almeidaf. By the way, can you give a bit more detail about your PHP script?

almeidaf commented 9 years ago

The script was called "mine the web". I still have it, but since the website went down I can't install it because there is no way to validate the license (it's encoded with ionCube). You're welcome to try your luck, though. The follow link merge is documented in the help files. The script was bought in 2004, so it's kind of old :)

file: https://mega.co.nz/#!U8l2zDaD!f4jLER7pEsOQjHNBRoss32qB9_9qJyI--7udijvZ1WU
license key: https://mega.co.nz/#!k81TUDqI!K5IU8hsVtMjAUGqhIArIXBji4qg9jFsSySn5bhDsrZQ

I don't know if there's any problem with sharing the links in public. I will delete them if needed.

MitPandya commented 7 years ago

Hi, is this issue still open? If yes, can I work on it for Google Summer of Code 2017? Awaiting your reply.

ruairif commented 7 years ago

Hi @MitPandya, yes, this issue is still open, but the implementation has changed now. The first step is to modify slybot to have a concept of a link type. This link will be a page to enter after an item has been extracted from an initial page. The extracted item should be loaded into the request meta, and when items are extracted from the new page their fields should be nested in or added to the item extracted from the first page.

For all of this to work, the ItemProcessor and item fields will need a concept of this link type. The extractor will also need to handle yielding a request instead of an item, and merging a linked item with the parent item. Once this is working, adding the functionality to the UI shouldn't be too difficult.
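As a rough illustration of that flow, here is a small plain-Python sketch. It is not slybot code: `LinkRequest` is a stand-in for a Scrapy request carrying a `meta` dict, the pages are in-memory dicts rather than responses, and the `link` and `nest_under` keys are made-up names for whatever the link type ends up storing.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterator, List


@dataclass
class LinkRequest:
    """Stand-in for a request: a URL, a callback, and a meta dict."""
    url: str
    callback: Callable[[Dict[str, Any], Dict[str, Any]], Iterator[Any]]
    meta: Dict[str, Any] = field(default_factory=dict)


def parse_initial(page: Dict[str, Any]) -> Iterator[Any]:
    """Extract the item; yield a request instead if a link must be followed."""
    item = dict(page["fields"])
    link = page.get("link")  # hypothetical value of a link-type field
    if link is None:
        yield item
        return
    # Carry the partial item along in the request meta.
    yield LinkRequest(link, parse_linked,
                      meta={"item": item, "nest_under": page.get("nest_under")})


def parse_linked(page: Dict[str, Any], meta: Dict[str, Any]) -> Iterator[Any]:
    """Nest or add the linked page's fields to the item from the first page."""
    item = dict(meta["item"])
    nest_under = meta.get("nest_under")
    if nest_under:
        item[nest_under] = dict(page["fields"])
    else:
        for key, value in page["fields"].items():
            item.setdefault(key, value)  # fields from the first page are kept
    yield item


def crawl(start_url: str, pages: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Tiny driver that follows LinkRequests over in-memory pages."""
    queue = list(parse_initial(pages[start_url]))
    items = []
    while queue:
        result = queue.pop(0)
        if isinstance(result, LinkRequest):
            queue.extend(result.callback(pages[result.url], result.meta))
        else:
            items.append(result)
    return items
```

In a real spider the same shape would map onto a Scrapy `Request` created with `meta={"item": item}` and read back through `response.meta` in the callback. How the ItemProcessor should represent the link type is left open here.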

This feature will add some much needed functionality to Portia so it would be great if you can implement it.