scrapinghub / portia

Visual scraping for Scrapy
BSD 3-Clause "New" or "Revised" License

How exactly does "selection mode: automatic" work? #683

Closed eng1neer closed 7 years ago

eng1neer commented 7 years ago

I haven't found answers in the documentation, so I'm asking here.

I'm using Portia to extract products from an ecommerce store. Specifically, I've created an annotation for a collection of links that form the page's breadcrumbs.

[Screenshot: 2017-01-11 at 16:10:24]

With the default ("Automatic") selection mode, it extracts only the content of the first link ("Home"), even though there are several links. However, after switching to "CSS selector" mode (without changing the selector supplied by default), it extracts the whole list of 4 links.

(I'm using portiacrawl to test the spider.)

What is the exact difference between these 2 modes, and why might they result in a different number of scraped values?

eng1neer commented 7 years ago

Here's another example of confusing behavior. The heading annotation is defined by selecting multiple fields (h1 headings, in essence). 3 elements are correctly highlighted. However, the "extracted items" panel shows only 2 of them (the 2nd and 3rd). This happens regardless of the selector mode (I tried automatic and manual CSS with the "h1" selector).

[Screenshot: 2017-01-12 at 15:16:05]

What's the reason for this behavior?

ruairif commented 7 years ago

The automatic annotation mode trains the extractor using scrapely, so depending on the structure of the page it is sometimes unable to find the data correctly. CSS mode finds the elements by building the page DOM and extracting from that. If you have only a single annotation that uses a CSS selector, nothing is extracted. If you were to add another annotation that doesn't use a CSS selector, then it would work.
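To make the difference more concrete, here is a toy sketch using only the Python standard library. It is not scrapely's or Portia's actual code: `PAGE`, `extract_by_selector`, and `extract_by_template` are hypothetical names made up for illustration, and the "template" is reduced to a learned prefix and suffix around one annotated example. It only shows why a DOM-based selector can return every matching element while an extractor trained on the context of a single example may return just one.

```python
from html.parser import HTMLParser

# Hypothetical breadcrumb markup, standing in for the page in the question.
PAGE = (
    '<ul class="breadcrumbs">'
    '<li><a href="/">Home</a></li>'
    '<li><a href="/tools">Tools</a></li>'
    '<li><a href="/tools/drills">Drills</a></li>'
    '</ul>'
)


class LinkTextCollector(HTMLParser):
    """DOM-style extraction: walk every element and keep the text of each <a>."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.texts[-1] += data


def extract_by_selector(html):
    """Roughly what a selector such as "a" does: return all matching elements."""
    collector = LinkTextCollector()
    collector.feed(html)
    return collector.texts


def extract_by_template(html, prefix, suffix):
    """Toy stand-in for a trained template: return only the first region
    found between the markup seen before and after the annotated example."""
    start = html.find(prefix)
    if start == -1:
        return []
    start += len(prefix)
    end = html.find(suffix, start)
    if end == -1:
        return []
    return [html[start:end]]


if __name__ == "__main__":
    print(extract_by_selector(PAGE))
    # The context below is what surrounded the annotated "Home" link.
    print(extract_by_template(PAGE, '<ul class="breadcrumbs"><li><a href="/">', "</a>"))
```

In this sketch the selector-style function returns all three link texts, while the template-style function returns only the region that matches the context it was "trained" on, which is similar in spirit to seeing only "Home" in automatic mode.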

eng1neer commented 7 years ago

> If you have only a single annotation that uses a CSS selector, nothing is extracted. If you were to add another annotation that doesn't use a css selector then it would work.

Okay, but isn't it still strange that the CSS selector extracted 2 elements instead of 3 (or zero, according to your explanation)?

eng1neer commented 7 years ago

And one more question. Suppose I need to build a spider that uses only "deterministic" CSS selectors. Does this mean that I need to include a "dummy" automatic selector to make this work?

ruairif commented 7 years ago

You could definitely make it work that way.