scrapinghub / portia

Visual scraping for Scrapy
BSD 3-Clause "New" or "Revised" License
9.3k stars 1.4k forks

Strange behavior of CSS selectors #694

Open eng1neer opened 7 years ago

eng1neer commented 7 years ago

I've noticed a strange behavior of selectors in Portia. Here's a zipped video of my workflow:

css-selector-bug.zip

So basically what I do is create a project on scrapinghub.com, create a spider for https://www.kurtgeiger.com/women/shoes/trainers/slip-ons/logical-silver-glitter-kg-kurt-geiger and then select an element which has the id #product-sku. At first Portia gives it the selector #details-dl > dd:nth-child(10), which is strange because a unique id exists for that element. But after I create another annotation, the first element's selector changes to the correct #product-sku.

ruairif commented 7 years ago

Selector generation prefers longer selectors, but when you added a second annotation the long selector was no longer valid because the two elements are so close together in the page. With just one annotation, the algorithm searches the HTML tree to a depth of ~10 elements. Adding a second annotation tells the algorithm that the annotations are going to be in a specific part of the page, and the CSS selectors update to reflect this. There are actually 3 CSS selectors generated: if you look inside the data for the sample, you should see an annotation with Item_container: true and selector: #details-dl at the top, followed by your 2 other annotations.

There is a bug here though. That annotation shouldn't change its selector after you have changed it from automatic to CSS and it shouldn't be used for the generation of the 3rd selector.

eng1neer commented 7 years ago

@ruairif Thanks for the explanation. What I find problematic here is that the #details-dl > dd:nth-child(10) selector is much less robust than #product-sku. Other products may have a different number of attributes in the dl element, so a selector that depends on element order will fail while #product-sku will keep working. Is there a way to shift the preference towards id selectors, to give them higher priority? Any pointers on where I should look are appreciated.
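To illustrate the concern, here is a minimal sketch in Python using only the standard library. The markup, the attribute names, and the helpers build_page, by_position and by_id are all hypothetical; they only emulate the two selectors on made-up pages and say nothing about how Portia itself generates or evaluates selectors.

```python
import xml.etree.ElementTree as ET


def build_page(attributes):
    """Build a tiny, hypothetical product <dl>; the last dt/dd pair holds the SKU."""
    dl = ET.Element("dl", id="details-dl")
    for name, value in attributes:
        ET.SubElement(dl, "dt").text = name
        ET.SubElement(dl, "dd").text = value
    ET.SubElement(dl, "dt").text = "SKU"
    ET.SubElement(dl, "dd", id="product-sku").text = "SKU-123"
    return dl


def by_position(dl, n):
    """Emulate '#details-dl > dd:nth-child(n)': the n-th child, only if it is a <dd>."""
    children = list(dl)
    if n <= len(children) and children[n - 1].tag == "dd":
        return children[n - 1].text
    return None


def by_id(dl, element_id):
    """Emulate '#<element_id>': look the element up by its id attribute."""
    match = dl.find(f".//*[@id='{element_id}']")
    return None if match is None else match.text


# Page A: four attributes, so the SKU <dd> happens to be the 10th child.
page_a = build_page([("Colour", "Silver"), ("Material", "Fabric"),
                     ("Heel", "Flat"), ("Fit", "Regular")])
# Page B: one extra attribute pushes the SKU <dd> to position 12.
page_b = build_page([("Colour", "Black"), ("Material", "Leather"),
                     ("Heel", "Flat"), ("Fit", "Regular"), ("Toe", "Round")])
# Page C: one attribute fewer, so there is no 10th child at all.
page_c = build_page([("Colour", "White"), ("Material", "Canvas"),
                     ("Heel", "Flat")])

for label, page in (("A", page_a), ("B", page_b), ("C", page_c)):
    print(label, by_position(page, 10), by_id(page, "product-sku"))
# A SKU-123 SKU-123
# B Round SKU-123
# C None SKU-123
```

On these made-up pages the positional lookup returns the wrong value or nothing as soon as the attribute count changes, while the id lookup keeps returning the SKU.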

eng1neer commented 7 years ago

Also, maybe I missed the point, but even when I select an element that is not close to #product-sku (I select .page > header > .inner-wrap > .hide-text in the header), the behavior is the same as in my original post.

ruairif commented 7 years ago

Maybe the generation doesn't behave the way I think it does, then. I can't make it prioritise id selectors, but I will fix the bug that causes the CSS selector to change. After you mark a selector as CSS, it should stay static unless you change it manually.