lorey / mlscraper

🤖 Scrape data from HTML websites automatically by just providing examples
https://pypi.org/project/mlscraper/
1.31k stars 89 forks

Find better selectors #24

Open lorey opened 2 years ago

lorey commented 2 years ago

Currently, we just use the next best selector we find, searching from generic to specific. But overly generic selectors are bad (e.g. div most likely carries no meaning), while overly specific selectors, like the full path, are likely to break.

Maybe there's a heuristic for good selectors. An idea: what if we compute the selectivity of each selector, i.e. how unique it is on the whole page? This would prefer ids and unique classes and discourage generic selectors. We would then take the most selective but simplest selector.
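
A minimal sketch of that idea, using only the standard library. This is not how mlscraper currently works: the names `ElementCollector`, `candidate_selectors`, `matches` and `best_selector` are made up for illustration, selectivity is approximated by the number of elements a selector matches on the page, and only simple tag, id and class selectors are considered.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collects (tag, id, classes) for every start tag on the page."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = tuple((attr.get("class") or "").split())
        self.elements.append((tag, attr.get("id"), classes))


def candidate_selectors(element):
    """Simple selectors that match the element: tag, #id, .class, tag.class."""
    tag, id_, classes = element
    candidates = [tag]
    if id_:
        candidates.append(f"#{id_}")
    for cls in classes:
        candidates.append(f".{cls}")
        candidates.append(f"{tag}.{cls}")
    return candidates


def matches(selector, element):
    """Hypothetical matcher for the tiny selector subset generated above."""
    tag, id_, classes = element
    if selector.startswith("#"):
        return id_ == selector[1:]
    sel_tag, _, sel_cls = selector.partition(".")
    if sel_tag and sel_tag != tag:
        return False
    return not sel_cls or sel_cls in classes


def best_selector(html, target_index):
    """Pick the selector with the fewest matches; break ties by length."""
    parser = ElementCollector()
    parser.feed(html)
    elements = parser.elements
    target = elements[target_index]

    def score(selector):
        match_count = sum(matches(selector, e) for e in elements)
        return (match_count, len(selector))

    return min(candidate_selectors(target), key=score)


PAGE = (
    '<div><h2 class="title">A</h2><p class="price">1</p></div>'
    '<div><h2 class="title">B</h2><p id="total" class="price">3</p></div>'
)
print(best_selector(PAGE, 5))  # the last <p>, which has a unique id
```

Sorting by (match count, length) is just one way to encode "most selective but simplest". A real implementation would probably weight the two against each other and evaluate selectors across all sample pages, not a single one.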

entrptaher commented 1 year ago

Converting a dynamic selector to a *= (substring-match) attribute selector may work.

decentralizeCss(`h2.heading--2eONR.heading-2--1OnX8.title--3yncE.block--3v-Ow`)

// h2[class*="heading--"][class*="heading-2--"][class*="title--"][class*="block--"]

Of course, it depends on the implementation of this function. I tested it a few times and it worked.
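
For reference, a rough Python equivalent of such a function could look like the sketch below. The implementation of `decentralizeCss` isn't shown in this thread, so this is a guess at the behaviour: it assumes a "dynamic" class always has the shape `name--hash`, and `decentralize_css` and `DYNAMIC_CLASS` are names invented here.

```python
import re

# Assumption: a dynamic class looks like "<name>--<hash>", as in the example above.
# Group 1 captures the stable prefix up to and including the first "--".
DYNAMIC_CLASS = re.compile(r"\.([A-Za-z0-9_-]+?--)[A-Za-z0-9_-]+")


def decentralize_css(selector: str) -> str:
    """Replace hashed class selectors with substring-match attribute selectors."""
    return DYNAMIC_CLASS.sub(lambda m: f'[class*="{m.group(1)}"]', selector)


print(decentralize_css("h2.heading--2eONR.heading-2--1OnX8.title--3yncE.block--3v-Ow"))
```

Note that `[class*="title--"]` is looser than the original class selector: it also matches unrelated classes such as `subtitle--abc`, so the result can select more elements than intended.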

lorey commented 1 year ago

I think with the soupsieve implementation, that's computationally very expensive. For something Java-based with XSLT, it might make more sense.