There's a neat little package autoscraper that lets you quickly build no-code web extractors.
You take a page with known content.
Say what text you need from it and which alias to bind each piece to. For example, { "name": "Apple Mac Mini (256GB SSD, M1, 8GB)", "current_bid": "US $130.50", "end_of_bid": "Saturday, 11:32 PM" }
Fit the model to your known page and known data.
It then tries to find which DOM selectors yield the desired data with the best accuracy and saves them into a model object you can pickle. You probably should pickle it, given the known page may die long before the DOM changes, so it's best to keep model creation somewhere in a notebook.
Now you can just predict that data from new URLs/DOMs.
I actually wonder whether the idea can be extended to also pull the text out of heap data, especially given the heap is a lot messier than hunting for a selector.
It may be prototyped as another CLI on top of the heap, HTML, and image exporting here.