Would be neat if you could do pagination when running shot-scraper javascript - by running extra JavaScript that returns the URL of the next page to visit

Open · simonw opened 1 year ago
Here's a prototype I built to help me scrape through all of https://news.ycombinator.com/from?site=simonwillison.net following the more links:
```diff
diff --git a/shot_scraper/cli.py b/shot_scraper/cli.py
index 9bc48aa..eb3a80e 100644
--- a/shot_scraper/cli.py
+++ b/shot_scraper/cli.py
@@ -524,6 +524,21 @@ def accessibility(url, auth, output, javascript, timeout, log_console, skip, fail
     is_flag=True,
     help="Output JSON strings as raw text",
 )
+@click.option(
+    "next_",
+    "--next",
+    help="JavaScript to run to find next page",
+)
+@click.option(
+    "--next-delay",
+    type=int,
+    help="Milliseconds to wait before following --next",
+)
+@click.option(
+    "--next-limit",
+    type=int,
+    help="Maximum number of --next pages",
+)
 @browser_option
 @user_agent_option
 @reduced_motion_option
@@ -536,6 +551,9 @@ def javascript(
     auth,
     output,
     raw,
+    next_,
+    next_delay,
+    next_limit,
     browser,
     user_agent,
     reduced_motion,
@@ -571,6 +589,7 @@ def javascript(
     if not javascript:
         javascript = input.read()
     url = url_or_file_path(url, _check_and_absolutize)
+    next_count = 0
     with sync_playwright() as p:
         context, browser_obj = _browser_context(
             p,
@@ -582,9 +601,27 @@ def javascript(
         page = context.new_page()
         if log_console:
             page.on("console", console_log)
-        response = page.goto(url)
-        skip_or_fail(response, skip, fail)
-        result = _evaluate_js(page, javascript)
+        result = []
+        while url:
+            response = page.goto(url)
+            skip_or_fail(response, skip, fail)
+            evaluated = _evaluate_js(page, javascript)
+            if next_:
+                result.extend(evaluated)
+            else:
+                result = evaluated
+            next_count += 1
+            if next_:
+                if next_limit is not None and next_count >= next_limit:
+                    raise click.ClickException(
+                        f"Reached --next-limit of {next_limit} pages"
+                    )
+                url = _evaluate_js(page, next_)
+                print(url)
+                if next_delay:
+                    time.sleep(next_delay / 1000)
+            else:
+                url = None
         browser_obj.close()
     if raw:
         output.write(str(result))
```
I ran it like this and it worked!
```bash
shot-scraper javascript \
  'https://news.ycombinator.com/from?site=simonwillison.net' \
  -i /tmp/scrape.js \
  --next '() => {
    let el = document.querySelector(".morelink[rel=next]");
    if (el) {
      return el.href;
    }
  }' -o /tmp/all.json --next-delay 1000
```
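The /tmp/scrape.js from that run isn't shown here; a hypothetical script of the same shape (selectors guessed from Hacker News's markup: tr.athing rows with the score in the following subtext row) could return one object per story. Note that with --next set the prototype calls result.extend(evaluated), so the script needs to return an array:

```javascript
() => {
  // Hypothetical stand-in for /tmp/scrape.js: one object per story row.
  return Array.from(document.querySelectorAll("tr.athing")).map((row) => {
    const title = row.querySelector(".titleline a");
    const score = row.nextElementSibling?.querySelector(".score");
    return {
      id: row.id,
      title: title ? title.textContent : null,
      url: title ? title.href : null,
      points: score ? parseInt(score.textContent, 10) : null,
    };
  });
}
```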
Needs more thought about how things like concatenating together results from multiple pages should work.
It would also be neat if this could return a {"method": "POST", "body": "..."} object as an alternative to returning a URL; then shot-scraper could hit subsequent pages using other HTTP methods. Maybe persist cookies too!
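A sketch of what that return shape could look like; nothing here is implemented, and both the selector and the object's keys are assumptions:

```javascript
() => {
  // Hypothetical: when the next page is behind a POST form, return a
  // request description instead of a URL string.
  const form = document.querySelector("form.load-more");
  if (form) {
    return {
      method: "POST",
      url: form.action,
      body: new URLSearchParams(new FormData(form)).toString(),
    };
  }
  // Plain links keep working as before; returning nothing ends the loop
  const link = document.querySelector("a[rel=next]");
  if (link) {
    return link.href;
  }
}
```

On the Python side this would need more than page.goto(), which only issues GET requests - perhaps Playwright's context.request (which shares cookies with the browser context) to make the request, plus page.set_content() to run the scraping script against the returned HTML.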
I was trying to scrape some Google Maps lists of places, but didn't manage: the first page that loads is a cookie notice, and accepting or rejecting it triggers a navigation event that results in Error: Execution context was destroyed, most likely because of a navigation. This sounds like it could solve that?
To your question: maybe it could just emit newline-delimited JSON and leave the concatenation to downstream tools?
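On the cookie notice: the click is what triggers the navigation that destroys the execution context, so if the {"method": "POST", ...} idea above landed, a --next script might sidestep it by describing the consent form instead of clicking it, letting shot-scraper perform the navigation itself. Purely a sketch against that hypothetical API, with a guessed selector:

```javascript
() => {
  // Guessed selector for the consent page's "accept" form. Serializing
  // it (rather than clicking) means no in-page navigation happens
  // during evaluation, so the execution context survives.
  const form = document.querySelector('form[action*="consent"]');
  if (form) {
    return {
      method: "POST",
      url: form.action,
      body: new URLSearchParams(new FormData(form)).toString(),
    };
  }
}
```

If cookies were persisted as suggested above, the accepted consent would then stick for subsequent page.goto() calls.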
Pagination is difficult to wrap your head around. I scrape 1000s of pages on a daily basis, and pagination is something no scraper gets quite right.
In the script above, --next is supposed to get the next link. But “which” next links are we talking about?
In a nutshell, websites consist of list pages and single pages. List pages “list” the pages a website has; single pages are the “final” pages.
Sometimes the list page itself is the final page (think IMDb genre pages or Amazon shoe listings), and then following a “next” link is all you need.
```mermaid
flowchart LR;
    start-url --> list-page-1
    start-url --> list-page-2
    start-url --> list-page-3
```
But in reality, list pages serve a very different purpose. A list is a “summary” designed to “entice” users to click; it doesn't have the actual data a scraper wants (see the case below).
```mermaid
flowchart LR;
    start-url --> list-page-1 --> single-pages-11[single page 1]
    list-page-1 --> single-pages-12[single page 2]
    list-page-1 --> single-pages-13[single page 3]
    start-url --> list-page-2 --> single-pages-21[single page 1]
    list-page-2 --> single-pages-22[single page 2]
    list-page-2 --> single-pages-23[single page 3]
```
To sum it up: to let shot-scraper “follow” links, one has to think about two types of links to follow: pagination links (1, 2, 3, next, etc.) and list items (card, article, col, etc.). It also helps to actually call them that:

```bash
shot-scraper https://amazon.com/shoes --pagination a[label=next] --follow a.items
```
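In the meantime, the two link types can be approximated with the prototype above: --next handles the pagination links, and the main script can collect the list-item links for a second pass (e.g. a shell loop over shot-scraper javascript) to visit one by one. A sketch of such a collector, where the a.items selector is invented to match the example above:

```javascript
() => {
  // Collect single-page links from a list page. Feed the resulting
  // URLs to a second pass that runs the real extraction script on
  // each single page.
  return Array.from(document.querySelectorAll("a.items")).map(
    (el) => el.href
  );
}
```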
Case in point, an AI-model hub. On the list pages you get: name, category, last update, number of downloads rounded to the nearest 1,000, and favorites, also rounded to the nearest 1,000.
Let's say you want the growth rate. On the list page the downloads are listed as 227K, but when you click through to the actual page it says 226,828.
The difference between scraping the list page and the actual page is that on the list page it takes 1,000 downloads before you notice any change. In real life, that means you won't be able to catch “trending” AI models.
Another example: you want to know the sentiment around an AI model. On list pages you have favorites, but a “favorite” doesn't really say much: a person can favorite a model to get updates, to view it later, because they like the idea, because they're interested in how it works, etc.
On the actual page you have a community tab, which reveals far more about sentiment: the ratio between open and closed issues, for example. 800 open issues and 1 closed tells a different story than 800 open/1,000 closed, 0 open/800 closed, or even 800 closed with a last update in 1980.
Another example of a list page not having everything you need is rottentomatoes.com.
On the list page you get: title, Tomatometer, audience score, opening date.
On the actual page you get: MPA rating (G, PG, PG-13), genre, duration, critics consensus, recommended/similar movies, where to watch, language, synopsis, cast.
Even if you don't need anything complicated (genre, for example), shot-scraper still has to visit the actual page to get it, since the list page has pretty much nothing.
Most commonly used pagination types (a sketch covering the link-returning types follows the list):

- auto (detect which type of pagination is used)
- Link (<a href="https://example.com">)
- Scripted link (<a href="javascript:window.location='https://example.com'">)
- Attribute link (<a data-link="https://example.com">)
- Text link (<div>link: https://example.com</div>)
- Link from any script (window.location=, window.open)
- Click multiple times on a next/more button ([Next page], [Load more])
- Click once on each of multiple buttons ([1], [2], [3])
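The link-returning types can be folded into a single --next script, which is roughly what an auto mode would do. A sketch, with illustrative selectors and patterns only:

```javascript
() => {
  // 1. Plain link
  let el = document.querySelector("a[rel=next]");
  if (el && el.href) return el.href;
  // 2. Attribute link (<a data-link="...">)
  el = document.querySelector("a[data-link]");
  if (el) return el.dataset.link;
  // 3. Scripted link: pull the URL out of a javascript: href
  el = document.querySelector('a[href^="javascript:"]');
  if (el) {
    const m = el.getAttribute("href").match(/https?:\/\/[^"']+/);
    if (m) return m[0];
  }
  // 4. Text link: a URL sitting in plain text
  const m = document.body.innerText.match(/link:\s*(https?:\/\/\S+)/);
  if (m) return m[1];
  // Nothing found: return undefined, which ends the prototype's loop
}
```

The two click-based types don't fit the return-a-URL contract at all; they would need --next (or a separate option) to be allowed to perform the click itself, with shot-scraper waiting out the resulting navigation instead of calling page.goto().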