hyphacoop / api.distributed.press

https://distributed.press
GNU Affero General Public License v3.0

Add ability to crawl a site instead of uploading. #82

Open RangerMauve opened 2 months ago

RangerMauve commented 2 months ago

Maybe use something like this JSDOM-based crawler to download all the files? https://crawlee.dev/api/jsdom-crawler
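To make the idea concrete, here is a rough sketch of what "crawl instead of upload" could look like. It deliberately does not use Crawlee: it is a dependency-free, breadth-first, same-origin crawl with a regex standing in for a real DOM parser, so it only approximates what a JSDOM-based crawler would do. Every name in it (`FetchPage`, `extractLinks`, `crawlSite`, `toFilePath`, the `maxPages` limit) is hypothetical and not part of Crawlee or the Distributed Press API.

```typescript
// Hypothetical sketch: none of these names come from Crawlee or Distributed Press.
// The page fetcher is injected so the crawl logic can run without network access.
type FetchPage = (url: string) => Promise<string | undefined>;

// Pull href values out of anchor tags and resolve them against the page URL.
// A regex keeps the sketch dependency-free; a real crawler would parse the DOM.
function extractLinks(html: string, baseUrl: string): string[] {
  const links: string[] = [];
  const hrefPattern = /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi;
  let match: RegExpExecArray | null;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      const resolved = new URL(match[1], baseUrl);
      resolved.hash = ''; // a fragment points into the same document
      links.push(resolved.href);
    } catch {
      // Ignore hrefs that cannot be parsed as URLs.
    }
  }
  return links;
}

// Breadth-first crawl that stays on the start URL's origin.
// Returns a map from page URL to the HTML that was fetched for it.
async function crawlSite(
  startUrl: string,
  fetchPage: FetchPage,
  maxPages = 100, // assumed safety limit so a crawl cannot run forever
): Promise<Map<string, string>> {
  const start = new URL(startUrl);
  const pages = new Map<string, string>();
  const queue: string[] = [start.href];
  const seen = new Set<string>(queue);

  while (queue.length > 0 && pages.size < maxPages) {
    const url = queue.shift() as string;
    const html = await fetchPage(url);
    if (html === undefined) continue; // fetcher reported a missing page
    pages.set(url, html);
    for (const link of extractLinks(html, url)) {
      if (new URL(link).origin === start.origin && !seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return pages;
}

// Map a crawled URL to a path inside the site folder that would be published.
// Directory-style URLs get an index.html; other paths are kept as they are.
function toFilePath(url: string): string {
  const pathname = new URL(url).pathname;
  return pathname.endsWith('/') ? `${pathname}index.html` : pathname;
}
```

In practice `fetchPage` could wrap the runtime's built-in `fetch` and return `undefined` for error responses. The resulting map, keyed through `toFilePath`, is the same set of files a user would otherwise upload by hand. Pages that only render with JavaScript would not be captured by this approach.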

fauno commented 2 months ago

I was going to ask about this:

"However, if the target website requires JavaScript to display the content, you might need to use PuppeteerCrawler or PlaywrightCrawler instead, because it loads the pages using full-featured headless Chrome browser."

But also, what about using WARC so we can integrate with WebRecorder tools? We've been archiving sites and we don't have anywhere to upload them, so DP would be great :D