lassik opened this issue 3 months ago
I know they're all built using conf.researchr.org, but I don't know any more effective or automatic way to get that data than what you're describing.
This is now closeable, IIUC, because @lassik found a good way to handle this.
Yes. I emailed the conf.researchr.org maintainers and they suggested a built-in feature that helps us do a decent job.
I'll take care of this.
It's harder than I thought. The static HTML served by conf.researchr.org still depends quite heavily on JavaScript and fetches various files from the ICFP site. It is a local copy in name only. The good news is that it only fetches static assets from the server; AFAICT it does not make database queries.
Nevertheless, I'd prefer that someone write a program that converts the static HTML to JS-free HTML. This doesn't look too hard, but we are all pressed for time.
@jasonhemann If you have a few spare cycles, we could try writing such a scraper together. If I have to do it by myself, it will take an unbounded amount of time.
Do you know what programming language conf.researchr.org is written in? I can't find the source. We could write the scraper in that language in the hopes that other people will maintain it. I envision a 500-line Python script.
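To make the idea concrete, here is a minimal sketch of the kind of post-processing such a scraper might do, assuming all we need is to drop `<script>` elements and inline event handlers from the saved HTML. The function name `strip_js` and the regex-based approach are my own illustration, not a finished design; a real 500-line script would presumably use a proper HTML parser and also rewrite the asset URLs it fetches.

```python
import re

def strip_js(html: str) -> str:
    """Remove <script>...</script> blocks and inline on* event handlers
    from an HTML string. A rough sketch; a real tool should use an HTML
    parser rather than regexes."""
    # Drop paired <script ...>...</script> blocks (case-insensitive).
    html = re.sub(r'<script\b[^>]*>.*?</script>', '', html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Drop inline event-handler attributes such as onclick="...".
    html = re.sub(r'\s+on\w+\s*=\s*"[^"]*"', '', html, flags=re.IGNORECASE)
    return html

if __name__ == '__main__':
    page = '<body onload="init()"><p>Scheme Workshop</p>' \
           '<script src="app.js"></script></body>'
    print(strip_js(page))
```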
This should not be hard if you use a headless browser like Puppeteer that opens the page and runs its JavaScript. I asked ChatGPT:

Can you use Puppeteer to scrape rendered HTML from a website that uses JavaScript?
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the website
  await page.goto('https://example.com', {
    waitUntil: 'networkidle2' // Wait for the network to be idle
  });

  // Get the rendered HTML content
  const content = await page.content();
  console.log(content);

  await browser.close();
})();
This is a relatively simple task, so you should be able to use ChatGPT to generate the code for you with a few modifications.
Starting in 2020, the Workshop's website is the ICFP subsite at
icfpYY.sigplan.org/home/scheme-YYYY
@jasonhemann Do you know of a standard way to scrape local copies of these sites, or should we just do manual labor?