Scrape years 2020 and up

schemeorg-community / schemeworkshop.org

The schemeworkshop.org website (without papers)

https://www.schemeworkshop.org

Scrape years 2020 and up #3

Open lassik opened 3 months ago

lassik commented 3 months ago

Starting in 2020, the Workshop's website is the ICFP subsite at icfpYY.sigplan.org/home/scheme-YYYY.

@jasonhemann Do you know of a standard way to scrape local copies of these sites, or should we just do manual labor?
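For reference, the URL pattern above can be expanded per year with a few lines of Python. This is only a sketch: the `https://` scheme, the function name `workshop_url`, and the assumption that every year from 2020 on follows the same pattern are mine, not confirmed in this thread.

```python
def workshop_url(year: int) -> str:
    """Build the ICFP subsite URL for a given workshop year.

    Follows the pattern icfpYY.sigplan.org/home/scheme-YYYY.
    The https:// prefix is an assumption.
    """
    return f"https://icfp{year % 100:02d}.sigplan.org/home/scheme-{year}"


def workshop_urls(first_year: int, last_year: int) -> list[str]:
    """List the URLs for every year in the inclusive range."""
    return [workshop_url(y) for y in range(first_year, last_year + 1)]


if __name__ == "__main__":
    # Hypothetical range; adjust the end year as needed.
    for url in workshop_urls(2020, 2023):
        print(url)
```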

jasonhemann commented 3 months ago

I know they're all built using conf.researchr.org, but I don't know any more effective or automatic way to get that data than what you're describing.

jasonhemann commented 2 months ago

This is now closeable IIUC b/c @lassik found a good way to handle this.

lassik commented 2 months ago

Yes. I emailed the conf.researchr.org maintainers and they suggested a built-in feature that helps us do a decent job.

I'll take care of this.

lassik commented 1 week ago

It's harder than I thought. The static HTML served by conf.researchr.org still depends quite heavily on JavaScript and fetches various files from the ICFP site. It is a local copy in name only. The good thing is that it only fetches static assets from the server. AFAICT it does not make database queries.

Nevertheless, I'd prefer that someone write a program that converts the static HTML to JS-free HTML. This doesn't look too hard, but we are all pressed for time.
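A rough sketch of what the simplest form of such a converter could look like, using only the Python standard library. It re-serializes the HTML while dropping `<script>` elements and inline `on*` event handlers. The names `ScriptStripper` and `strip_javascript` are made up for illustration. Note the big assumption: this only removes JavaScript, it does not run it, so anything the page builds at load time would be lost. Whether that is acceptable for the conf.researchr.org export has not been checked here.

```python
from html import escape
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-serialize HTML, dropping <script> elements and on* attributes."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def _open_tag(self, tag, attrs, suffix=">"):
        parts = [tag]
        for name, value in attrs:
            if name.startswith("on"):  # inline handlers such as onclick
                continue
            # The parser decodes attribute values, so escape them again.
            parts.append(name if value is None else f'{name}="{escape(value)}"')
        self.out.append("<" + " ".join(parts) + suffix)

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        else:
            self._open_tag(tag, attrs)

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        else:
            self.out.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        if tag != "script":
            self._open_tag(tag, attrs, suffix=" />")

    def handle_data(self, data):
        if not self.in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.in_script:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_javascript(html_text: str) -> str:
    """Return html_text without script elements or inline event handlers."""
    parser = ScriptStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Usage would be along the lines of reading each saved page, passing its text through `strip_javascript`, and writing the result back out. Links such as `href="javascript:..."` are not handled by this sketch.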

lassik commented 1 week ago

@jasonhemann If you have a few spare cycles, we could try writing such a scraper together. If I have to do it by myself, it will take an unbounded amount of time.

Do you know what programming language conf.researchr.org is written in? I can't find the source. We could write the scraper in that language in the hopes that other people will maintain it. I envision a 500-line Python script.

jcubic commented 1 week ago

This should not be hard if you use a headless browser like Puppeteer that opens the page and runs any JavaScript. I asked ChatGPT:

Can you use Puppeteer to scrape rendered HTML from a website that uses JavaScript?

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the website
  await page.goto('https://example.com', {
    waitUntil: 'networkidle2' // Wait for the network to be idle
  });

  // Get the rendered HTML content
  const content = await page.content();

  console.log(content);

  await browser.close();
})();

This is a relatively simple task, so you should be able to use ChatGPT to generate the code for you with a few modifications.