i-on-project / integration

Imports information from external systems, which is validated, parsed, and submitted as structured data (YAML or JSON) to a separate GitHub repository.
Apache License 2.0

Obtain updated PDF URLs automatically #155

Open grimord opened 3 years ago

grimord commented 3 years ago

Create a simple web scraper to automatically obtain links to all ISEL PDF timetables from the programme pages (it can also include the official programme name and degree level: licenciatura / mestrado).
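For reference, a minimal sketch of what one scraped record could look like once serialized to JSON (the field names and the PDF URL below are illustrative placeholders, not an agreed schema):

{
  "programme": "Engenharia Informática e Computadores",
  "degree": "licenciatura",
  "pdf": "https://www.isel.pt/<path-to-timetable>.pdf"
}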

To consider:

grimord commented 3 years ago

Created a basic throwaway scraper as a practical proof of concept using Node.js. ISEL's current website doesn't include ID attributes in most elements, so DOM queries will have to rely on element type + class combinations and even a little filtering through href attributes.

const fetch = require("node-fetch");
const cheerio = require("cheerio");

// Programme timetable pages to check (Informática, Mecânica, Civil).
const LEIC = "https://www.isel.pt/cursos/licenciaturas/engenharia-informatica-e-computadores/horarios";
const LEM = "https://www.isel.pt/cursos/licenciaturas/engenharia-mecanica/horarios";
const LEC = "https://www.isel.pt/cursos/licenciaturas/engenharia-civil/horarios";
const programmes = [LEIC, LEC, LEM];

const getSchedule = async (uri) => {
  // Download the programme page and load the HTML into cheerio.
  const body = await fetch(uri).then((resp) => resp.text());
  const $ = cheerio.load(body);

  // No usable IDs on the page, so select by element type + class and
  // filter down to anchors whose href actually points at a PDF. The
  // (href || "") guard avoids throwing on anchors without an href.
  const pdfAnchor = $("a[class=sizer]")
    .filter((i, el) => ($(el).attr("href") || "").endsWith(".pdf"))
    .first();

  return {
    programme: $("h1[class=sizer]").text().trim(),
    pdf: pdfAnchor.attr("href"),
  };
};

// Print "programme => PDF URL" for each page.
programmes.forEach((url) =>
  getSchedule(url).then((d) => console.log(`${d.programme} => ${d.pdf}`))
);
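The issue also mentions capturing the degree level. Since that isn't reliably exposed on the page itself, a minimal sketch, assuming ISEL keeps the /cursos/licenciaturas/ (and, presumably, /cursos/mestrados/) URL convention seen above, is to derive it from the URL path. getDegreeLevel is a hypothetical helper, not part of the scraper above:

// Hypothetical helper: infer degree level from the programme URL path.
// The /cursos/mestrados/ segment is an assumption mirroring the
// /cursos/licenciaturas/ paths used above.
const getDegreeLevel = (url) => {
  if (url.includes("/licenciaturas/")) return "licenciatura";
  if (url.includes("/mestrados/")) return "mestrado";
  return "unknown";
};

console.log(getDegreeLevel(LEIC)); // => "licenciatura"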