Open shikharish opened 3 months ago
Due to the change in the first-year curriculum, the first-year-scraper is outdated and needs to be updated.
Can you add the details of what has changed?
@shikharish ?
The format of the PDF from which the first-year timetable is scraped has completely changed, so the scraping logic needs to be rewritten.
Can you send the new PDF?
So, chillzone doesn't have proper data at the moment?
No.
Oh, so, when will we need to make the required changes?
Now, if possible, but chillzone uses very outdated PDF parsing libraries that require Python 3.7. @shikharish, is there no alternative?
The problem is not that it uses outdated parsing libraries. This year the whole format of the PDF was changed, so we need to write the scraper logic from scratch.
The format change is fine, we can do it. We should focus on getting rid of the outdated libraries first. This is unmaintainable. Are there any alternatives?
None that I could find. camelot-py parses PDF to xlsx directly, which makes scraping easier, while other scrapers can convert to plain text or HTML.
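For reference, a rough sketch of what table extraction with camelot-py could look like. This is not the existing chillzone scraper: the helper names `clean_rows` and `extract_tables` and the `pages` default are made up for illustration, and the new PDF layout may need different handling. It only assumes camelot-py's `read_pdf` and the `df` attribute on each returned table.

```python
from typing import List


def clean_rows(rows: List[List[str]]) -> List[List[str]]:
    """Collapse whitespace/newlines inside cells and drop fully empty rows."""
    cleaned = []
    for row in rows:
        cells = [" ".join(str(cell).split()) for cell in row]
        if any(cells):  # keep the row only if at least one cell has text
            cleaned.append(cells)
    return cleaned


def extract_tables(pdf_path: str, pages: str = "1") -> List[List[List[str]]]:
    """Hypothetical helper: read tables from a PDF and return cleaned rows."""
    import camelot  # imported lazily so clean_rows works without camelot installed

    tables = camelot.read_pdf(pdf_path, pages=pages)
    # Each table exposes its contents as a pandas DataFrame via .df
    return [clean_rows(table.df.values.tolist()) for table in tables]
```

Keeping the cell cleanup separate from the camelot call means the parsing logic for the new format could be tested on plain lists, and the PDF library could be swapped later without touching it.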
Can we use something like libreoffice to convert the pdf to a spreadsheet and then parse that using a recent library?
libreoffice can't do that afaik. We can try using the API of some online tool like ilovepdf, smallpdf....
What about onlyoffice?
Don't think so.
Hmm, in that case we should write a Dockerfile to run the scraper in.