metakgp / chillzone

Find a place to chill during class hours in IIT KGP

https://chill.metakgp.org

GNU General Public License v3.0

22 stars 27 forks

First year scraper is outdated #85

Open shikharish opened 3 months ago

shikharish commented 3 months ago

Due to the change in the first-year curriculum, the first-year-scraper is outdated and needs to be updated.

harshkhandeparkar commented 3 months ago

Due to the change in the first-year curriculum, the first-year-scraper is outdated and needs to be updated.

Can you add the details of what has changed?

harshkhandeparkar commented 3 months ago

@shikharish ?

shikharish commented 3 months ago

The format of the PDF from which the first-year timetable is scraped has completely changed, so the scraping logic needs to be changed.

harshkhandeparkar commented 3 months ago

The format of the PDF from which the first-year timetable is scraped has completely changed, so the scraping logic needs to be changed.

Can you send the new PDF?

shikharish commented 3 months ago

proffapt commented 1 month ago

So, chillzone doesn't have proper data at the moment?

shikharish commented 1 month ago

No.

proffapt commented 1 month ago

Oh, so, when will we need to make the required changes?

harshkhandeparkar commented 1 month ago

Now, if possible, but chillzone uses very outdated PDF parsing libraries that require Python 3.7. @shikharish, is there no alternative?

shikharish commented 1 month ago

the problem is not that it uses outdated parsing libraries. this year the whole format of the pdf was changed so we need to write the new logic of the scraper from scratch.

harshkhandeparkar commented 1 month ago

the problem is not that it uses outdated parsing libraries. this year the whole format of the pdf was changed so we need to write the new logic of the scraper from scratch.

The format change is fine, we can do it. We should focus on getting rid of the outdated libraries first. This is unmaintainable. Are there any alternatives?

shikharish commented 1 month ago

Not one I could find. camelot-py parses PDF tables and exports them to xlsx directly, which makes scraping easier, while other scrapers can only convert to plain text/HTML.
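
For illustration, a minimal sketch of why table output is easier to scrape than plain text. It assumes camelot-py's `read_pdf` and the per-table `.df` DataFrame. The grid layout (a header row of time slots, one row per day, comma-separated rooms per cell), the helper names `load_rows` and `occupied_rooms`, and the room names are all hypothetical, since the new PDF format is not shown in this thread.

```python
from typing import Dict, List, Set, Tuple


def load_rows(pdf_path: str, pages: str = "1") -> List[List[str]]:
    """Return the first table camelot finds as a list of rows of cell strings."""
    import camelot  # imported lazily so the parsing step below runs without it

    tables = camelot.read_pdf(pdf_path, pages=pages)
    if len(tables) == 0:
        raise ValueError(f"no tables found in {pdf_path}")
    # Each camelot table exposes its cells as a pandas DataFrame.
    return tables[0].df.values.tolist()


def occupied_rooms(rows: List[List[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Map (day, slot) to the rooms in use.

    Assumed (hypothetical) layout: rows[0] is ["Day", slot, slot, ...] and
    every later row is [day, cell, cell, ...], where a cell holds zero or
    more comma-separated room names.
    """
    slots = [cell.strip() for cell in rows[0][1:]]
    result: Dict[Tuple[str, str], Set[str]] = {}
    for row in rows[1:]:
        day = row[0].strip()
        for slot, cell in zip(slots, row[1:]):
            rooms = {name.strip() for name in cell.split(",") if name.strip()}
            if rooms:
                result.setdefault((day, slot), set()).update(rooms)
    return result


# Hypothetical grid, shaped like what load_rows() would return.
sample = [
    ["Day", "8-9", "9-10"],
    ["Mon", "NR121, NR122", ""],
    ["Tue", "", "NC131"],
]
usage = occupied_rooms(sample)
```

With a grid like this, the rooms that are free in a slot are simply the full room list minus `usage[(day, slot)]`. Recovering the same cell boundaries from plain-text output would need extra layout heuristics.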

harshkhandeparkar commented 1 month ago

Not one I could find. camelot-py parses PDF tables and exports them to xlsx directly, which makes scraping easier, while other scrapers can only convert to plain text/HTML.

Can we use something like libreoffice to convert the pdf to a spreadsheet and then parse that using a recent library?

shikharish commented 1 month ago

libreoffice can't do that afaik. we can try using the API of some online tool like ilovepdf, smallpdf...

harshkhandeparkar commented 1 month ago

What about onlyoffice?

shikharish commented 1 month ago

don't think so.

harshkhandeparkar commented 1 month ago

Hmm, in that case we should write a Dockerfile to run the scraper in.