knub / onehundredandtwenty

Course Planning System for HPI masters

Parser changes for new HPI website #34

Closed gersseba closed 8 years ago

gersseba commented 8 years ago

Changes to support the new website; full support of UTF-8 characters; a simple guess for the short form of course names.
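For readers unfamiliar with what a "simple guess" for a short form could look like, here is a minimal sketch. It is not the code from this pull request: the function name `guess_short_name`, the `STOPWORDS` set, and the `max_len` parameter are all hypothetical, and the heuristic (initials of significant words, existing acronyms kept whole) is only one plausible approach.

```python
import re

# Hypothetical stop-word list (English and German); not taken from the PR.
STOPWORDS = {"in", "of", "and", "the", "for", "und", "der", "die", "das", "für"}


def guess_short_name(course_name: str, max_len: int = 6) -> str:
    """Guess an abbreviation from the initials of the significant words."""
    # \w matches Unicode letters on Python 3 str patterns, so umlauts survive.
    words = re.findall(r"\w+", course_name)
    significant = [w for w in words if w.lower() not in STOPWORDS]
    # Keep words that already are acronyms (e.g. "BPM") intact,
    # otherwise take the upper-cased first character.
    parts = [w if w.isupper() and len(w) > 1 else w[0].upper() for w in significant]
    return "".join(parts)[:max_len]


print(guess_short_name("Trends in BPM Research"))  # TBPMR
```

A guess like this would still need a manual override for courses whose established abbreviation differs from the generated one.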

knub commented 8 years ago

I am not sure whether it is a good idea to use the parser. We did this at the very beginning of 180, but quickly found out that everything needs to be hand-checked anyway. First, because the website is subject to many small changes and potential parser errors. Second, and most important: the website is not the authoritative source, the PDF documents are. The changes are not always in sync, and the data on the website is not always copied correctly.

gersseba commented 8 years ago

I see no reason not to use it. I don't know if you have tried adding the courses manually. I have, and it's a pain in the ass and very error-prone. For this semester Aggharta did it, and there are already at least two errors (Trends in BPM Research is called "BPMN", which is something different, and it was also added for WS15/16 instead of SS16). Neither of these would have happened if a parser had been used. There might be more. So for the first point: as long as it works on the website, it is faster & safer than manual work. When it doesn't work anymore, it will have to be adapted, as it was now. For the second point: It is sad that the website isn't always correct & up to date and that is an entirely different problem that should be tackled. But as long as no one is writing a parser for the PDF, I still believe that the web parser is more convenient and less faulty than the current approach. Also, the website has more information that the PDF is lacking.

knub commented 8 years ago

I don't know if you have tried adding the courses manually.

Yes. For four years now every semester ;-).

Adding the courses takes an hour of concentrated work per semester, with double- and triple-checking included. In return, I know that I can trust the data completely and do not need to double-check it when I am using the software.

it is faster & safer than manual work.

It's definitely not safer than proper manual work, which does not take too long on a per-semester basis, as I wrote above.

When it doesn't work anymore, it will have to be adapted, as it was now.

How will you know that something has broken? You'd have to manually check the website every semester and see whether it corresponds to the parser output. The website is maintained by people who do not care or know about machine readability. They will change the format when they need to, and they won't tell you.

For the second point: It is sad that the website isn't always correct & up to date and that is an entirely different problem that should be tackled

How do you want to tackle this?

In the end, I will not object if you really want to merge this, because I have finished. All I can say is that I wouldn't use the software if I knew the data had been extracted with a fragile script that parses HTML (no criticism of your work, rather a general observation from the circumstances). And an error in the data might have serious consequences for a student.