kshitij10496 / hercules

The mighty hero helping you build projects on top of IIT Kharagpur's academic data
https://hercules-10496.herokuapp.com/api/v1/static/index.html
MIT License
34 stars, 18 forks

Fix invalid room numbers #25

Closed themousepotato closed 5 years ago

themousepotato commented 5 years ago

Some room numbers have the value '0'; replace those with 'In Dept'. Some entries also read 'In Deptt'; replace those with 'In Dept' as well ;)
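A minimal sketch of that replacement, assuming each room is stored as a plain string. `normalize_room`, `INVALID_ROOMS` and `PLACEHOLDER` are hypothetical names for illustration, not taken from the hercules code:

```python
# Hypothetical helper; these names are illustrative and not from the hercules repo.
INVALID_ROOMS = {"0", "In Deptt"}  # the two bad values reported in this issue
PLACEHOLDER = "In Dept"


def normalize_room(room: str) -> str:
    """Return 'In Dept' for the known-bad values, else the trimmed room string."""
    cleaned = room.strip()
    if cleaned in INVALID_ROOMS:
        return PLACEHOLDER
    return cleaned
```

Keeping the bad values in one set means a newly discovered variant can be added without touching the function body.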

kshitij10496 commented 5 years ago

Awesome! This is part of the data validation step. Where do you suggest this check be added? 🤔

themousepotato commented 5 years ago

Immediately after scraping IMHO. If you want something more structured, you can write a validator that is called manually after the scraper runs. But that would be overkill.

kshitij10496 commented 5 years ago

Agreed! I think we should apply the known validation heuristics during scraping, before the data is written to the output JSON. The scraper code should be responsible for providing structured, usable and valid data from websites.
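One way this could look, as a rough sketch: clean each scraped record just before serializing it. The record shape (`{"code": ..., "rooms": [...]}`) and the names `ROOM_FIXES`, `clean_record` and `dump_courses` are assumptions made for this example; the real scraper output may be structured differently.

```python
import json

# Assumed mapping of known-bad room values to their replacement (from this issue).
ROOM_FIXES = {"0": "In Dept", "In Deptt": "In Dept"}


def clean_record(record: dict) -> dict:
    """Return a copy of the record with known-bad room values replaced."""
    rooms = [ROOM_FIXES.get(room, room) for room in record.get("rooms", [])]
    return {**record, "rooms": rooms}


def dump_courses(records: list) -> str:
    """Validate every scraped record, then serialize the result to JSON."""
    return json.dumps([clean_record(r) for r in records], indent=2)
```

Because `clean_record` returns a copy, the raw scraped data stays untouched, which makes it easy to compare before and after while debugging.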

Pikachu920 commented 5 years ago

I'll have a go... The code that needs changing is here, right? https://github.com/kshitij10496/hercules/blob/01e93c11248968947cf0786d2eb2694aeec2265d/data/scrapper/course_rooms.py#L82-L86

kshitij10496 commented 5 years ago

Hey @Pikachu920! 👋 Thanks for picking this up.

> The code that needs changing is here, right?

Yes, I think so too. A validation check here should be the ideal way to fix this.

Thoughts, @themousepotato?

themousepotato commented 5 years ago

Sorry for the late comment. @kshitij10496 You're right. @Pikachu920 That's exactly the validation part. Thanks for taking the time to point that out :)

kshitij10496 commented 5 years ago

@Pikachu920 Are you still interested in fixing this? 😄

Pikachu920 commented 5 years ago

absolutely -- i'll try it now!