nikel-api / nikel-datasets

A collection of datasets for Nikel
MIT License

Pointless Requests #1

Open · Multivalence opened this issue 2 years ago

Multivalence commented 2 years ago

The nikel API currently gets its data from data dumps that the maintainer enters manually from time to time. Because of this, it is faster to take the data dump directly from this repo than to send HTTP requests to the API. Could an auto-updating data system be implemented using web scraping and some of UofT's public APIs? That would keep the data current and give people a reason to use the API rather than just pulling data from this repo.
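
For illustration, a minimal sketch of such an auto-updating job might look like the following (the endpoint URL and output path are hypothetical placeholders, not real UofT endpoints):

```python
# Hypothetical sketch of an auto-updating fetch job. The endpoint URL and
# output path below are made up for illustration; UofT's real APIs differ.
import json

import requests

COURSES_URL = "https://example.utoronto.ca/api/courses"  # placeholder endpoint
OUT_PATH = "courses.json"


def refresh_courses() -> None:
    """Fetch the latest course data and overwrite the local dump."""
    resp = requests.get(COURSES_URL, timeout=30)
    resp.raise_for_status()
    with open(OUT_PATH, "w") as f:
        json.dump(resp.json(), f, indent=2)


if __name__ == "__main__":
    refresh_courses()  # could run on a schedule, e.g. a nightly cron job
```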

I understand that the UofT websites and public APIs keep changing, which would require frequent maintenance of the API. I'm happy to help with that maintenance if so.

Thank you!

darenliang commented 2 years ago

Thanks for raising this.

I agree that it's usually better and faster to use the data dump. But it's largely a matter of preference: if web extensions are using the dataset, it may be preferable to request data via a web API. The nikel-core web server does a good job of serving data from the datasets and employs a cache to speed up lookups where necessary. Cloudflare is also used to add a light edge cache for frequently requested data.
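
As a rough illustration of the lookup-cache idea (this is not nikel-core's actual implementation, and the file name and record shape are assumed):

```python
# Illustrative in-memory lookup cache over a local dataset dump; not
# nikel-core's actual implementation. Assumes courses.json is a list of
# dicts each carrying a "code" field.
import json
from functools import lru_cache

with open("courses.json") as f:
    COURSES = json.load(f)


@lru_cache(maxsize=1024)
def lookup_course(code: str):
    # Linear scan on a cache miss; repeated queries for popular courses
    # are then served straight from the cache.
    return next((c for c in COURSES if c.get("code") == code), None)
```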

Regarding the auto-updating data system, I would like that to be the case. The dataset parsing is really messy and riddled with edge cases. I've tried updating the datasets from time to time, but after a while it usually requires manually changing the parsing logic because of changes on UofT's end.
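
To give a flavour of the kind of edge cases involved, a defensive parsing helper might look like this (the field name and the value shapes shown are hypothetical):

```python
# Hypothetical example of defensive field parsing; the field and its
# possible shapes are invented for illustration.
from typing import Optional


def parse_enrolment(raw: Optional[str]) -> Optional[int]:
    """Normalize an enrolment count that can arrive in several shapes."""
    if raw is None or raw.strip() in ("", "N/A", "TBA"):
        return None  # some pages omit the value entirely
    try:
        return int(raw.replace(",", ""))  # e.g. "1,200" -> 1200
    except ValueError:
        return None  # unrecognized format: fail soft rather than crash
```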

Rewriting the parser might be the best option we have at the moment, but that will require a lot of work (discovering new data sources and making the parsing more accurate). The current parser uses a combination of JSON requests, HTML parsing, and Selenium. I'm hoping we won't need to parse HTML pages or use Selenium, but that might not be possible.
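
Sketched roughly, those three strategies could be structured like this (URLs and the overall layout are illustrative, not the actual parser):

```python
# Illustrative sketch of the three fetching strategies mentioned above,
# in order of preference. Not the actual parser's code.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver


def fetch_json(url: str) -> dict:
    # Preferred path: a plain JSON endpoint, cheap and stable to parse.
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json()


def fetch_html(url: str) -> BeautifulSoup:
    # Fallback: parse server-rendered HTML.
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "html.parser")


def fetch_rendered(url: str) -> BeautifulSoup:
    # Last resort: drive a real browser for JavaScript-rendered pages.
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()
```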