tangledhelix / dp_pp_utils

Utility stuff for my Distributed Proofreaders Post-Processing work
MIT License

Convert web scraping task(s) to use REST API #17

Open tangledhelix opened 6 months ago

tangledhelix commented 6 months ago

Anywhere an HTML fetch & parse is happening, particularly in make_project.py, try to use the REST API instead.

tangledhelix commented 6 months ago

https://github.com/DistributedProofreaders/dproofreaders/blob/master/api/USERS_GUIDE.md

tangledhelix commented 6 months ago

GET /projects/{projectID} will supply:

I don't see the forum link in the output. No other API calls seem like they'd have it either.

I see no API calls that could fetch the text or image files.

Does the API key let you into the web site as well or only into the API methods?

tangledhelix commented 6 months ago

Auth is via a header X-API-KEY with a value of the key.
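
A minimal sketch of what an authenticated call could look like, using only the standard library. The base URL, the function name `build_project_request`, and the `Accept` header are assumptions for illustration; only the `X-API-KEY` header and the `GET /projects/{projectID}` path come from the notes above.

```python
import urllib.request

# Assumed API root; replace with the root of the site actually being queried.
API_BASE = "https://www.pgdp.net/api/v1"


def build_project_request(project_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for one project.

    Hypothetical helper: the key is passed in the X-API-KEY header.
    """
    url = f"{API_BASE}/projects/{project_id}"
    headers = {
        "X-API-KEY": api_key,          # auth header described in this issue
        "Accept": "application/json",  # assumption: ask for JSON explicitly
    }
    return urllib.request.Request(url, headers=headers, method="GET")


# Sending it would look like this (not executed here):
# with urllib.request.urlopen(build_project_request("projectID123", key)) as resp:
#     body = resp.read().decode("utf-8")
```

Building the request separately from sending it keeps the header logic easy to check without touching the network.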

tangledhelix commented 6 months ago

make_project.py now uses the API to fetch the title, author, and comments. The forum link still has to be scraped. Project comments are now added automatically; previously this was done by hand.
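
Roughly, the split between API and scraping could be handled as below. This is a sketch and not the code in make_project.py: the helper name `extract_project_fields` is hypothetical, and the JSON keys `title`, `author`, `comments`, and `forum_link` are assumed names, so check them against the real response.

```python
import json


def extract_project_fields(payload: str) -> dict:
    """Pull the fields of interest out of a project JSON response.

    Key names are assumptions. forum_link comes back as None when the
    response does not include it, which tells the caller to keep scraping.
    """
    data = json.loads(payload)
    return {
        "title": data.get("title", ""),
        "author": data.get("author", ""),
        "comments": data.get("comments", ""),
        "forum_link": data.get("forum_link"),  # None -> fall back to scraping
    }


# Made-up sample payload, for illustration only.
sample = '{"title": "Example Title", "author": "A. Writer", "comments": "<p>Notes</p>"}'
fields = extract_project_fields(sample)
needs_scrape = fields["forum_link"] is None
```

Returning `None` for a missing forum link means the scraping fallback can be dropped later without changing the callers, if the upstream API ever adds the field.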

tangledhelix commented 6 months ago

Nothing else to do here unless/until the upstream API exposes the forum_link as well.