Convert web scraping task(s) to use REST API

tangledhelix / dp_pp_utils

Utility stuff for my Distributed Proofreaders Post-Processing work

MIT License


Convert web scraping task(s) to use REST API #17

Open tangledhelix opened 6 months ago

tangledhelix commented 6 months ago

Particularly in make_project.py, anywhere that an HTML fetch & parse is happening, try to use the REST API instead.

tangledhelix commented 6 months ago

- pgdp_login() can likely be replaced by a function that creates an HTTP session to the API
  - If not, then a new function to do that should be added
- Rewrite scrape_project_info(). Currently it is responsible for grabbing:
  - Book title
  - Author
  - Forum link
- Look at download_text() and download_images() and see if API replacements exist

tangledhelix commented 6 months ago

https://github.com/DistributedProofreaders/dproofreaders/blob/master/api/USERS_GUIDE.md

tangledhelix commented 6 months ago

GET /projects/{projectID} will supply:

- title
- author
- comments, which could be interpolated into README.md during rendering
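As a rough sketch of how those three fields might be pulled out of the response, assuming the endpoint returns a JSON object with top-level keys named title, author and comments (the key names follow the list above; the function name extract_project_info and the sample values are hypothetical):

```python
import json


def extract_project_info(response_body: str) -> dict:
    """Pick the fields of interest out of a project JSON document.

    Assumes the body is a JSON object with top-level "title", "author"
    and "comments" keys; missing keys fall back to an empty string.
    """
    data = json.loads(response_body)
    return {
        "title": data.get("title", ""),
        "author": data.get("author", ""),
        "comments": data.get("comments", ""),
    }


if __name__ == "__main__":
    # Hypothetical sample payload, not real API output.
    sample = '{"title": "Some Book", "author": "A. Writer", "comments": "Notes"}'
    print(extract_project_info(sample))
```

Any other keys in the response are ignored, so the helper keeps working if the API adds fields later.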

I don't see the forum link in the output. No other API calls seem like they'd have it either.

I see no API calls that could fetch the text or image files.

Does the API key let you into the web site as well or only into the API methods?

tangledhelix commented 6 months ago

Auth is via a header X-API-KEY with a value of the key.
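A minimal sketch of what an authenticated call could look like using only the standard library. The X-API-KEY header and the /projects/{projectID} path come from this thread; the base URL, the function names and the Accept header are assumptions, so check them against the users guide before relying on this.

```python
import json
import urllib.request

# Assumed base URL; confirm against the API users guide.
API_BASE = "https://www.pgdp.net/api/v1"


def build_project_request(project_id: str, api_key: str,
                          base_url: str = API_BASE) -> urllib.request.Request:
    """Build (but do not send) a GET request for one project."""
    url = f"{base_url}/projects/{project_id}"
    headers = {
        "X-API-KEY": api_key,          # auth header named in this thread
        "Accept": "application/json",  # assumption: the API returns JSON
    }
    return urllib.request.Request(url, headers=headers)


def fetch_project(project_id: str, api_key: str,
                  base_url: str = API_BASE) -> dict:
    """Send the request and decode the JSON body into a dict."""
    req = build_project_request(project_id, api_key, base_url)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

Keeping request construction separate from the network call makes the header handling easy to check without hitting the server.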

tangledhelix commented 6 months ago

Implemented the API in make_project.py to fetch the title, author, and comments. Scraping is still needed to get the forum link. Project comments are now added as well; previously this was done by hand.
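For illustration only, interpolating the fetched fields into README.md could look something like the sketch below. The template layout and the names README_TEMPLATE and render_readme are hypothetical and not taken from make_project.py.

```python
from string import Template

# Hypothetical README layout; the real template in the repo may differ.
README_TEMPLATE = Template(
    "# $title\n"
    "\n"
    "Author: $author\n"
    "\n"
    "## Project comments\n"
    "\n"
    "$comments\n"
)


def render_readme(info: dict) -> str:
    """Fill the template from a dict with title, author and comments keys."""
    return README_TEMPLATE.substitute(
        title=info.get("title", ""),
        author=info.get("author", ""),
        comments=info.get("comments", ""),
    )


if __name__ == "__main__":
    # Hypothetical values standing in for the API response.
    print(render_readme({"title": "Some Book", "author": "A. Writer",
                         "comments": "Notes from the project manager."}))
```

If the comments field turns out to contain markup, it may need converting or escaping before being written into the README.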

tangledhelix commented 6 months ago

Nothing else to do here unless/until the upstream API has the forum_link as well.