hillsider-admins / hillsider-admins.github.io

Collaborative build of hillsiders.org – a project of the Hills Admins
https://hillsiders.org

Replace web scraping with API #17

Closed by 5jt 2 years ago

5jt commented 3 years ago

Background

The server-side script parking-suspensions.sh scrapes four web pages on the Camden website to find parking suspensions. It depends on the exact way that Camden marks up the suspensions in HTML tables, so any change to that markup could break the script. (This is an inherent vulnerability of scraping web pages.)

Camden provides a data portal with an API (application programming interface) through which other programs can request data from its databases. Unlike the format of its web pages, which it changes without notice, the format of the data supplied through the API is stable and supported by Camden.

Expected behaviour

Retrieves parking suspensions from Camden in a stable format.

Observed behaviour

Scrapes data from HTML in Camden web pages. May break without warning.

Possible fix

Query Camden’s API instead of scraping its web pages. AJAX is a way for JavaScript to reach out to other web sites and services, so the page itself could make the queries.

5jt commented 3 years ago

The Camden API could be used in either of two ways:

  1. On the server, in parking-suspensions.sh: replace calls to the web pages with queries to the Open Data API.
  2. In the web page JavaScript: replace the call to /data/table.html with queries to the Open Data API.

Option (2) would mean the server-side script could be removed entirely. Each user’s browser would call Camden for up-to-the-minute parking records. Instead of a query every hour, there would be 1-4 queries (depending on the API) every time a browser visited hillsiders.org.

Would that overburden Camden’s data portal? Our site has few users; the result might be fewer queries than the 48 page requests our server now makes every day. On the other hand, web spiders like Google crawl over the site all the time. Each visit would trigger parking-suspension queries. That could be mitigated by using a button on the page to trigger the script, so queries would go to Camden only at a user’s request.
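
To put rough numbers on that comparison, here is a small sketch using only the figures in this comment: 48 page requests a day from the server now, and 1-4 queries per browser visit. The function name `break_even_visits` is made up for illustration, and spider traffic is not modelled.

```python
# Page requests the server makes per day now (figure quoted in this comment).
SERVER_REQUESTS_PER_DAY = 48


def break_even_visits(queries_per_visit: int) -> int:
    """Most daily visits for which browser-side queries stay at or below the current load."""
    return SERVER_REQUESTS_PER_DAY // queries_per_visit


# The comment estimates 1-4 queries per visit, so check both ends of that range.
for queries in (1, 4):
    print(queries, "queries per visit ->", break_even_visits(queries), "visits per day")
```

On these figures, browser-side querying sends Camden more traffic than the current script once the page gets more than 12 visits a day at 4 queries per visit, or more than 48 at 1 query per visit. Visits by spiders count towards that total unless the queries sit behind a button.
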

5jt commented 2 years ago

https://opendata.camden.gov.uk/resource/av3b-8trg.json

wget -O qry.json 'https://opendata.camden.gov.uk/resource/av3b-8trg.json?$query=SELECT space_identifier,suspension_reference,suspension_start_date,suspension_end_date,road_name WHERE road_name = "PARLIAMENT HILL"'

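
For comparison, the same request could be built in Python with the query percent-encoded rather than left with raw spaces in the URL. This is a minimal sketch: the endpoint, column names and query text are copied from the wget command above, while the function names `build_query_url` and `fetch_suspensions` and the guard against quotes in the road name are assumptions made for illustration.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Endpoint and column names are copied from the wget command above.
ENDPOINT = "https://opendata.camden.gov.uk/resource/av3b-8trg.json"
COLUMNS = [
    "space_identifier",
    "suspension_reference",
    "suspension_start_date",
    "suspension_end_date",
    "road_name",
]


def build_query_url(road_name: str) -> str:
    """Return the request URL for one road, with the query percent-encoded."""
    if '"' in road_name:
        # Hypothetical guard: this sketch does not try to escape quotes.
        raise ValueError("road name must not contain a double quote")
    query = f'SELECT {",".join(COLUMNS)} WHERE road_name = "{road_name}"'
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return ENDPOINT + "?" + urlencode({"$query": query}, quote_via=quote)


def fetch_suspensions(road_name: str) -> list:
    """Request the rows for one road and parse the JSON response."""
    with urlopen(build_query_url(road_name), timeout=30) as response:
        return json.load(response)
```

Nothing is fetched when the block is loaded; only a call such as `fetch_suspensions("PARLIAMENT HILL")` would contact Camden.
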
5jt commented 2 years ago

The OpenData table has no column that corresponds to the Location column in the HTML we are scraping. Nor can I see another table that maps parking bays to such text descriptions.

Without these text descriptions of the locations we can’t replace scraping the web pages.
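
One way to repeat that check against rows returned by the API is to list every key that appears and look for anything resembling a location column. A rough sketch: the helper names are hypothetical, the sample row's keys are the columns named in the earlier wget query, and its values (apart from the road name) are placeholders, not real records.

```python
def column_names(rows):
    """Return the sorted union of keys across all rows."""
    names = set()
    for row in rows:
        names.update(row)
    return sorted(names)


def has_location_column(rows):
    """True if any column name contains 'location' (case-insensitive)."""
    return any("location" in name.lower() for name in column_names(rows))


# Placeholder row: keys come from the earlier wget query; the values,
# apart from the road name, are invented for illustration only.
sample_rows = [
    {
        "space_identifier": "EXAMPLE-BAY",
        "suspension_reference": "EXAMPLE-REF",
        "suspension_start_date": "2000-01-01T00:00:00.000",
        "suspension_end_date": "2000-01-02T00:00:00.000",
        "road_name": "PARLIAMENT HILL",
    }
]

print(column_names(sample_rows))
print(has_location_column(sample_rows))
```

Taking the union of keys over all rows, rather than reading only the first row, guards against a column being missed because one row happens to lack it.
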

5jt commented 2 years ago

Location string

Camden’s web pages with tabulated parking suspensions include a Location column with a string such as "Outside No 63". Please include this column either in the dataset or in the table of parking bays.

This is the second request for this data. A response would be appreciated.

Regards,
Stephen Taylor FRSA
chair@hampsteadforum.org

5jt commented 2 years ago

https://opendata.camden.gov.uk

5jt commented 2 years ago

The interactive map seems popular. Tabulating suspensions beneath it would add little value, but would further slow the page as it queried Camden’s database through the API.