jacobbudin / city-score

Score and rank US cities and towns to find the best city for you 📍
MIT License

Add nature data source #1

Open jacobbudin opened 1 year ago

jacobbudin commented 1 year ago

e.g., National parks, state parks, NatureScore, Rails-to-Trails

grelas commented 12 months ago

Could take an initial stab at this one :)

NatureScore looks like it only takes a complete address, so I'm not sure how you'd get a score for an entire city.

For TrailLink, I don't see an API available, but is the idea to just request the page and grab the result count? e.g., https://www.traillink.com/trailsearch/?city=Portland&state=OR. I could see this taking forever if we're trying to make HTTP requests for each city...

Were you thinking of having an abstract Nature source (e.g., nature.py) that would aggregate results from multiple sources (TrailLink, NatureScore, etc.), or individual sources like traillink.py, naturescore.py, etc. that you'd import and configure?

jacobbudin commented 12 months ago

@grelas Do it.

My design here: Sources should generally have names that reflect "who" (the data comes from), not "what" (the data provides), so that, in theory, if I wanted to run my scores using Redfin (not Zillow) data, I could then swap:

-from city_score.sources.zillow import *
+from city_score.sources.redfin import *

sources = (
    PeopleForBikes,
-   Zillow,
+   Redfin,
)

These would provide the same functions (e.g., maximum_median_home_price) and would "just work" with no other changes.
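For illustration, here's a rough sketch of two interchangeable sources (the function body and attribute names are made up; only the shared name and signature matter):

# city_score/sources/zillow.py (hypothetical shape, for illustration only)
def maximum_median_home_price(price):
    # Assumed criterion factory: keeps cities whose median home price
    # is at or below `price`. The city attribute name is made up.
    def criterion(city):
        return city.median_home_price <= price
    return criterion

# city_score/sources/redfin.py would define maximum_median_home_price
# with the same signature, backed by Redfin data, so swapping the import
# is the only change a user needs to make.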

Good luck. 🙏🏻

¹ With no throttling and a large number of requests, some firewall will eventually block you.

grelas commented 12 months ago

@jacobbudin Here's a draft PR for adding a TrailLink source: https://github.com/jacobbudin/city-score/pull/7/files. I'm pretty weak at writing Python, so all feedback is welcome (including nit stuff).

The approach was to request the page and parse a hidden input value - e.g., <input id="hidtotalcount" name="ResultsCount" type="hidden" value="115" /> - which represents the number of trails returned by TrailLink. I didn't see an underlying endpoint to retrieve the data.
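Roughly, the parsing looks like this (a simplified sketch using requests and BeautifulSoup, not the exact PR code):

import requests
from bs4 import BeautifulSoup

def trail_count(city, state):
    # Fetch the TrailLink search results page and read the hidden
    # ResultsCount input, which holds the number of trails found.
    response = requests.get(
        'https://www.traillink.com/trailsearch/',
        params={'city': city, 'state': state},
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    field = soup.find('input', id='hidtotalcount')
    return int(field['value']) if field else 0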

Kinda want to look around to see if there are other sources that could provide some more info...

grelas commented 12 months ago

NatureScore does actually accept lat/long coordinates in place of an address, which I see we have available via city.coordinates.

For example,

curl -H "X-Api-Key: MsZBURNDcW9805N66Ivv3DQU6x5gczF3UL29VsWj" https://api.naturescore.io/naturescore?latitude=45.442919&longitude=-122.615092

which returns

{
    "score": 49.2,
    "latitude": 45.442919,
    "longitude": -122.615092,
    "leaf_rating": {
        "rating": 3,
        "classification": "Nature Adequate",
        "description": "Balanced mix of health-advantageous natural and built environmental elements. Modest effort required for immersive nature exposure opportunities."
    },
    "who_greenspace_accessible": false,
    "address": ""
}
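So a source could, roughly, do something like this (a sketch; it assumes city.coordinates is a (latitude, longitude) tuple and matches the endpoint and parameters from the curl example above):

import requests

def nature_score(city, api_key):
    # Assumes city.coordinates is a (latitude, longitude) tuple,
    # as mentioned above; returns the top-level "score" field.
    latitude, longitude = city.coordinates
    response = requests.get(
        'https://api.naturescore.io/naturescore',
        params={'latitude': latitude, 'longitude': longitude},
        headers={'X-Api-Key': api_key},
    )
    response.raise_for_status()
    return response.json()['score']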

Alternatively, we could use the Google Maps Places API to search given lat/long coordinates, a keyword (e.g., "parks"), and a radius (e.g., 10 miles): https://developers.google.com/maps/documentation/places/web-service/search-nearby#nearby-search-responses.

I kinda like the idea of nearby_parks as a dimension. And possibly a nature_score dimension.
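As a rough, untested sketch, a nearby_parks lookup could look like this (the request parameters are the documented Nearby Search ones; the helper name and city.coordinates usage are just illustrative):

import requests

def nearby_parks(city, api_key, radius_miles=10):
    # Places Nearby Search takes a lat/long, a radius in meters
    # (Google caps it at 50,000), and an optional keyword.
    latitude, longitude = city.coordinates
    response = requests.get(
        'https://maps.googleapis.com/maps/api/place/nearbysearch/json',
        params={
            'location': f'{latitude},{longitude}',
            'radius': min(int(radius_miles * 1609.34), 50000),
            'keyword': 'parks',
            'key': api_key,
        },
    )
    response.raise_for_status()
    # Each page holds up to 20 results; paging via next_page_token is omitted.
    return len(response.json().get('results', []))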

jacobbudin commented 12 months ago

@grelas Re NatureScore: Great find. I suspect an issue, though, is that most cities will produce large variations in scores depending on the precise coordinates/address you use. e.g., in Long Island City, as you know, you have leafy residential areas very close to fulfillment centers and auto repair shops (Google Maps). I'd be happy to be wrong on this, though.

[Screenshot attached, 2023-07-08]

Re Google Maps/parks: Into it. I'd be curious if you could aggregate the result somehow in a way that's "fair". Example (made up numbers):

grelas commented 12 months ago

> Re Google Maps/parks: Into it. I'd be curious if you could aggregate the result somehow in a way that's "fair". Example (made up numbers):

Hmm good point. Maybe we can determine median park size and factor in population? This assumes "access to larger parks = better".

Did find this ParkScore tool https://www.tpl.org/parkscore that factors in a lot of data. Here are example reports:

Think I'm describing the "Acreage" factor here.

Anyway, I couldn't figure out how (or if) we can utilize this data - see https://www.tpl.org/park-data-downloads

jacobbudin commented 12 months ago

@grelas

> Maybe we can determine median park size and factor in population? This assumes "access to larger parks = better".

Park acreage per capita sounds good. If you can't find an exact formula that seems right, or want to provide options, you can add a keyword argument to reflect that (e.g., Yelp provides exact, Snowpak provides exclude, to give the user some options/discretion).
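Not this repo's actual API, just a sketch of what a formula-selecting keyword argument could look like (the attribute names are made up):

def park_acres_per_capita(city, formula='per_capita'):
    # Hypothetical helper: `formula` gives the user some discretion over
    # how acreage is weighed, analogous to Yelp's `exact` and Snowpak's
    # `exclude` keyword arguments.
    if formula == 'per_capita':
        return city.park_acres / city.population
    # e.g., formula='median' could fall back to median park size instead
    return city.median_park_acres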

I do think TPL's "% who live within a 10-minute walk" is an interesting and useful result that does not need further manipulation.

> Did find this ParkScore tool https://www.tpl.org/parkscore that factors in a lot of data.

TPL's data looks good. I'd scrape it from the web. They score cities that don't appear in the PDF or data tables (e.g., Waterbury, CT). That said, they don't score every city (e.g., Waterbury, VT is "not enabled", whatever that means).

I'll review your TrailLink PR later, but web APIs and scrapes shouldn't provide @criterion functions, because if a user inadvertently uses one as their first criterion, the app will send 28k requests, one for each US city and town. (So if you do scrape TPL, don't implement @criterion.)

grelas commented 12 months ago

@jacobbudin Took a stab at adding this standalone TrustForPublicLand source: https://github.com/jacobbudin/city-score/pull/12. Wondering how this is going to work when scraping thousands of pages... we'll eventually either exceed API limits or get blocked 😅
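Maybe some naive throttling and caching on our side would soften that? Something like (purely illustrative, not part of the PR):

import time
import requests

_session = requests.Session()
_cache = {}

def fetch_throttled(url, delay_seconds=1.0):
    # Naive throttle-and-cache: sleep between requests and reuse earlier
    # responses so reruns don't hit the same page twice.
    if url not in _cache:
        time.sleep(delay_seconds)
        response = _session.get(url)
        response.raise_for_status()
        _cache[url] = response.text
    return _cache[url]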