waldoj / frostline

A dataset, API, and parser for USDA plant hardiness zones.
https://phzmapi.org/
MIT License

Explore creating a geodata API #24

Open waldoj opened 7 years ago

waldoj commented 7 years ago

It would be useful to let people issue queries using lat/lon instead of just ZIP. But I'm only interested in doing so if I can use a very lightweight hosting process, as per the rest of Frostline.

waldoj commented 7 years ago

Theory: I can use PRISM's geodata to generate static files for lat/lon pairs at a reasonable resolution. Queries would then be issued at a prescribed level of specificity (e.g., no more than 0.1 degrees, or about 11 kilometers: "39.1, 94.6"). By generating a file for each pair (e.g., 39.1,94.6.json), a static API can serve up all of the data.
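A minimal sketch of how a query could be snapped onto that scheme (the rounding rule and filename pattern here are assumptions based on the example above, not settled decisions):

```python
# Sketch: map an arbitrary lat/lon query onto a 0.1-degree static file.
# The filename pattern ("{lat},{lon}.json") mirrors the example above;
# the rounding rule is an assumption.

def query_filename(lat, lon, precision=1):
    """Round a lat/lon pair to the prescribed precision and build the
    static filename a client would request (e.g. '39.1,94.6.json')."""
    lat_r = round(lat, precision)
    lon_r = round(lon, precision)
    return f"{lat_r},{lon_r}.json"

print(query_filename(39.1234, 94.5678))  # -> "39.1,94.6.json"
```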

waldoj commented 7 years ago

The simplest way to do this is probably via the provided ARC/INFO ASCII grid files. Those files represent Puerto Rico as 557 columns by 170 rows, or 94,690 files for that one territory. (Note that each record is not a hardiness zone, but instead the minimum temperature in Celsius, multiplied by 100; it's necessary to convert that to a PHZ before writing out the data.) This implies rather a large number of files for the entire U.S.
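A hypothetical sketch of that conversion, assuming the values really are Celsius × 100 as described and converting through Fahrenheit, since the USDA half-zones are defined as 5°F bands (this is not Frostline code, just an illustration):

```python
# Sketch of converting a raw ASC cell value to a hardiness half-zone.
# Assumes, per the note above, that each value is the minimum temperature
# in Celsius multiplied by 100; the zone boundaries (5 degrees F per
# half-zone, starting at -60 F for zone 1a) are the standard USDA bands.

NODATA = -9999

def value_to_zone(raw):
    """Convert a raw grid value to a zone string like '7a', or None."""
    if raw == NODATA:
        return None
    temp_c = raw / 100.0
    temp_f = temp_c * 9.0 / 5.0 + 32.0
    # Zone 1a starts at -60 F; each full zone spans 10 F, each half 5 F.
    # (Values outside zones 1-13 aren't clamped in this sketch.)
    offset = temp_f + 60.0
    zone_number = int(offset // 10) + 1
    half = "a" if (offset % 10) < 5 else "b"
    return f"{zone_number}{half}"
```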

waldoj commented 7 years ago

Of course, it remains to map each cell to a physical location, but this looks easy. Each ASC file opens with metadata like this:

xllcorner -67.31875000000
yllcorner 17.86875000000
cellsize 0.00416666667

So we simply add 0.00416666667 to -67.31875 as we advance through each column. For rows, note that yllcorner is the lower-left corner while the rows in the file run north to south, so we start from 17.86875 plus (the number of rows × 0.00416666667) at the top and subtract 0.00416666667 as we advance through each row.
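A sketch of that mapping, assuming the standard ASC header fields (ncols, nrows, xllcorner, yllcorner, cellsize, NODATA_value); because yllcorner is the lower-left corner, the top row's latitude is computed from the row count:

```python
# Sketch: read an ARC/INFO ASCII grid and compute the coordinates of a cell.
# Header keys are the standard ASC fields; the file path is hypothetical.

def read_asc(path):
    """Return (header, rows) where rows is a list of lists of raw values."""
    with open(path) as f:
        header = {}
        for _ in range(6):
            key, value = f.readline().split()
            header[key.lower()] = float(value)
        rows = [line.split() for line in f if line.strip()]
    return header, rows

def cell_coordinates(header, row, col):
    """Return (lat, lon) of a cell center. Rows are listed north to south,
    so latitude is measured down from the top of the grid."""
    cellsize = header["cellsize"]
    lon = header["xllcorner"] + (col + 0.5) * cellsize
    lat = header["yllcorner"] + (header["nrows"] - row - 0.5) * cellsize
    return lat, lon
```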

waldoj commented 7 years ago

One concern I have is about the resolution this yields. Is it reasonably round-able? Will there be collisions? And if so, how do we handle them?

waldoj commented 7 years ago

The continental U.S. is 7,025 columns by 3,105 rows, or 21,812,625 files. That certainly is a very large number of files. Hawaii is another 1,077,008 records, and Alaska another 808,505 (an oddly small number), for a total of 23,792,828 records, including Puerto Rico's 94,690. Some percentage of these will be blank (that is, have a value of -9999), although the approximate shape of the U.S. means that it will not be an enormous percentage. Perhaps 20%, thanks to Florida and Maine? So that leaves about 19 million records, should no simplification be performed.
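Rather than guessing at the blank percentage, it can be measured directly from a grid file; a quick sketch (the filename is hypothetical):

```python
# Sketch: measure what fraction of cells in an ASC grid are NODATA (-9999)
# instead of estimating it.

def nodata_fraction(path, nodata="-9999"):
    total = blank = 0
    with open(path) as f:
        for _ in range(6):          # skip the six header lines
            f.readline()
        for line in f:
            for value in line.split():
                total += 1
                if value == nodata:
                    blank += 1
    return blank / total

# print(nodata_fraction("conus_grid.asc"))  # hypothetical filename
```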

waldoj commented 7 years ago

It turns out that these files are at different resolutions. The continental U.S. is at 800-meter resolution, Hawaii and Puerto Rico are at 400-meter resolution, and Alaska is at 4,000-meter resolution. That raises the interesting question of how to apply consistent data-density standards across the board.

The continental U.S. data has a cell size of 0.00833333333, or just under 0.01 degrees. At a resolution of 0.1 degrees, we'd be using every 13th cell, or 1/169th of the entire dataset. That would leave us with a very manageable 104,142 JSON files for the continental U.S. (Whether that is sufficient resolution to accurately capture PHZ is a different question.)
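If sampling every Nth cell is the approach, the decimation itself is trivial; a sketch, using the stride from the arithmetic above (whether 13 is exactly the right step depends on how the 0.1-degree grid is anchored):

```python
# Sketch: decimate a parsed ASC grid by taking every Nth cell in each
# direction. `rows` is a list of lists of raw values, as read from the file.

def decimate(rows, stride=13):
    """Keep every `stride`-th cell in both dimensions."""
    return [row[::stride] for row in rows[::stride]]
```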

waldoj commented 7 years ago

If we had a resolution of 0.01 degrees, that would leave us with about 14,666,080 records for the continental U.S., or 140 times more than at 0.1 degrees of resolution.

waldoj commented 7 years ago

Each of the 26 zones represents a spread of 5°F, so these zones are not particularly refined. Spot-checking some rows from the data, I feel good that a resolution of 0.1 degrees is adequate.

However, aggregating lat/lon pairs is inherently going to result in some inaccuracies, for places on the bubble. For instance, these two stanzas

-18 -18 -18 -19 -19 -19 -19 -19 -20 -20 -20 -20 -20
-20 -20 -19 -19 -19 -19 -19 -18 -18 -19 -20 -20 -20

would be reduced to -18 and -20 (assuming we sampled the first entry), or 5a and 4b, respectively. But 5 of the entries in the first stanza are actually 4b, and 8 of the entries in the second stanza are actually 5a. Averaging doesn't help this problem: in this instance, the average of both is -19, which would leave every one of these places in zone 5a, even though 10 of the 26 are in zone 4b.

Basically, the question here is what level of accuracy is acceptable. Perhaps it's worth a reduction in accuracy to reduce the number of files by 99.5%. But is it worth any reduction in accuracy to cut out just 17% of records?
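The trade-off is easy to reproduce with the two stanzas above, using the zone assignments quoted there (-18 and -19 as 5a, -20 as 4b) in place of a full conversion:

```python
# Toy comparison of sampling vs. averaging, using the two stanzas above and
# the zone assignments quoted there (-18/-19 -> 5a, -20 -> 4b).

ZONES = {-18: "5a", -19: "5a", -20: "4b"}

row1 = [-18, -18, -18, -19, -19, -19, -19, -19, -20, -20, -20, -20, -20]
row2 = [-20, -20, -19, -19, -19, -19, -19, -18, -18, -19, -20, -20, -20]

for row in (row1, row2):
    sampled = ZONES[row[0]]                       # take the first entry
    averaged = ZONES[round(sum(row) / len(row))]  # round the mean
    mismatches = sum(1 for v in row if ZONES[v] != sampled)
    print(f"sampled={sampled} averaged={averaged} "
          f"{mismatches}/{len(row)} cells misread by sampling")
```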

waldoj commented 7 years ago

The next thing to do is some benchmarking—figure out how much data we're talking about here, and how long it will take to generate those files.

waldoj commented 7 years ago

If we use the same file format as we do for the ZIPs (unnecessarily repeating the lat/lon pair within the JSON), that's 92 bytes per file, or 1.7 GB of data for every data point in the entire U.S. (assuming, as always, that 20% of data points in the ASC files have no value). That's not bad. If we used a resolution of 0.1 degrees, that would be a mere 8 MB in total.

If we eliminate the repeated lat/lon pair, that brings the total size down to 831 MB.
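For what it's worth, those per-file sizes are easy to sanity-check against a sample record; the field names below are an assumption about the ZIP-style format, not copied from it:

```python
import json

# Rough size check. The field names are an assumption about the ZIP-style
# record; the point is only the with/without-coordinates comparison.

with_coords = {"zone": "6a", "temperature_range": "-10 to -5",
               "coordinates": {"lat": 39.1, "lon": -94.6}}
without_coords = {"zone": "6a", "temperature_range": "-10 to -5"}

for record in (with_coords, without_coords):
    payload = json.dumps(record, separators=(",", ":"))
    print(len(payload), "bytes:", payload)
```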

waldoj commented 7 years ago

An advantage of using the native file resolutions is that it largely lets us sidestep the question of how to reconcile three different resolutions. But it may not help with the question of the prescribed degree of precision in queries.

That the native continental resolution is just below 0.01 degrees means that, strictly, we would need three decimal places. This is problematic, though, because we wind up with vast swaths of namespace that are blank (e.g., we have a record for "39.101, 94.655," but not for the requested "39.102, 94.655"). This indicates that—again, just for the continental U.S.—we need to round to two decimal places. Basically, we are inherently going to wind up with a certain degree of inaccuracy, but this 13% reduction means that no more than 13% of records have a chance of becoming inaccurate.

Alaska presents an awkward arrangement, with its 4,000-meter resolution. We're going to have to either accept less granularity for Alaska queries (this is bad) or use something like a nearest-neighbor algorithm to fake it.

Hawaii and Puerto Rico's 400-meter resolution will necessitate rather more averaging, so we'll have more edge cases there.
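One way to fake it would be a nearest-neighbor snap from the query coordinate onto whatever the native grid is, which would also work against the finer Hawaii and Puerto Rico grids; a sketch, assuming a parsed ASC header like the one above:

```python
# Sketch of a nearest-neighbor snap: map a query coordinate onto the
# nearest native grid cell, whatever that grid's resolution happens to be.

def nearest_cell(lat, lon, header):
    """Return (row, col) of the native cell whose center is nearest the
    query point. `header` is a parsed ASC header with ncols, nrows,
    xllcorner, yllcorner, and cellsize."""
    cellsize = header["cellsize"]
    col = int(round((lon - header["xllcorner"]) / cellsize - 0.5))
    row = int(round(header["nrows"] - (lat - header["yllcorner"]) / cellsize - 0.5))
    # Clamp to the grid so queries just outside the edge still resolve.
    col = max(0, min(int(header["ncols"]) - 1, col))
    row = max(0, min(int(header["nrows"]) - 1, row))
    return row, col
```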

mlissner commented 7 years ago

I'm just going to throw this out there...what about symlinks? Millions of symlinks? They're smaller than actual json files and they'd allow you to have whatever granularity you wanted with no gaps.
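A sketch of how that could work (the paths, naming, and 0.1-to-0.01-degree granularity here are all hypothetical): generate one real JSON file per coarse cell, then symlink the finer-grained names to it.

```python
import os

# Illustration of the symlink idea: one real file per 0.1-degree cell, and
# every 0.01-degree name inside it symlinked to that file, so any
# two-decimal query resolves with no gaps in the namespace.

def link_fine_to_coarse(coarse_lat, coarse_lon, out_dir="api"):
    """Point every 0.01-degree name inside a 0.1-degree cell at that cell's
    real JSON file."""
    coarse_name = f"{coarse_lat:.1f},{coarse_lon:.1f}.json"
    for dlat in range(10):
        for dlon in range(10):
            fine_lat = coarse_lat + dlat / 100.0
            fine_lon = coarse_lon + dlon / 100.0
            fine_path = os.path.join(out_dir, f"{fine_lat:.2f},{fine_lon:.2f}.json")
            if not os.path.lexists(fine_path):
                # Relative target, so the links work wherever the tree lands.
                os.symlink(coarse_name, fine_path)
```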

waldoj commented 7 years ago

Ooooh, of course! Excellent idea!