Open ajfriend opened 3 months ago
It would be nice to have a collection of datasets using H3 that folks can use for examples or that are just generally useful.
This seems like it could be helpful as a reference dataset.
Some ideas:
Another set that comes to mind is the various US census geometries (essentially, anything in the TIGER dataset).
https://geodatasets.readthedocs.io/en/latest/introduction.html is a Python package that does something similar, but for general geographic datasets.
Aside from what examples we want, I think we'd also need to decide:
* what data format we'd use, or if we'd use multiple
I think it would make sense to have multiple formats: some users might want a simple text-based format like CSV or JSON, while others may prefer an efficient binary format like Parquet (with cells stored as uint64).
* how we store the examples---in the repo, or point to external hosting
Considering the format duplication, the fact that the text files can be very large, and the relatively independent maintenance concerns, I recommend hosting them outside of the repo. I believe we already do that in master for country geometries used in testing.
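To make the format question concrete, here is a minimal sketch of the same cells written as hex strings for the text formats and as unsigned 64-bit integers for a binary format. The cell IDs are illustrative values, the helper names are hypothetical, and the packed `struct` buffer is only a stand-in for a Parquet uint64 column (writing real Parquet would need a library such as pyarrow).

```python
import csv
import io
import json
import struct

# Illustrative H3 cell IDs in their usual hex-string form.
cells_hex = ["8928308280fffff", "8928308280bffff"]


def hex_to_uint64(cell: str) -> int:
    """Parse a hex-string cell ID into an integer that fits in uint64."""
    value = int(cell, 16)
    if not 0 <= value < 2**64:
        raise ValueError(f"{cell!r} does not fit in 64 bits")
    return value


def uint64_to_hex(value: int) -> str:
    """Render an integer cell ID back to its lowercase hex-string form."""
    return format(value, "x")


cells_int = [hex_to_uint64(c) for c in cells_hex]

# Text format 1: CSV with a single "cell" column of hex strings.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["cell"])
writer.writerows([c] for c in cells_hex)
csv_text = buf.getvalue()

# Text format 2: JSON, again with hex strings, since many JSON
# consumers cannot represent 64-bit integers exactly.
json_text = json.dumps({"cells": cells_hex})

# Binary stand-in: little-endian uint64 values, 8 bytes per cell.
packed = struct.pack(f"<{len(cells_int)}Q", *cells_int)
```

The text formats stay human-readable, while the integer form uses a fixed 8 bytes per cell instead of 15 characters.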
> I think it would make sense to have multiple formats: some users might want a simple text-based format like CSV or JSON, while others may prefer an efficient binary format like Parquet (with cells stored as uint64).

Agreed.

> Considering the format duplication, the fact that the text files can be very large, and the relatively independent maintenance concerns, I recommend hosting them outside of the repo. I believe we already do that in master for country geometries used in testing.

Yes, I definitely agree we should host these through a separate repo (maybe something like h3datasets?). What I was really wondering is whether that repo should host the raw data itself or point to some other storage location. The geodatasets package uses the latter strategy. With the former strategy, I was curious whether we might run into GitHub file and repo size limits (the repo we point to here comes in at 17GB). Maybe we can start with the in-repo approach and pivot to external hosting if necessary. If we do end up needing external storage, any ideas on what services we might use?
Ah, I see. The two options I'd suggest are S3 and Cloudflare R2. R2 is cheaper and more modern (which, incidentally, can cause issues if you happen to use HTTP-only software, as it enforces SSL). In the meantime, keeping the data in the repo seems like an OK place to start.
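If the repo does end up pointing at external storage, one possible shape is a small registry that maps dataset names to URLs and checksums, with files downloaded into a local cache on first use. The sketch below uses only the standard library. The registry entry, URL, cache location, and `fetch` helper are all hypothetical, not an existing API.

```python
import hashlib
import urllib.request
from pathlib import Path

# Hypothetical registry: dataset name -> download URL and expected SHA-256.
REGISTRY = {
    "example_cells.csv": {
        "url": "https://example.com/h3datasets/example_cells.csv",
        "sha256": "<expected hex digest goes here>",
    },
}

# Assumed default cache location; any writable directory would do.
DEFAULT_CACHE = Path.home() / ".cache" / "h3datasets"


def fetch(name, registry=REGISTRY, cache_dir=DEFAULT_CACHE) -> Path:
    """Return a local path to the dataset, downloading it if not cached."""
    entry = registry[name]
    cache_dir = Path(cache_dir)
    target = cache_dir / name

    if target.exists():
        data = target.read_bytes()
    else:
        with urllib.request.urlopen(entry["url"]) as response:
            data = response.read()

    # Verify the bytes before trusting (or caching) them.
    digest = hashlib.sha256(data).hexdigest()
    if digest != entry["sha256"]:
        raise ValueError(f"checksum mismatch for {name!r}")

    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    return target
```

Because only the registry lives in the repo, the storage backend (S3, R2, or something else) could change later by updating URLs, without touching user code.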