Closed pnorman closed 1 year ago
Sponsorship accept and billing linked.
Plan is for eu-west-1 (Dublin) as primary region because it is near our locations, green, and lower priced so has more capacity. us-west-2 is secondary region to match other datasets in AWS and provide a closer option for processing on the west coast and in Asia.
Item | Path |
---|---|
Minutely diffs | /replication/minute/ |
Hourly diffs | /replication/hour/ |
Daily diffs | /replication/day/ |
Changeset minutely diffs | /replication/changesets/ |
Deleted users list | /users_deleted/users_deleted.txt |
Tile request logs | /tile_logs/tiles-YYYY-MM-DD.txt.xz |
Tile app usage | /tile_logs/apps-YYYY-MM-DD.csv |
Tile host usage | /tile_logs/hosts-YYYY-MM-DD.csv |
Data statistics | /statistics/data_stats.html |
Notes | /notes/YYYY/planet-notes-YYMMDD.osn.bz2 , with -latest version and md5 |
PBF RSS feed | /pbf/planet-pbf-rss.xml |
PBF | /pbf/planet-YYMMDD.osm.pbf , with -latest version, md5, and torrent |
PBF History RSS feed | /pbf/full-history/history-pbf-rss.xml |
PBF History | /pbf/full-history/history-YYMMDD.osm.pbf , with -latest version, md5, and torrent |
OSM BZIP RSS feed | /planet/planet-bz2-rss.xml |
OSM BZIP | /planet/YYYY/planet-YYMMDD.osm.bz2 , with -latest, md5, and torrent |
Discussions RSS feed | /planet/discussions-bz2-rss.xml |
Discussions | /planet/YYYY/discussions-YYMMDD.osm.bz2 , with -latest, md5, and torrent |
Changesets RSS feed | /planet/changesets-bz2-rss.xml |
Changesets | /planet/YYYY/changesets-YYMMDD.osm.bz2 , with -latest, md5, and torrent |
History OSM RSS feed | /planet/history/changesets-bz2-rss.xml |
History OSM BZIP | /planet/history/YYYY/history-YYMMDD.osm.bz2 , with -latest, md5, and torrent |
Users agreed lists | /users_agreed/ , three lists |
Item | Path |
---|---|
Experimental OSM BZIP | /planet/experimental/planet-YYYY-MM-DD.osm.bz2 , with md5 |
Experimental Changesets | /planet/experimental/changesets-YYYY-MM-DD.osm.bz2 , with md5 |
History OSM BZIP | /planet/experimental/history-YYYY-MM-DD.osm.bz2 , with md5 |
CC BY-SA | /cc-by-sa/ includes planet, replication, history, and changesets like above |
Redaction period minutely diffs | /redaction-period/minute-replicate/ |
Redaction period hourly diffs | /redaction-period/hour-replicate/ |
Redaction period daily diffs | /redaction-period/day-replicate/ |
GPS points | /gps/simple-gps-points-YYMMDD.txt.xz , also CSV format |
GPX dump | /gps/gpx-planet-YYYY-MM-DD.tar.xz |
coastcheck polygons | /historical-shapefiles/processed_p.tar.bz2 |
coastcheck lines | /historical-shapefiles/shoreline_300.tar.bz2 |
world_boundaries | /historical-shapefiles/world_boundaries-spherical.tgz |
Some of these items may never happen, but it's worth thinking about where they would fit into a new layout.
Item | Source |
---|---|
Downloads that require authentication | Existing planet-dump-ng and osmdbt |
Notes replication | Unknown |
Hourly/daily changeset diffs | Existing changesets |
GPX dump | Unknown |
GPS points dump | Unknown |
To avoid any misinformation: the planet hosting discussed here is sponsored by the AWS public data program and will cost us $0/month in AWS expenses. We have already been accepted onto the program and the billing is on AWS's tab, not ours.
A proposal for the S3 bucket prefix "folders"
Frequency | Product | Flavour | AWS Bucket Prefix |
---|---|---|---|
minutely | replication | minutely | planet/replication/minute/ |
hourly | replication | hourly | planet/replication/hour/ |
daily | replication | daily | planet/replication/day/ |
weekly | planet | pbf | planet/pbf/YYYY/ |
weekly | planet | osm | planet/osm/YYYY/ |
weekly | planet-full-history | pbf | planet-full-history/pbf/YYYY/ |
weekly | planet-full-history | osm | planet-full-history/osm/YYYY/ |
minutely | replication-changesets | minutely | changesets/replication/minute/ |
weekly | changesets | osm | changesets/osm/YYYY/ |
weekly | discussions | osm | discussions/osm/YYYY/ |
daily | notes | osn | notes/osn/YYYY/ |
daily | tile_logs | hosts | tile_logs/standard_layer/hosts/YYYY/ |
daily | tile_logs | countries | tile_logs/standard_layer/countries/YYYY/ |
daily | tile_logs | apps | tile_logs/standard_layer/apps/YYYY/ |
daily | tile_logs | tiles | tile_logs/standard_layer/tiles/YYYY/ |
daily | statistics | data_stats.html | ? |
daily | users_deleted | txt | ? |
Proposal: https://hackmd.io/3JkGPWwIQE2ClLBN8zPApQ
Keep year. Files in ISO date.
> A few questions:
>
> 1. Which region(s)? We talked last year about using eu-central-1 (Frankfurt) and us-west-2 (Oregon, via native S3 replication) for sustainability and co-location with other geospatial datasets reasons.
As per above:

> Plan is for eu-west-1 (Dublin) as primary region because it is near our locations, green, and lower priced so has more capacity

The pricing still encourages Dublin over Frankfurt, and they're both 100% renewable.
> Should the region be included in the bucket name?

This is generally a bad practice since it allows someone to claim the name of the bucket for another region, but it might make sense here.
> 2. Would it make sense to create multiple SNS topics (with more/less/different noise) to serve different purposes?

SNS is phase 2, so currently unplanned.
> 3. What happened to the PBF version of the history dump? I only see `.osm.bz2` in the tree.

It's present in the table, in the first planet-full-history row.
> 4. For redirects, would it make sense to include the current file name in the body of the object (separate from the metadata), either as text or JSON to help S3 users follow redirects?

The only redirects are going to be HTTP redirects from planet.osm.org, nothing within S3. I'm not sure what you mean - were you assuming doing some redirects within S3? A user will go to https://planet.openstreetmap.org/planet/pbf/YYYY/planet-YYYY-MM-DD.osm.pbf, get an HTTP 3xx redirect to https://name-not-yet-set.s3.eu-west-1.amazonaws.com/planet/pbf/YYYY/planet-YYYY-MM-DD.osm.pbf, and download from there. If a user wants to use the S3 API (e.g. through the AWS CLI) instead of a normal HTTP download, they'll have to manage that themselves.
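The redirect described above is a pure path rewrite; a minimal sketch is below. The bucket host reuses the `name-not-yet-set` placeholder from the comment (the real hostname was undecided at this point), so it is illustrative only:

```shell
#!/bin/sh
# Sketch of the planet.osm.org -> S3 redirect target mapping.
# "name-not-yet-set" is a placeholder, not a real bucket.
to_s3_url() {
  printf 'https://name-not-yet-set.s3.eu-west-1.amazonaws.com%s\n' "$1"
}

to_s3_url /planet/pbf/2023/planet-2023-01-01.osm.pbf
```

A client preferring the S3 API would instead address the same key via the AWS CLI, which it must configure itself.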
> 5. What storage option are you planning to default to? Intelligent-Tiering is probably the right choice.

Agreed, Intelligent-Tiering is appropriate. A few ms of latency doesn't matter on a multi-GB file, or even a replication diff.
Reading the S3 docs for these responses made me aware of a couple of other issues. example.planet.openstreetmap.org and we'd be unable to get example.planet.osm.org to work the same way, but it has the advantage of allowing geodns so we could direct users at the closest bucket.

> A few questions:
> - Which region(s)? We talked last year about using eu-central-1 (Frankfurt) and us-west-2 (Oregon, via native S3 replication) for sustainability and co-location with other geospatial datasets reasons. Should the region be included in the bucket name?
Bucket: osm-planet-eu-central-1
Region: eu-central-1 (primary)
Logging: osm-planet-logs-eu-central-1

Bucket: osm-planet-us-west-2
Region: us-west-2 (secondary, S3-replicated)
Logging: osm-planet-logs-us-west-2
> - Would it make sense to create multiple SNS topics (with more/less/different noise) to serve different purposes?
Yes, it would make sense. I have not yet finalised on the SNS configuration.
> - What happened to the PBF version of the history dump? I only see `.osm.bz2` in the tree.
See above; it will be moved to the prefix planet-full-history/pbf/YYYY/
> - For redirects, would it make sense to include the current file name in the body of the object (separate from the metadata), either as text or JSON to help S3 users follow redirects?
In addition to Paul's answer: I am thinking of adding a latest.txt which contains the full prefix + object name.
> - What storage option are you planning to default to? Intelligent-Tiering is probably the right choice.
Default to Intelligent-Tiering, with a lifecycle rule to move objects mistakenly uploaded as standard to Intelligent-Tiering.
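As a sketch, such a lifecycle rule could look like the following S3 lifecycle configuration. The rule ID and the 0-day transition are assumptions; the real rule lives in the private terraform repo:

```json
{
  "Rules": [
    {
      "ID": "standard-to-intelligent-tiering",
      "Status": "Enabled",
      "Filter": {},
      "Transitions": [
        { "Days": 0, "StorageClass": "INTELLIGENT_TIERING" }
      ]
    }
  ]
}
```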
> The only redirects are going to be HTTP redirects from planet.osm.org, nothing within S3. I'm not sure what you mean - were you assuming doing some redirects within S3? A user will go to https://planet.openstreetmap.org/planet/pbf/YYYY/planet-YYYY-MM-DD.osm.pbf, get a HTTP 3xx redirect to https://name-not-yet-set.s3.eu-west-1.amazonaws.com/planet/pbf/YYYY/planet-YYYY-MM-DD.osm.pbf and download from there. If a user wants to use the s3 API (e.g. through the AWS CLI) instead of a normal HTTP download, they'll have to manage that themselves.
I was acknowledging that S3 doesn't support redirects / symlinks and pondering some ways that we could facilitate S3 downloads of "latest".
> Additionally to Paul's answer. I am thinking of adding a latest.txt which contains the full prefix + object name.
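The latest.txt idea can be simulated locally; the file name, bucket, and key below are assumptions, not the final layout:

```shell
#!/bin/sh
# Simulate the proposed latest.txt: it would hold the full prefix +
# object name of the newest dump. Bucket and key are assumptions.
printf 'planet/pbf/2023/planet-2023-09-16.osm.pbf\n' > latest.txt
key=$(cat latest.txt)
s3_uri="s3://osm-planet-eu-central-1/${key}"
echo "$s3_uri"
# An S3-API client would then fetch it, e.g.:
#   aws s3 cp --no-sign-request "$s3_uri" .
```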
This is better than my proposal ;-)
Spitballing here... Re: buckets + naming, because of the constraint that the bucket name needs to match the CNAME, it may make sense to have a planet.osm.org (and/or planet.openstreetmap.org that serves to redirect to the one that contains HTML) in addition to region-specific buckets. The planet bucket would contain static HTML + use some sort of client-side determination to figure out which region-specific bucket is closer to provide download links that go direct to the region-specific buckets. That way, we get the benefit of regionalization (for "closeness") and fully-managed website hosting (which can also register redirects to one of the buckets, preserving URLs that work today).
> Spitballing here... Re: buckets + naming, because of the constraint that the bucket name needs to match the CNAME, it may make sense to have a planet.osm.org (and/or planet.openstreetmap.org that serves to redirect to the one that contains HTML) in addition to region-specific buckets. The planet bucket would contain static HTML + use some sort of client-side determination to figure out which region-specific bucket is closer to provide download links that go direct to the region-specific buckets. That way, we get the benefit of regionalization (for "closeness") and fully-managed website hosting (which can also register redirects to one of the buckets, preserving URLs that work today).
Yes, my intention is to keep planet.osm.org (and aliases) up and running for compatibility and to handle some legacy redirects. I'll likely gradually move it across to a thin, near-stateless wrapper. I have been looking into https://github.com/rufuspollock/s3-bucket-listing and https://github.com/qoomon/aws-s3-bucket-browser, which are JS-driven S3 "browsers" that pull live data via the S3 API and provide a frontend to the buckets.
S3 Buckets have been created. S3 Cross Region Replication has been setup. S3 Logging has been setup.
All managed via code in OSM's private terraform aws repo.
I have started to backfill existing planet data. It will take a while.
For tile_logs I'd like to subdivide it by service, where the only current service is the standard tile layer.
I have created the upload user, roles and permissions.
I have started uploading replication diffs.
I have started uploading 2022 and 2023 osm planet files.
> For tile_logs I'd like to subdivide it by service, where the only current service is the standard tile layer.

Is standard_layer OK, or would you prefer another name?
> For tile_logs I'd like to subdivide it by service, where the only current service is the standard tile layer.
>
> Is standard_layer OK, or would you prefer another name?
looks good to me
@Firefishy what does the S3 bucket policy currently look like? It appears to be rejecting signed requests (while accepting unsigned ones):
```
❯ aws s3 cp --no-sign-request s3://osm-planet-eu-central-1/planet/replication/minute/state.txt -
#Tue Sep 19 14:51:01 UTC 2023
sequenceNumber=5754996
timestamp=2023-09-19T14\:50\:59Z

❯ aws s3 cp s3://osm-planet-eu-central-1/planet/replication/minute/state.txt -
download failed: s3://osm-planet-eu-central-1/planet/replication/minute/state.txt to - An error occurred (400) when calling the HeadObject operation: Bad Request
```
It should look something like this:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::osm-planet-<region>"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": [
        "s3:GetObject",
        "s3:GetObjectAttributes",
        "s3:GetObjectTagging"
      ],
      "Resource": "arn:aws:s3:::osm-planet-<region>/*"
    }
  ]
}
```
Additional bucket policies added in https://github.com/openstreetmap/terraform-aws/commit/d815e7fed334471e645477fba3feacf10b6cf0ec (private repo) @mojodna I've sent you an invite to access the repo a few minutes back.
> ```
> ❯ aws s3 cp --no-sign-request s3://osm-planet-eu-central-1/planet/replication/minute/state.txt -
> #Tue Sep 19 14:51:01 UTC 2023
> sequenceNumber=5754996
> timestamp=2023-09-19T14\:50\:59Z
> ❯ aws s3 cp s3://osm-planet-eu-central-1/planet/replication/minute/state.txt -
> download failed: s3://osm-planet-eu-central-1/planet/replication/minute/state.txt to - An error occurred (400) when calling the HeadObject operation: Bad Request
> ```
Both these above now work.
Note: `aws s3 cp` only works if there is an AWS default profile set up; otherwise the CLI bombs out with `Unable to locate credentials` when it fails to retrieve auth details from the instance metadata service.
Turns out I was using a bad profile resulting in invalid credentials, but thanks for adding those; `GetObjectTagging` in particular isn't an obvious permission (while causing issues in certain circumstances).
Initial pass sync is done.
> For tile_logs I'd like to subdivide it by service, where the only current service is the standard tile layer.
>
> Is standard_layer OK, or would you prefer another name?
>
> looks good to me
I've updated the table in https://github.com/openstreetmap/operations/issues/678#issuecomment-1710434488 with the update.
The initial sync script: https://gist.github.com/Firefishy/6de913da9b22363da5683e8695c7396a It has now synced all data.
As an initial test I have set all standard 2023 planet.osm files to redirect to the S3 osm-planet-eu-central-1 bucket instead of redirecting to any other mirrors. I will check logs later to see how it is going.
Are any of the Planet.osm mirrors still using rsync to keep files in sync with planet.osm.org? Would that cause any issues with the new S3 based setup?
> Are any of the Planet.osm mirrors still using rsync to keep files in sync with planet.osm.org? Would that cause any issues with the new S3 based setup?
The intention is to keep the rsync service available for now. We will reduce the amount of historical data we store on planet.openstreetmap.org, which will limit what is available for rsync. If/when we decide to retire the rsync service, plenty of notice will be given.
HTTPS-based planet.openstreetmap.org will have compatibility redirects between the old layout and the new layout. I am trying to work out how smart I can be with redirecting to AWS. E.g. ideally I'd like to at least send North America requests to the AWS ~~us-east-2~~ us-west-2 bucket and EU requests to the eu-central-1 bucket; the download speed difference is noticeable.
The observed sync latency between eu-central-1 and ~~us-east-2~~ us-west-2 was up to 12 minutes when pushing the new weekly .osm.bz2 and .pbf planet and full-history planet files. AWS's recommended alerting level is 15 minutes. The sync is non-sequential. Maybe due to this we should ONLY redirect to ~~us-east-2~~ us-west-2 for files other than replication diffs.
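One way to express "redirect everything except replication diffs" is sketched below as Apache mod_rewrite rules, on the assumption that planet.openstreetmap.org keeps serving via Apache; the paths, bucket hostnames, and rule details are illustrative, not the rules actually deployed:

```apache
# Keep replication diffs on the primary bucket: cross-region sync can lag
# up to ~12 minutes, so the secondary bucket may serve stale diffs.
RewriteRule ^/replication/(.*)$ https://osm-planet-eu-central-1.s3.eu-central-1.amazonaws.com/planet/replication/$1 [R=302,L]
# Dated planet files are safe to serve from the closer, replicated bucket.
RewriteRule ^/pbf/([0-9]{4})/(.*)$ https://osm-planet-us-west-2.s3.us-west-2.amazonaws.com/planet/pbf/$1/$2 [R=302,L]
```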
The long term goal would be for planet.openstreetmap.org to become a thin wrapper over the S3 buckets.
> The observed sync latency between eu-central-1 and us-east-2 was up to 12 minutes when pushing the new weekly .osm.bz2 and .pbf planet and full-history planet files. AWS's recommended alerting level is 15mins. The sync is non-sequential. Maybe due to this we should ONLY redirect to us-east-2 for files other than replication diffs.
Based on the above, I'm not sure we should redirect at all. We would need to
> We will reduce the amount of historical data we store on planet.openstreetmap.org, which will limit what is available for rsync.
Since most mirrors seem to delete their local copies as soon we start cleaning up on planet.osm.org, it's getting more difficult to download older dumps. HTTPS based redirects to AWS would be most welcomed to avoid this issue for end users. However, I don't think this will solve the issue for mirrors. Maybe they need to review their retention policy, and keep older dumps around for longer.
I've noticed this issue while looking for the very first ODbL planet "planet-120912.osm.bz2". The torrent link didn't help, and there's only a single mail.ru mirror out there that still has this file.
> I've noticed this issue while looking for the very first ODbL planet "planet-120912.osm.bz2". The torrent link didn't help, and there's only a single mail.ru mirror out there that still has this file.
I have planet-120912.osm.bz2 archived in a deep-archive S3 bucket. I can restore it and upload it to the new S3 planet buckets. I have started the restore. Putting back the entire catalogue will be a very slow process; S3 deep archive stores are expensive and slow to restore.
> Since most mirrors seem to delete their local copies as soon we start cleaning up on planet.osm.org, it's getting more difficult to download older dumps.
I've run into this, too, lately. How about a policy where the first dump of every year remains easily accessible?
That's exactly what the expiry script does. In fact it goes further than that: it keeps the first dump of each month as well as the last four weeks' worth, and since it only runs once a month, between four and eight weeks' worth are kept, plus the first dump of each month before that.
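The keep/expire decision can be sketched as a filename test. This only covers the first-of-month rule, assumes the old planet-YYMMDD naming, and does not reproduce the real script's recent-weeks window:

```shell
#!/bin/sh
# Sketch of the first-of-month part of the expiry rule described above.
# Assumes planet-YYMMDD.osm.* naming; the recent-weeks window and the
# real expiry script's exact behaviour are not reproduced here.
is_first_of_month() {
  case "$1" in
    *-[0-9][0-9][0-9][0-9]01.osm.*) return 0 ;;
    *) return 1 ;;
  esac
}

is_first_of_month planet-120901.osm.bz2 && echo "keep"
is_first_of_month planet-120912.osm.bz2 || echo "expiry candidate"
```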
planet.openstreetmap.org redirects are in PR https://github.com/openstreetmap/chef/pull/624. Replication diffs and a few other file types are not yet redirected. Only previous years are currently redirected to us-west-2.
Planet redirects to S3 are live.
I am going to close this ticket, the few remaining items are best dealt with as separate tickets.
@mmd-osm planet-120912.osm.bz2 is now available.
With #660 we plan to move the planet hosting to S3. This will allow us to retain more planet files, improve download speeds, and reduce hardware requirements for hosting the planet.
The plan as discussed last meeting is to keep the planet.openstreetmap.org interface as-is, but have the planet URLs redirect to the HTTPS URL for s3. Because we won't be paying for outgoing bandwidth as part of #660, we don't need to use signed URLs or anything to enforce people go through planet.osm.org.
Phase 1
Re-organise planet layout
Phase 2
We are not planning on reproducing some of the functionality of @mojodna's osm-pds setup, such as ORC files.
Special considerations
`<osmChange/>` element