Closed by tehnrd 11 months ago
It's by design, but it potentially deserves revisiting.
Yes, code here will populate the dataset on S3. Rather than copying planet-latest.osm.pbf from planet.osm.org, it probably makes sense to copy it from the latest mirrored version, if/when one exists, similar to how the latest ORC versions are "linked".
That is what I was thinking: there doesn't seem to be a need to copy the entire planet-latest.osm.pbf file from https://planet.osm.org/pbf/, since the most recent planet-YYMMDD.osm.pbf file is the same as planet-latest.osm.pbf... I think. At least it is right now, according to the MD5, but I'm not sure that's guaranteed to be the case. An MD5 check could confirm it, though that may overcomplicate things.
I can think of three approaches:
1) Super Simple Option
Modify the FILES_TO_MIRROR pattern and sync the entire planet-latest.osm.pbf file. While not the most efficient, I think this would be the smallest code change and would "just work," at the expense of mirroring an extra 44 GB+ file that is essentially a duplicate... but bandwidth is "cheap," as is the AWS Batch job (1 vCPU, 1 GB RAM).
2) Simple Option
When syncing a new planet-YYMMDD.osm.pbf file, check whether it is the latest and, if so, perform an S3 copy to planet-latest.osm.pbf after it is mirrored.
3) More Robust Option
When syncing a new planet-YYMMDD.osm.pbf file, if it is the latest, do an MD5 check against https://planet.osm.org/pbf/planet-latest.osm.pbf.md5. If it matches, perform an S3 copy; if it differs, launch a separate Batch job to pull down and sync https://planet.osm.org/pbf/planet-latest.osm.pbf directly.
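The "is it the latest?" decision in options 2 and 3 can be made purely from the keys already mirrored. A sketch (hypothetical function, assuming the dated files follow the planet-YYMMDD.osm.pbf naming):

```javascript
// Given the key of a freshly mirrored planet file and the keys already
// in the bucket, decide whether it is the newest dated planet PBF.
function isLatestPlanet(key, existingKeys) {
  const datePart = (k) => {
    const m = k.match(/planet-(\d{6})\.osm\.pbf$/);
    return m == null ? null : m[1];
  };

  const candidate = datePart(key);
  if (candidate == null) return false;

  // YYMMDD strings compare correctly lexicographically, so the file is
  // the latest only if no existing dated planet file sorts after it.
  return existingKeys
    .map(datePart)
    .filter((d) => d != null)
    .every((d) => d <= candidate);
}
```

If it returns true, a server-side S3 copy (CopyObject) to planet-latest.osm.pbf avoids any re-download; the data never leaves S3.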
If there is a different or better way this could be done, let me know.
Not sure if/when I'll have a chance to take a stab at this PR but would like to if I can squeeze it in. Hopefully, these questions/design approach may help someone else that may want to try a PR as well.
@mojodna I just submitted #25 for this. I went with option 2, as that appears to be the same method used for copying the .orc files.
This is all a bit new to me, so I'm totally open to feedback on the PR, but I would definitely like to see this bucket start to include planet-latest.osm.pbf, as it would simplify consuming scripts that need the latest planet file.
First, apologies for the delay in getting back to you. I've been slammed and have fallen behind on a number of things.
I agree with you that option 2 is the most appropriate in this context. Option 1 is certainly cheap, but I do feel a bit weird about pulling an extra 44+ GB from outside of AWS (both for planet.osm.org's sake and due to the time it takes).
The approach in your PR makes sense, but the `async` library has stopped providing any readability benefits here, and we have the opportunity to modernize and use `async`/`await`, so I'd be inclined to expand the scope slightly and do a rewrite of that (`p-map`, `p-limit`, and friends may be helpful for that).
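For illustration, here is a minimal stand-in for what `p-limit` provides, showing the `async`/`await` shape the rewrite could take (this is a sketch, not the library's actual implementation):

```javascript
// Return a wrapper that runs at most `concurrency` tasks at once;
// additional tasks are queued until a slot frees up.
function limit(concurrency) {
  let active = 0;
  const queue = [];

  const next = () => {
    active--;
    if (queue.length > 0) queue.shift()();
  };

  return (task) =>
    new Promise((resolve, reject) => {
      const run = () => {
        active++;
        task().then(resolve, reject).finally(next);
      };
      active < concurrency ? run() : queue.push(run);
    });
}

// Hypothetical usage: mirror files with bounded parallelism.
// const limited = limit(4);
// await Promise.all(files.map((f) => limited(() => mirror(f))));
```
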
Agreed. It took me a while to wrap my head around the async library flow, and I wasn't thrilled with the results, but I was trying to keep changes to a minimum using the current approach. I saw your feedback on the PR and will make an attempt at cleaning it up with async/await.
This now happens after some upgrades to the syncing process this week.
I've noticed this does not mirror the planet-latest.osm.pbf file. Is this by design? A bug?
Would you be open to a PR so it is included? And would a PR here eventually make it into the code that populates the AWS-hosted datasets at https://registry.opendata.aws/osm/ (http://s3.amazonaws.com/osm-pds)?