mojodna / osm-pds-pipelines

OSM PDS pipeline
https://quay.io/repository/mojodna/osm-pds-pipelines
ISC License

planet-latest.osm.pbf file missing #24

Closed tehnrd closed 11 months ago

tehnrd commented 5 years ago

I've noticed this does not mirror the planet-latest.osm.pbf file.

Is this by design? A bug?

Are you open to a PR so it is included? And would a PR here eventually make it into the code that populates the AWS-hosted datasets at https://registry.opendata.aws/osm/ (http://s3.amazonaws.com/osm-pds)?

mojodna commented 5 years ago

It's by design, but it potentially deserves revisiting.

Yes, code here will populate the dataset on S3. Rather than copying planet-latest.osm.pbf from planet.osm.org, it probably makes sense to copy it from the latest version if/when it was mirrored, similar to how the latest ORC versions are "linked".

tehnrd commented 5 years ago

That is what I was thinking, as there doesn't seem to be a need to copy the entire planet-latest.osm.pbf file from https://planet.osm.org/pbf/: the most recent planet-YYMMDD.osm.pbf file is the same as planet-latest.osm.pbf (at least right now, according to the MD5), though I'm not sure this is guaranteed to be the case. An MD5 check could confirm it, but that may overcomplicate things.

I can think of three approaches:

1) Super simple option: Modify the FILES_TO_MIRROR pattern and sync the entire planet-latest.osm.pbf file. While not the most efficient, I think this would be the smallest code change and would "just work", at the expense of mirroring an extra 44 GB+ file that is essentially a duplicate... but bandwidth is "cheap", as is the AWS Batch job (1 vCPU, 1 GB RAM).

2) Simple option: When syncing a new planet-YYMMDD.osm.pbf file, check whether it is the latest and, if so, perform an S3 copy to planet-latest.osm.pbf after it is mirrored.

3) More robust option: When syncing a new planet-YYMMDD.osm.pbf file that is the latest, do an MD5 check against https://planet.osm.org/pbf/planet-latest.osm.pbf.md5. If it matches, perform an S3 copy; if it differs, launch a separate batch job to pull down and sync https://planet.osm.org/pbf/planet-latest.osm.pbf.

If there is a different or better way this could be done, let me know.

I'm not sure if/when I'll have a chance to take a stab at this PR, but I would like to if I can squeeze it in. Hopefully these questions and this design discussion will help anyone else who wants to attempt a PR as well.

tehnrd commented 5 years ago

@mojodna I just submitted #25 for this. I went with option 2 as that appears to be the same method used for copying the .orc files.

This is all a bit new to me, so I'm totally open to feedback on the PR, but I would definitely like to see this bucket start to include planet-latest.osm.pbf, as it would simplify consuming scripts that need the latest planet file.

mojodna commented 5 years ago

First, apologies for the delay in getting back to you. I've been slammed and have fallen behind on a number of things.

I agree with you that option 2 is the most appropriate in this context. Option 1 is certainly cheap, but I do feel a bit weird about pulling an extra 44+ GB from outside of AWS (both for planet.osm.org's sake and due to the time it takes).

The approach in your PR makes sense, but the async library has stopped providing any readability benefits here, and we have the opportunity to modernize with async/await, so I'd be inclined to expand the scope slightly and rewrite that portion (p-map, p-limit, and friends may be helpful for that).
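As a rough illustration of the kind of rewrite being suggested: the async library's callback-driven flow can be replaced with async/await plus a small concurrency-limited map, in the spirit of p-map / p-limit. A minimal, dependency-free version of that pattern might look like this (the real code would likely just use p-map itself):

```javascript
// Map `fn` over `items` with at most `limit` invocations in flight at once,
// preserving result order. A tiny inline analogue of p-map's core behavior.
async function mapLimit(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++; // safe: no await between read and increment
      results[i] = await fn(items[i], i);
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    worker
  );
  await Promise.all(workers);
  return results;
}
```

A mirroring loop would then read as a single line, e.g. `await mapLimit(files, 4, syncFile)` (where `syncFile` is a hypothetical per-file sync function), instead of nested async-library control flow.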

tehnrd commented 5 years ago

Agreed. It took me a while to wrap my head around the async library's flow, and I wasn't thrilled with the results, but I was trying to keep changes to a minimum by using the current approach.

I saw your feedback on the PR and will make an attempt at cleaning it up with async/await.

mojodna commented 11 months ago

This now happens after some upgrades to the syncing process this week.