mojodna / osm-pds-pipelines

OSM PDS pipeline
https://quay.io/repository/mojodna/osm-pds-pipelines
ISC License
32 stars 4 forks source link

Stop storing historical PBFs #50

Open mojodna opened 3 months ago

mojodna commented 3 months ago

These are (now) already in S3 (managed by OSMF), but in a different region (eu-central-1, us-west-2) vs. osm-pds (us-east-1). As a result, I think it makes sense for this project to focus on cloud-oriented derivatives rather than being a cloud-based mirror.

This will require documentation (in RODA, here, and within the S3 bucket itself) and a gradual ramp-down of available files to minimize impact. To track/facilitate this change, we'll want log queries (+ dashboards?) showing the number of requests to historical PBFs (vs. whatever's latest).

We may also be able to work with the AWS Open Data team to reach out to customers using this bucket and point them to the OSMF-maintained one.

jeronimoluza commented 2 months ago

Hi @mojodna! I'm trying to obtain the OSM planet ORC files for years 2018-2024 from the s3://osm-pds/ bucket, but I only see the 2024/ folder when I do aws s3 ls --no-sign-request s3://osm-pds/.

Is this issue related to that? Thanks!

mojodna commented 2 months ago

Only slightly. I imagine that we'd continue to keep archival ORC (or Parquet) files for some time period. Right now, previous versions are deleted after 90 days for 2 reasons:

  1. PBFs remain available from the OSMF-maintained mirror
  2. the history ORC file can be used to derive snapshots for any point in the past. I don't the SQL off-hand to do this, but if you need it (or are just interested), I can resurrect it.
jeronimoluza commented 2 months ago

It would be of great help if I could take a look at that SQL to derive the snapshots using the planet-history, if it is not much of problem for you. Let me know if there's anything I can to help. And thanks again!

mojodna commented 2 months ago

This is how I derive snapshots from the history table:

WITH with_valid_until AS (
  SELECT *,
    -- derive a valid_until column based on the timestamp of the next update;
    -- it will be null if it's current
    LEAD(timestamp, 1) OVER (
      PARTITION BY type,
      id
      ORDER BY version ASC
    ) AS valid_until
  FROM planet_history
)
SELECT *
FROM with_valid_until
WHERE TIMESTAMP '2009-02-28 12:00:00' -- snapshot timestamp
  BETWEEN timestamp AND valid_until
jeronimoluza commented 2 months ago

Thank you very much, this is very helpful!