openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/

Tile log retention and aggregation #698

Open pnorman opened 2 years ago

pnorman commented 2 years ago

#219 is about log retention in general, but I want to split off the tile service because it is such a high-volume service, with 4TB/month of logs. At that volume there are technical reasons to aggregate data, aside from privacy.

| Log type | Format | Contains | Uses | Proposed retention |
|---|---|---|---|---|
| Raw CDN logs | GZIP CSV | Full information on all requests | Debugging; logs with no delay | 30 days |
| Raw render logs | GZIP Logs | Full information on cache misses | Debugging | 30 days |
| Successful request data | Parquet | Second-precision time; Tile; IP; Browser headers; Rough location and network; Response information | General log analysis | 90 days |
| Reduced precision data | Parquet | Minute-precision time; Zoom; Browser headers; Rough location and network | Analysis without tile details; Detailed usage patterns | 720 days |
| Historical data | Parquet | Hour-precision time; Tile; User-agent; IPv4/IPv6; Rough location and network | Historical analysis of usage patterns | Indefinite |
| Tile logs | txt.xz | # of times tiles accessed | Published | Indefinite |
| Referer logs | csv | Top websites accessing | Published | Indefinite |
| App logs | csv | Top apps accessing | Published | Indefinite |

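For illustration only, the proposed retention periods in the table could be expressed as a small lookup plus an expiry check. This is a minimal sketch, not the tooling actually used on the servers: the key names in `RETENTION_DAYS` and the `is_expired` helper are hypothetical, and only the day counts come from the table.

```python
from datetime import date, timedelta

# Proposed retention in days per log type, copied from the table.
# None means "keep indefinitely". The key names are hypothetical labels.
RETENTION_DAYS = {
    "raw_cdn": 30,
    "raw_render": 30,
    "successful_requests": 90,
    "reduced_precision": 720,
    "historical": None,
    "published_tile_logs": None,
    "published_referer_logs": None,
    "published_app_logs": None,
}


def is_expired(log_type: str, log_date: date, today: date) -> bool:
    """Return True if a log written on log_date is past its proposed retention."""
    days = RETENTION_DAYS[log_type]
    if days is None:
        return False  # indefinite retention never expires
    return today - log_date > timedelta(days=days)
```

A cleanup job could call something like `is_expired("raw_cdn", file_date, date.today())` for each stored file and delete those that return `True`.
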
pnorman commented 1 year ago

Raw CDN logs are now retained for 30 days and successful requests turned into parquet logs.

Some raw render server logs are retained for 30 days, but I still see others older than 30 days on the servers.

I haven't looked at the reduced precision and historical data generation yet.

pnorman commented 1 year ago

I'm now generating two reduced logs. One drops tile info and retains the other info, so a bunch of tile requests from the same user appears as one row (including how many tiles were requested). The other drops user-specific information like IP and retains tile details, so in future we would be able to generate tile usage details.

I need to back-populate these tables; then the retention periods can be adjusted.
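As a rough illustration of the two reduced logs described above, the sketch below aggregates a few made-up request rows in two ways. Everything here is an assumption for illustration: the row layout, the function names, keeping zoom in the first log, dropping the user agent along with the IP in the second, and the minute and hour buckets (borrowed from the table). The real Parquet schema and queries are not given in this issue.

```python
from collections import Counter
from datetime import datetime

# Hypothetical request rows: (timestamp, ip, user_agent, z, x, y).
# The real schema is not given in this issue.
requests = [
    (datetime(2024, 1, 1, 12, 0, 5), "192.0.2.1", "AppA/1.0", 10, 511, 340),
    (datetime(2024, 1, 1, 12, 0, 9), "192.0.2.1", "AppA/1.0", 10, 512, 340),
    (datetime(2024, 1, 1, 12, 0, 30), "198.51.100.7", "AppB/2.0", 10, 511, 340),
]


def drop_tile_info(rows):
    """Keep user info and zoom, drop x/y, count tiles per user per minute."""
    counts = Counter()
    for ts, ip, ua, z, x, y in rows:
        minute = ts.replace(second=0, microsecond=0)
        counts[(minute, ip, ua, z)] += 1
    return counts


def drop_user_info(rows):
    """Keep tile coordinates, drop IP and user agent, count requests per tile per hour."""
    counts = Counter()
    for ts, ip, ua, z, x, y in rows:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[(hour, z, x, y)] += 1
    return counts


by_user = drop_tile_info(requests)  # one row per user and minute, with a tile count
by_tile = drop_user_info(requests)  # one row per tile and hour, with a request count
```

In the first result the two requests from the same address collapse into one row with a count of 2; in the second, the two requests for the same tile from different addresses collapse into one row with no trace of who made them.
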