m-lab / mlab-vis-pipeline

M-Lab Visualization Dataflow pipelines for transforming ndt.all into the needed aggregation tables in bigtable.

All old versions of every data point remain in bigtable #43

Open pboothe opened 7 years ago

pboothe commented 7 years ago

All old versions of every data point remain in Bigtable. To see this, run:

cbt --project mlab-oti --instance mlab-data-viz-prod read client_asn_by_day_hour count=1

The output will look like this:

----------------------------------------
AS10      |2010-07-12|17        
  data:count                               @ 2017/04/11-15:04:01.015000
    "3"
  data:count                               @ 2017/03/22-17:28:32.574000
    "3"
  data:count                               @ 2017/03/21-06:20:34.701000
    "3"
  data:count                               @ 2017/03/08-02:08:45.414000
    "3"
  data:count                               @ 2017/03/06-20:20:39.653000
    "3"
  data:upload_speed_mbps_median            @ 2017/04/11-15:04:01.015000
    "?\xad\\:\x8a\xd90\x96"
  data:upload_speed_mbps_median            @ 2017/03/22-17:28:32.574000
    "?\xad\\:\x8a\xd90\x96"
  data:upload_speed_mbps_median            @ 2017/03/21-06:20:34.701000
    "?\xad\\:\x8a\xd90\x96"
  data:upload_speed_mbps_median            @ 2017/03/08-02:08:45.414000
    "?\xad\\:\x8a\xd90\x96"
  data:upload_speed_mbps_median            @ 2017/03/06-20:20:39.653000
    "?\xad\\:\x8a\xd90\x96"
  meta:client_asn_name                     @ 2017/04/11-15:04:01.015000
    "Coordination and Information Center (CSNET-CIC)"
  meta:client_asn_name                     @ 2017/03/22-17:28:32.574000
    "Coordination and Information Center (CSNET-CIC)"
  meta:client_asn_name                     @ 2017/03/21-06:20:34.701000
    "Coordination and Information Center (CSNET-CIC)"
  meta:client_asn_name                     @ 2017/03/08-02:08:45.414000
    "Coordination and Information Center (CSNET-CIC)"
  meta:client_asn_name                     @ 2017/03/06-20:20:39.653000
    "Coordination and Information Center (CSNET-CIC)"
  meta:client_asn_number                   @ 2017/04/11-15:04:01.015000
    "AS10"
  meta:client_asn_number                   @ 2017/03/22-17:28:32.574000
    "AS10"
  meta:client_asn_number                   @ 2017/03/21-06:20:34.701000
    "AS10"
  meta:client_asn_number                   @ 2017/03/08-02:08:45.414000
    "AS10"
  meta:client_asn_number                   @ 2017/03/06-20:20:39.653000
    "AS10"
  meta:date                                @ 2017/03/06-20:20:39.653000
    "2010-07-12"
  meta:date                                @ 2017/03/22-17:28:32.574000
    "2010-07-12"
  meta:date                                @ 2017/03/21-06:20:34.701000
    "2010-07-12"
  meta:date                                @ 2017/03/08-02:08:45.414000
    "2010-07-12"
  meta:date                                @ 2017/04/11-15:04:01.015000
    "2010-07-12"
  meta:hour                                @ 2017/04/11-15:04:01.015000
    "17"
  meta:hour                                @ 2017/03/22-17:28:32.574000
    "17"
  meta:hour                                @ 2017/03/21-06:20:34.701000
    "17"
  meta:hour                                @ 2017/03/08-02:08:45.414000
    "17"
  meta:hour                                @ 2017/03/06-20:20:39.653000
    "17"

This indicates that we keep one saved copy of each data point for every successful run of the data pipeline. It is good to have one backup, prudent to have two, acceptable to have three, and silly to have more. Even the Dutch use at most three layers of dikes to prevent their entire country from flooding.
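To make the behaviour concrete, here is a minimal sketch of how timestamped cell versions pile up. It is a toy in-memory model, not the Bigtable client API or the pipeline's own code: the `VersionedTable` class and its methods are hypothetical names, and the row key is simplified from the padded key in the output above.

```python
from collections import defaultdict


class VersionedTable:
    """Toy model (hypothetical): each (row, column) keeps every timestamped cell."""

    def __init__(self):
        self._cells = defaultdict(list)

    def write(self, row, column, value, timestamp):
        cells = self._cells[(row, column)]
        # Only a write with the *same* timestamp replaces an existing cell.
        cells[:] = [cell for cell in cells if cell[0] != timestamp]
        # A write with a new timestamp is stored next to the older cells.
        cells.append((timestamp, value))
        # Newest first, matching the order cbt prints versions in.
        cells.sort(reverse=True)

    def read(self, row, column):
        return list(self._cells[(row, column)])


table = VersionedTable()
# Three pipeline runs write the same value to the same row and column.
for run_timestamp in (
    "2017/03/06-20:20:39.653000",
    "2017/03/08-02:08:45.414000",
    "2017/03/21-06:20:34.701000",
):
    table.write("AS10|2010-07-12|17", "data:count", "3", run_timestamp)

versions = table.read("AS10|2010-07-12|17", "data:count")
```

In this model, `versions` ends up holding one cell per run, which is the pattern visible in the `cbt read` output: identical values repeated under different write timestamps.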

iros commented 7 years ago

Ah! That's excellent to know. We had assumed that a duplicate key would just overwrite the previous value. We will need to investigate how to set an expiry date or something similar.
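For reference, Bigtable attaches garbage-collection policies to column families, and the two common rules are a maximum number of versions and a maximum cell age. The sketch below only models those two rules on the `data:count` cells from the output above. The function names are hypothetical, and the code does not call Bigtable. As far as I know, the matching cbt subcommand is `setgcpolicy` with `maxversions=` or `maxage=`, and the cleanup happens in the background rather than at write time, so both points are worth checking against the current documentation before relying on them.

```python
from datetime import datetime, timedelta

# Timestamp layout as printed by cbt in the output above.
TS_FORMAT = "%Y/%m/%d-%H:%M:%S.%f"


def keep_newest(cells, max_versions):
    """Model of a max-versions rule: keep only the newest cells (hypothetical helper)."""
    if max_versions < 1:
        raise ValueError("max_versions must be at least 1")
    newest_first = sorted(
        cells,
        key=lambda cell: datetime.strptime(cell[0], TS_FORMAT),
        reverse=True,
    )
    return newest_first[:max_versions]


def drop_older_than(cells, now, max_age):
    """Model of a max-age rule: keep cells written within max_age of now (hypothetical helper)."""
    cutoff = now - max_age
    return [
        cell for cell in cells
        if datetime.strptime(cell[0], TS_FORMAT) >= cutoff
    ]


# The five versions of data:count shown in the issue.
count_cells = [
    ("2017/04/11-15:04:01.015000", "3"),
    ("2017/03/22-17:28:32.574000", "3"),
    ("2017/03/21-06:20:34.701000", "3"),
    ("2017/03/08-02:08:45.414000", "3"),
    ("2017/03/06-20:20:39.653000", "3"),
]

latest_only = keep_newest(count_cells, 1)
# "now" is an assumed reference date chosen for illustration.
recent = drop_older_than(count_cells, datetime(2017, 4, 12), timedelta(days=30))
```

A version limit fits this case better than an age limit: the pipeline rewrites every row on each run, so keeping only the newest cell preserves the latest result no matter how long ago the last run was.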