openaq / openaq-ingestor

1 stars 1 forks source link

Multiple providers for one source leads to issues with data exports #8

Open caparker opened 1 year ago

caparker commented 1 year ago

Because of the way that we are linking locations (nodes) to providers and then using that information in the data export paths we are seeing situations where a location is flipping back and forth between different 'folders' in our export file structure.

A good example of this can be seen in location 8047. This location is part of the AirNow/StateAir networks and is ingested (historically) under both. Because of the way that data is updated for realtime ingestion we are seeing the provider move back and forth between AirNow and StateAir, so, for a given month, this is what the exports look like

                                               key                                                |    day     |          exported_on          
-------------------------------------------------------------------------------------------------------+------------+-------------------------------
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220101   | 2022-01-01 | 2022-06-06 05:53:14.690365+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220102   | 2022-01-02 | 2022-06-06 05:56:20.485503+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220103   | 2022-01-03 | 2022-06-06 05:57:45.054661+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220104   | 2022-01-04 | 2022-06-06 05:58:09.662885+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220105   | 2022-01-05 | 2022-06-06 06:01:43.863835+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220106   | 2022-01-06 | 2022-06-06 06:04:19.938678+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220107   | 2022-01-07 | 2022-06-06 06:01:54.443595+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220108   | 2022-01-08 | 2022-06-06 06:05:58.245287+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220109   | 2022-01-09 | 2022-06-06 06:08:31.571867+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220110   | 2022-01-10 | 2022-06-06 06:12:29.719793+00
 records/csv.gz/provider=stateair/country=sd/locationid=8047/year=2022/month=01/location-8047-20220111 | 2022-01-11 | 2022-06-06 06:16:20.613882+00
 records/csv.gz/provider=stateair/country=sd/locationid=8047/year=2022/month=01/location-8047-20220112 | 2022-01-12 | 2022-06-06 06:16:39.314988+00
 records/csv.gz/provider=stateair/country=sd/locationid=8047/year=2022/month=01/location-8047-20220113 | 2022-01-13 | 2022-06-06 06:17:30.131499+00
 records/csv.gz/provider=stateair/country=sd/locationid=8047/year=2022/month=01/location-8047-20220114 | 2022-01-14 | 2022-06-06 06:19:12.926728+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220115   | 2022-01-15 | 2022-06-06 06:21:51.057098+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220116   | 2022-01-16 | 2022-06-06 06:24:40.222024+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220117   | 2022-01-17 | 2022-06-06 06:24:01.474895+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220118   | 2022-01-18 | 2022-06-06 06:26:04.872786+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220119   | 2022-01-19 | 2022-06-06 06:29:09.344685+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220120   | 2022-01-20 | 2022-06-06 06:27:48.974003+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220121   | 2022-01-21 | 2022-06-06 06:30:50.746665+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220122   | 2022-01-22 | 2022-06-06 06:33:38.278047+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220123   | 2022-01-23 | 2022-06-06 06:32:24.575535+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220124   | 2022-01-24 | 2022-06-06 06:35:35.183232+00
 records/csv.gz/provider=stateair/country=sd/locationid=8047/year=2022/month=01/location-8047-20220125 | 2022-01-25 | 2022-05-26 23:53:19.242192+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220126   | 2022-01-26 | 2022-06-06 06:40:01.519439+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220127   | 2022-01-27 | 2022-06-06 06:39:21.093998+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220128   | 2022-01-28 | 2022-06-06 06:41:35.865619+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220129   | 2022-01-29 | 2022-06-06 06:44:53.979586+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220130   | 2022-01-30 | 2022-06-06 06:43:13.894873+00
 records/csv.gz/provider=airnow/country=su/locationid=8047/year=2022/month=01/location-8047-20220131   | 2022-01-31 | 2022-06-06 06:46:20.233677+00

We can see this in the sensor_nodes_history table. Here is a summary of the last 30 days which shows how many times the node was updated and which source_name it was updated to. A day with two sources is a day that it switched at least once

  created   |               sources                | n  
------------+--------------------------------------+----
 2023-04-11 | AirNow                               | 48
 2023-04-10 | AirNow, StateAir_KhartoumResidential | 71
 2023-04-09 | AirNow, StateAir_KhartoumResidential | 71
 2023-04-08 | AirNow, StateAir_KhartoumResidential | 70
 2023-04-07 | AirNow, StateAir_KhartoumResidential | 69
 2023-04-06 | AirNow, StateAir_KhartoumResidential | 46
 2023-04-05 | AirNow, StateAir_KhartoumResidential | 66
 2023-04-04 | AirNow, StateAir_KhartoumResidential | 74
 2023-04-03 | AirNow, StateAir_KhartoumResidential | 72
 2023-04-02 | AirNow                               | 72
 2023-04-01 | AirNow                               | 72
 2023-03-31 | AirNow, StateAir_KhartoumResidential | 72
 2023-03-30 | AirNow, StateAir_KhartoumResidential | 72
 2023-03-29 | AirNow, StateAir_KhartoumResidential | 72
 2023-03-28 | AirNow, StateAir_KhartoumResidential | 70
 2023-03-27 | AirNow                               | 72
 2023-03-26 | AirNow                               | 72
 2023-03-25 | AirNow                               | 72
 2023-03-24 | AirNow, StateAir_KhartoumResidential | 72
 2023-03-23 | AirNow                               | 70
 2023-03-22 | AirNow, StateAir_KhartoumResidential | 72
 2023-03-21 | AirNow                               | 72
 2023-03-20 | AirNow                               | 48
 2023-03-19 | AirNow, StateAir_KhartoumResidential | 24
 2023-03-18 | AirNow, StateAir_KhartoumResidential | 37
 2023-03-17 | AirNow, StateAir_KhartoumResidential | 72
 2023-03-16 | AirNow, StateAir_KhartoumResidential | 72
 2023-03-15 | AirNow                               | 72
 2023-03-14 | AirNow, StateAir_KhartoumResidential | 72
 2023-03-13 | AirNow, StateAir_KhartoumResidential | 72

This summary is from production but staging is similar

The solution to part of this is going to be the source/providers updates we have been talking about. The other issue, for example the node being updated every time we are ingesting will require some reworking of how we are comparing nodes on ingest.