noi-techpark / bdp-commons

GNU Affero General Public License v3.0

As an air quality data scientist I would like to improve the elaboration workflow of the data collected by the low-cost innovative air quality sensors of A22 #286

Closed rcavaliere closed 3 years ago

rcavaliere commented 3 years ago

pres_24_06.pdf

Connected to the discussion in #265

UNITN has made a comparison between how we currently compute the elaborations (calibration) of the raw measurements and the desired approach they recommend, which is:

The difference is not negligible (see the PDF with the results of their analysis). The recommendation is therefore to implement their approach. Incidentally, many of the problems we faced with calculations that in some cases could not be computed stem from the non-optimal approach we implemented.

Tasks:

bertolla commented 3 years ago

@rcavaliere Question: Are these the correct data types?

 CO2_raw
 CO_raw
 NO2-Alphasense_processed
 NO2-Alphasense_raw
 NO2-Orion_raw
 NO2_raw
 NO-Alphasense_processed
 NO-Alphasense_raw
 O3_processed
 O3_raw
 PM10_processed
 PM10_raw
 PM2.5_processed
 PM2.5_raw
 RH_raw
 temperature-external_raw
 temperature-internal_raw
 VOC_raw

Is there a reason why NO and NO2 have no elaborations, while their 'Alphasense' counterparts do?

rcavaliere commented 3 years ago

@bertolla yes, they are. Data types with the suffix "_processed" are those for which the non-linear transformations (complex formulas) need to be computed on the hourly averages (only).

Summarizing again: "_raw" -> "_raw" (hourly averages) -> "_processed" (complex formulas)

NO2 and NO sensors: after some comparisons, UNITN has decided to focus only on the Alphasense sensors because they perform better. For the other sensors we currently just collect the raw data; maybe in the future we will use it as well.
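The two-stage chain described above can be sketched in Python. This is only an illustrative sketch: the averaging stage is a plain mean of the raw samples within one hour, while `calibrate` uses placeholder coefficients, since the actual non-linear formulas come from UNITN and differ per pollutant.

```python
from statistics import mean

def hourly_average(raw_samples):
    # Stage 1: aggregate the raw samples of one hour (period 600 -> 3600)
    # into a plain arithmetic mean.
    return mean(raw_samples)

def calibrate(hourly_avg, a=1.0, b=0.0):
    # Stage 2: apply the calibration to the *hourly average* only.
    # a and b are placeholder coefficients; the real UNITN formulas
    # are non-linear and pollutant-specific.
    return a * hourly_avg + b

# Six made-up 10-minute "_raw" samples of one hour
raw = [40.2, 41.0, 39.8, 40.5, 40.1, 40.4]
processed = calibrate(hourly_average(raw))
```

With the identity placeholder coefficients the "_processed" value equals the hourly average; the real calibration replaces that last step.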

bertolla commented 3 years ago

Do we have some sample data for a calculation with 1 hour?

rcavaliere commented 3 years ago

What do you mean? We continue to receive data from Algorab!

bertolla commented 3 years ago

Sure, but we do not have sample data of the processed values for a one-hour period. I would like to write some tests to be sure that the calculations are done correctly.

rcavaliere commented 3 years ago

Ok, I can ask the University of Trento to provide some test cases that we can automate. Can you implement everything in the meantime? They can then use Analytics to check if everything is correct.

bertolla commented 3 years ago

These two queries eliminate all processed data:

```sql
delete from measurementhistory where station_id in (select id from station where origin='a22-algorab' and stationtype='EnvironmentStation') and type_id in (select distinct type_id from station s join measurement m on m.station_id=s.id join type t on t.id=m.type_id where origin='a22-algorab' and stationtype='EnvironmentStation' and cname like '%processed');

delete from measurement where station_id in (select id from station where origin='a22-algorab' and stationtype='EnvironmentStation') and type_id in (select distinct type_id from station s join measurement m on m.station_id=s.id join type t on t.id=m.type_id where origin='a22-algorab' and stationtype='EnvironmentStation' and cname like '%processed');
```

bertolla commented 3 years ago

> Ok, I can ask the University of Trento to make some tests that we can automate. Can you in the meantime implement everything? They can use then analytics to check if everything is correct

I'm nearly finished, I'll just need to check if my calculations are right

bertolla commented 3 years ago

All calculations on period 3600 are already in the test database, so maybe you can have a look at them once you're back. I did not have anything to check my calculations against, so I expect a lot of wrong data ;-)

bertolla commented 3 years ago

First example on analytics you can check: https://analytics.opendatahub.testingmachine.eu/#%7B%22active_tab%22:0,%22height%22:%22400px%22,%22auto_refresh%22:false,%22scale%22:%7B%22from%22:1623535200000,%22to%22:1626213600000%7D,%22graphs%22:%5B%7B%22category%22:%22Air%20quality%22,%22station%22:%22AUGEG4_AIRQ01%22,%22station_name%22:%22103.700_APPA%20BZ%22,%22data_type%22:%22PM10_raw%22,%22unit%22:%22%22,%22period%22:%223600%22,%22yaxis%22:1,%22color%22:3%7D,%7B%22category%22:%22Air%20quality%22,%22station%22:%22AUGEG4_AIRQ01%22,%22station_name%22:%22103.700_APPA%20BZ%22,%22data_type%22:%22PM10_processed%22,%22unit%22:%22%22,%22period%22:%223600%22,%22yaxis%22:1,%22color%22:4%7D,%7B%22category%22:%22Air%20quality%22,%22station%22:%22AUGEG4_AIRQ01%22,%22station_name%22:%22103.700_APPA%20BZ%22,%22data_type%22:%22PM2.5_processed%22,%22unit%22:%22%22,%22period%22:%223600%22,%22yaxis%22:1,%22color%22:5%7D,%7B%22category%22:%22Air%20quality%22,%22station%22:%22AUGEG4_AIRQ01%22,%22station_name%22:%22103.700_APPA%20BZ%22,%22data_type%22:%22PM2.5_raw%22,%22unit%22:%22%22,%22period%22:%223600%22,%22yaxis%22:1,%22color%22:6%7D,%7B%22category%22:%22Air%20quality%22,%22station%22:%22AUGEG4_AIRQ01%22,%22station_name%22:%22103.700_APPA%20BZ%22,%22data_type%22:%22PM2.5_raw%22,%22unit%22:%22%22,%22period%22:%22600%22,%22yaxis%22:1,%22color%22:7%7D%5D%7D

rcavaliere commented 3 years ago

@bertolla I have tested this a little bit. The averages look good, while the complex elaborations (let's call them "calibrations") have issues, since I frequently see negative values. Can you give me the link to the source code of the calibrations? Maybe I can also check whether everything is mathematically correct.

Look at this example:

| id | created_on | period | timestamp | double_value | provenance_id | station_id | type_id |
|---|---|---|---|---|---|---|---|
| 842,913 | 2021-07-08 14:54:47 | 3,600 | 2021-06-17 12:00:00 | -45.9129207339 | 68 | 343,121 | 6,016 |
| 842,914 | 2021-07-08 14:54:52 | 3,600 | 2021-06-17 12:00:00 | -134.0229445809 | 68 | 343,121 | 6,015 |
| 774,492 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 85.5444010606 | 64 | 343,121 | 6,014 |
| 428,095 | 2020-06-19 12:20:13 | 600 | 2021-06-17 12:47:09 | 91.0020449898 | 63 | 343,121 | 6,014 |
| 774,491 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 90.7878787879 | 64 | 343,121 | 6,013 |
| 428,068 | 2020-06-18 12:51:26 | 600 | 2021-06-17 12:47:09 | 92 | 63 | 343,121 | 6,013 |
| 774,490 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 88.9696969697 | 64 | 343,121 | 6,012 |
| 428,069 | 2020-06-18 12:51:27 | 600 | 2021-06-17 12:47:09 | 88 | 63 | 343,121 | 6,012 |
| 774,489 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 42.6878787879 | 64 | 343,121 | 6,011 |
| 428,097 | 2020-06-19 12:20:16 | 600 | 2021-06-17 12:47:09 | 40.6 | 63 | 343,121 | 6,011 |
| 774,488 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 32.4212121212 | 64 | 343,121 | 6,010 |
| 428,096 | 2020-06-19 12:20:16 | 600 | 2021-06-17 12:47:09 | 32.5 | 63 | 343,121 | 6,010 |
| 842,917 | 2021-07-08 14:55:06 | 3,600 | 2021-06-17 12:00:00 | 22.6164627155 | 68 | 343,121 | 5,965 |
| 774,486 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 10.273280303 | 64 | 343,121 | 5,964 |
| 427,774 | 2020-06-05 14:45:13 | 600 | 2021-06-17 12:47:09 | 11.0109489051 | 63 | 343,121 | 5,964 |
| 842,916 | 2021-07-08 14:55:01 | 3,600 | 2021-06-17 12:00:00 | 29.1605479419 | 68 | 343,121 | 5,963 |
| 774,484 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 11.7890909091 | 64 | 343,121 | 5,962 |
| 427,772 | 2020-06-05 14:45:13 | 600 | 2021-06-17 12:47:09 | 15.06 | 63 | 343,121 | 5,962 |
| 774,483 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 521.4848484848 | 64 | 343,121 | 5,960 |
| 427,917 | 2020-06-08 09:23:26 | 600 | 2021-06-17 12:47:09 | 513 | 63 | 343,121 | 5,960 |
| 774,482 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 482.940849 | 64 | 343,121 | 5,958 |
| 427,764 | 2020-06-05 14:45:13 | 600 | 2021-06-17 12:47:09 | 482.940848797 | 63 | 343,121 | 5,958 |
| 774,481 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 216.4793939394 | 64 | 343,121 | 5,956 |
| 427,766 | 2020-06-05 14:45:13 | 600 | 2021-06-17 12:47:09 | 211.23 | 63 | 343,121 | 5,956 |
| 842,915 | 2021-07-08 14:54:57 | 3,600 | 2021-06-17 12:00:00 | 406.2479630219 | 68 | 343,121 | 5,955 |
| 774,479 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 157.5548458485 | 64 | 343,121 | 5,954 |
| 427,770 | 2020-06-05 14:45:13 | 600 | 2021-06-17 12:47:09 | 138.1315520945 | 63 | 343,121 | 5,954 |
| 774,478 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 59.0121212121 | 64 | 343,121 | 5,952 |
| 427,915 | 2020-06-08 09:23:09 | 600 | 2021-06-17 12:47:09 | 54.4 | 63 | 343,121 | 5,952 |
| 774,477 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 403.3333333333 | 64 | 343,121 | 5,948 |
| 427,762 | 2020-06-05 14:45:13 | 600 | 2021-06-17 12:47:09 | 406 | 63 | 343,121 | 5,948 |

where

| id | cname | created_on | cunit | description | rtype | meta_data_id |
|---|---|---|---|---|---|---|
| 5,955 | O3_processed | 2020-02-25 12:50:03 | | O3 | Mean | 260 |
| 5,963 | PM10_processed | 2020-02-25 12:50:03 | | PM10 | Mean | 268 |
| 5,965 | PM2.5_processed | 2020-02-25 12:50:03 | | PM2.5 | Mean | 270 |
| 5,948 | CO2_raw | 2020-02-25 12:50:03 | [NULL] | CO2 | Mean | 253 |
| 6,010 | temperature-external_raw | 2020-06-18 12:15:00 | [NULL] | Temperatura | Mean | 324 |
| 5,952 | RH_raw | 2020-02-25 12:50:03 | [NULL] | Umidità relativa | Mean | 257 |
| 6,011 | temperature-internal_raw | 2020-06-18 12:15:00 | [NULL] | Temperatura interna | Mean | 325 |
| 5,954 | O3_raw | 2020-02-25 12:50:03 | [NULL] | O3 | Mean | 259 |
| 5,958 | CO_raw | 2020-02-25 12:50:03 | [NULL] | CO | Mean | 263 |
| 6,015 | NO-Alphasense_processed | 2020-06-18 12:15:00 | | NO (Alphasense) | Mean | 329 |
| 6,016 | NO2-Alphasense_processed | 2020-06-18 12:15:00 | | NO2 (Alphasense) | Mean | 330 |
| 5,960 | VOC_raw | 2020-02-25 12:50:03 | [NULL] | VOC | Mean | 265 |
| 5,962 | PM10_raw | 2020-02-25 12:50:03 | [NULL] | PM10 | Mean | 267 |
| 5,964 | PM2.5_raw | 2020-02-25 12:50:03 | [NULL] | PM2.5 | Mean | 269 |
| 6,012 | NO2-Alphasense_raw | 2020-06-18 12:15:00 | [NULL] | NO2 (Alphasense) | Mean | 326 |
| 6,013 | NO-Alphasense_raw | 2020-06-18 12:15:00 | [NULL] | NO (Alphasense) | Mean | 327 |
| 6,014 | NO2-Orion_raw | 2020-06-18 12:15:00 | [NULL] | NO2 (Orion) | Mean | 328 |
| 5,956 | NO2_raw | 2020-02-25 12:50:03 | [NULL] | NO2 | Mean | 261 |

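The suspicious rows can be flagged programmatically. A minimal sketch, using hypothetical tuples that mirror the first three rows of the dump above (the real data would come from a database query):

```python
# Hypothetical tuples shaped like the dump: (id, period, double_value, type_id)
rows = [
    (842913, 3600, -45.9129207339, 6016),   # NO2-Alphasense_processed
    (842914, 3600, -134.0229445809, 6015),  # NO-Alphasense_processed
    (774492, 3600, 85.5444010606, 6014),    # NO2-Orion_raw
]

# A calibrated concentration below zero is physically meaningless,
# so any such row points at a calibration problem.
negative = [r for r in rows if r[2] < 0]
```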
rcavaliere commented 3 years ago

https://github.com/noi-techpark/bdp-elaborations/blob/2e26b1060d45b187f33fb4cb98928a69d76dba65/Environment-A22-Processing/src/dataprocessor/DataProcessor.py#L60

rcavaliere commented 3 years ago

@bertolla wonderful news. I completed a significant test (see attachment) and in my opinion the processing computations are implemented correctly. The only thing that needs to be added is the following rule, to be applied to all calibrations: if (processed_value < 0) then (processed_value = 0).

Once this improvement is implemented, please run everything again in the testing environment and let's check that we no longer have negative values. Then we can go into production.

My proposal is to delete all historical processed values (keeping only the raw measurements) and re-run the new elaborations from the beginning of the history we have. The only bad thing is that the calibration coefficients have changed over time. I don't know if we are able to apply the correct calibration coefficients depending on the measurement time... let's discuss this if it is not too complicated. With the JSON file you showed me, this should now be easier to manage.

Test.csv
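The requested rule is a simple clamp at zero; a minimal sketch:

```python
def clamp_processed(value):
    # Agreed rule: negative calibrated values are floored to 0.
    # 0 represents a very low concentration, as opposed to a missing
    # value, which means no measurement at all.
    return max(value, 0.0)
```

For example, `clamp_processed(-134.02)` yields `0.0`, while non-negative values pass through unchanged.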

bertolla commented 3 years ago

Well, ok, but I wouldn't save it at all at this point if it makes no sense :confused:

rcavaliere commented 3 years ago

What do you mean?

bertolla commented 3 years ago

If the negative value makes no sense at all, why even bother saving anything else then?

rcavaliere commented 3 years ago

No, they do make sense. No value means no measurement; here, instead, we have a raw measurement whose meaning is a very low concentration. So please save it as requested.

rcavaliere commented 3 years ago

@bertolla I have tested the '0' presence, everything is fine. What is the status of implementing the calculation of the averages as a lambda function?

bertolla commented 3 years ago

Well, the averages already exist as a separate module, a Java app running on Docker. If Lukas M. has time he will try to re-implement it as a lambda function. For now the whole chain is working as expected.

rcavaliere commented 3 years ago

OK, can we go into production in the meantime?

bertolla commented 3 years ago

Yes, sure. Move to user-story and I'll do it