noi-techpark / bdp-commons

GNU Affero General Public License v3.0

As an air quality data scientist I would like to improve the elaboration workflow of the data collected by the low-cost innovative air quality sensors of A22 #286

Closed rcavaliere closed 3 years ago

rcavaliere commented 3 years ago

pres_24_06.pdf

Connected to the discussion in #265

UNITN has made a comparison between how we currently compute the elaborations (calibration) of the raw measurements and the desired approach they recommend, which is:

The difference is not negligible (see the PDF with the results of their analysis). The recommendation is therefore to implement their approach. Incidentally, many of the problems we faced with calculations that in some cases could not be computed stem from the non-optimal approach we implemented.

Tasks:

bertolla commented 3 years ago

@rcavaliere Question: Are these the correct data types?

 CO2_raw
 CO_raw
 NO2-Alphasense_processed
 NO2-Alphasense_raw
 NO2-Orion_raw
 NO2_raw
 NO-Alphasense_processed
 NO-Alphasense_raw
 O3_processed
 O3_raw
 PM10_processed
 PM10_raw
 PM2.5_processed
 PM2.5_raw
 RH_raw
 temperature-external_raw
 temperature-internal_raw
 VOC_raw

Is there a reason why NO and NO2 have no elaborations, while their 'Alphasense' counterparts do?

rcavaliere commented 3 years ago

@bertolla yes, they are. Data types with the suffix "_processed" are those for which the non-linear transformations (complex formulas) need to be computed on the hourly averages (only).

Summarizing again: "_raw" -> "_raw" (hourly averages) -> "_processed" (complex formulas)

NO2 and NO sensors: after some comparisons, UNITN has decided to focus only on the Alphasense sensors because they perform better. For the other sensors we currently just collect the raw data; maybe in the future we will use it as well.
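The two-stage chain described above can be sketched in Python. This is only an illustrative sketch: the averaging stage is a plain mean of the raw samples within one hour, while `calibrate` uses placeholder coefficients, since the actual non-linear formulas come from UNITN and differ per pollutant.

```python
from statistics import mean

def hourly_average(raw_samples):
    # Stage 1: aggregate the raw samples of one hour (period 600 -> 3600)
    # into a plain arithmetic mean.
    return mean(raw_samples)

def calibrate(hourly_avg, a=1.0, b=0.0):
    # Stage 2: apply the calibration to the *hourly average* only.
    # a and b are placeholder coefficients; the real UNITN formulas
    # are non-linear and pollutant-specific.
    return a * hourly_avg + b

# Six made-up 10-minute "_raw" samples of one hour
raw = [40.2, 41.0, 39.8, 40.5, 40.1, 40.4]
processed = calibrate(hourly_average(raw))
```

With the identity placeholder coefficients the "_processed" value equals the hourly average; the real calibration replaces that last step.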

bertolla commented 3 years ago

Do we have some sample data for a calculation with 1 hour?

rcavaliere commented 3 years ago

What do you mean? We continue to receive data from Algorab!

bertolla commented 3 years ago

Sure, but we do not have sample data of the processed values for a one-hour period. I would like to write some tests to be sure that the calculations are done correctly.

rcavaliere commented 3 years ago

Ok, I can ask the University of Trento to provide some test cases that we can automate. Can you implement everything in the meantime? They can then use Analytics to check if everything is correct.

bertolla commented 3 years ago

These two queries eliminate all processed data:

```sql
delete from measurementhistory where station_id in (select id from station where origin='a22-algorab' and stationtype='EnvironmentStation') and type_id in (select distinct type_id from station s join measurement m on m.station_id=s.id join type t on t.id=m.type_id where origin='a22-algorab' and stationtype='EnvironmentStation' and cname like '%processed');

delete from measurement where station_id in (select id from station where origin='a22-algorab' and stationtype='EnvironmentStation') and type_id in (select distinct type_id from station s join measurement m on m.station_id=s.id join type t on t.id=m.type_id where origin='a22-algorab' and stationtype='EnvironmentStation' and cname like '%processed');
```

bertolla commented 3 years ago

> Ok, I can ask the University of Trento to make some tests that we can automate. Can you in the meantime implement everything? They can use then analytics to check if everything is correct

I'm nearly finished, I'll just need to check if my calculations are right

bertolla commented 3 years ago

All calculations on period 3600 are already in the test database, so maybe you can have a look at them once you're back. I did not have anything to check my calculations against, so I expect a lot of wrong data ;-)

bertolla commented 3 years ago

First example on analytics you can check: https://analytics.opendatahub.testingmachine.eu/#%7B%22active_tab%22:0,%22height%22:%22400px%22,%22auto_refresh%22:false,%22scale%22:%7B%22from%22:1623535200000,%22to%22:1626213600000%7D,%22graphs%22:%5B%7B%22category%22:%22Air%20quality%22,%22station%22:%22AUGEG4_AIRQ01%22,%22station_name%22:%22103.700_APPA%20BZ%22,%22data_type%22:%22PM10_raw%22,%22unit%22:%22%22,%22period%22:%223600%22,%22yaxis%22:1,%22color%22:3%7D,%7B%22category%22:%22Air%20quality%22,%22station%22:%22AUGEG4_AIRQ01%22,%22station_name%22:%22103.700_APPA%20BZ%22,%22data_type%22:%22PM10_processed%22,%22unit%22:%22%22,%22period%22:%223600%22,%22yaxis%22:1,%22color%22:4%7D,%7B%22category%22:%22Air%20quality%22,%22station%22:%22AUGEG4_AIRQ01%22,%22station_name%22:%22103.700_APPA%20BZ%22,%22data_type%22:%22PM2.5_processed%22,%22unit%22:%22%22,%22period%22:%223600%22,%22yaxis%22:1,%22color%22:5%7D,%7B%22category%22:%22Air%20quality%22,%22station%22:%22AUGEG4_AIRQ01%22,%22station_name%22:%22103.700_APPA%20BZ%22,%22data_type%22:%22PM2.5_raw%22,%22unit%22:%22%22,%22period%22:%223600%22,%22yaxis%22:1,%22color%22:6%7D,%7B%22category%22:%22Air%20quality%22,%22station%22:%22AUGEG4_AIRQ01%22,%22station_name%22:%22103.700_APPA%20BZ%22,%22data_type%22:%22PM2.5_raw%22,%22unit%22:%22%22,%22period%22:%22600%22,%22yaxis%22:1,%22color%22:7%7D%5D%7D

rcavaliere commented 3 years ago

@bertolla I have tested this a little bit. The averages look good, while the complex elaborations (let's call them "calibrations") have issues, since I frequently see negative values. Can you give me the link to the source code of the calibrations? Maybe I can also check whether everything is mathematically correct.

Look at this example:

| id | created_on | period | timestamp | double_value | provenance_id | station_id | type_id |
|---|---|---|---|---|---|---|---|
| 842,913 | 2021-07-08 14:54:47 | 3,600 | 2021-06-17 12:00:00 | -45.9129207339 | 68 | 343,121 | 6,016 |
| 842,914 | 2021-07-08 14:54:52 | 3,600 | 2021-06-17 12:00:00 | -134.0229445809 | 68 | 343,121 | 6,015 |
| 774,492 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 85.5444010606 | 64 | 343,121 | 6,014 |
| 428,095 | 2020-06-19 12:20:13 | 600 | 2021-06-17 12:47:09 | 91.0020449898 | 63 | 343,121 | 6,014 |
| 774,491 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 90.7878787879 | 64 | 343,121 | 6,013 |
| 428,068 | 2020-06-18 12:51:26 | 600 | 2021-06-17 12:47:09 | 92 | 63 | 343,121 | 6,013 |
| 774,490 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 88.9696969697 | 64 | 343,121 | 6,012 |
| 428,069 | 2020-06-18 12:51:27 | 600 | 2021-06-17 12:47:09 | 88 | 63 | 343,121 | 6,012 |
| 774,489 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 42.6878787879 | 64 | 343,121 | 6,011 |
| 428,097 | 2020-06-19 12:20:16 | 600 | 2021-06-17 12:47:09 | 40.6 | 63 | 343,121 | 6,011 |
| 774,488 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 32.4212121212 | 64 | 343,121 | 6,010 |
| 428,096 | 2020-06-19 12:20:16 | 600 | 2021-06-17 12:47:09 | 32.5 | 63 | 343,121 | 6,010 |
| 842,917 | 2021-07-08 14:55:06 | 3,600 | 2021-06-17 12:00:00 | 22.6164627155 | 68 | 343,121 | 5,965 |
| 774,486 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 10.273280303 | 64 | 343,121 | 5,964 |
| 427,774 | 2020-06-05 14:45:13 | 600 | 2021-06-17 12:47:09 | 11.0109489051 | 63 | 343,121 | 5,964 |
| 842,916 | 2021-07-08 14:55:01 | 3,600 | 2021-06-17 12:00:00 | 29.1605479419 | 68 | 343,121 | 5,963 |
| 774,484 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 11.7890909091 | 64 | 343,121 | 5,962 |
| 427,772 | 2020-06-05 14:45:13 | 600 | 2021-06-17 12:47:09 | 15.06 | 63 | 343,121 | 5,962 |
| 774,483 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 521.4848484848 | 64 | 343,121 | 5,960 |
| 427,917 | 2020-06-08 09:23:26 | 600 | 2021-06-17 12:47:09 | 513 | 63 | 343,121 | 5,960 |
| 774,482 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 482.940849 | 64 | 343,121 | 5,958 |
| 427,764 | 2020-06-05 14:45:13 | 600 | 2021-06-17 12:47:09 | 482.940848797 | 63 | 343,121 | 5,958 |
| 774,481 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 216.4793939394 | 64 | 343,121 | 5,956 |
| 427,766 | 2020-06-05 14:45:13 | 600 | 2021-06-17 12:47:09 | 211.23 | 63 | 343,121 | 5,956 |
| 842,915 | 2021-07-08 14:54:57 | 3,600 | 2021-06-17 12:00:00 | 406.2479630219 | 68 | 343,121 | 5,955 |
| 774,479 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 157.5548458485 | 64 | 343,121 | 5,954 |
| 427,770 | 2020-06-05 14:45:13 | 600 | 2021-06-17 12:47:09 | 138.1315520945 | 63 | 343,121 | 5,954 |
| 774,478 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 59.0121212121 | 64 | 343,121 | 5,952 |
| 427,915 | 2020-06-08 09:23:09 | 600 | 2021-06-17 12:47:09 | 54.4 | 63 | 343,121 | 5,952 |
| 774,477 | 2021-04-15 15:25:19 | 3,600 | 2021-06-17 12:00:00 | 403.3333333333 | 64 | 343,121 | 5,948 |
| 427,762 | 2020-06-05 14:45:13 | 600 | 2021-06-17 12:47:09 | 406 | 63 | 343,121 | 5,948 |

where

| id | cname | created_on | cunit | description | rtype | meta_data_id |
|---|---|---|---|---|---|---|
| 5,955 | O3_processed | 2020-02-25 12:50:03 | | O3 | Mean | 260 |
| 5,963 | PM10_processed | 2020-02-25 12:50:03 | | PM10 | Mean | 268 |
| 5,965 | PM2.5_processed | 2020-02-25 12:50:03 | | PM2.5 | Mean | 270 |
| 5,948 | CO2_raw | 2020-02-25 12:50:03 | [NULL] | CO2 | Mean | 253 |
| 6,010 | temperature-external_raw | 2020-06-18 12:15:00 | [NULL] | Temperatura | Mean | 324 |
| 5,952 | RH_raw | 2020-02-25 12:50:03 | [NULL] | Umidità relativa | Mean | 257 |
| 6,011 | temperature-internal_raw | 2020-06-18 12:15:00 | [NULL] | Temperatura interna | Mean | 325 |
| 5,954 | O3_raw | 2020-02-25 12:50:03 | [NULL] | O3 | Mean | 259 |
| 5,958 | CO_raw | 2020-02-25 12:50:03 | [NULL] | CO | Mean | 263 |
| 6,015 | NO-Alphasense_processed | 2020-06-18 12:15:00 | | NO (Alphasense) | Mean | 329 |
| 6,016 | NO2-Alphasense_processed | 2020-06-18 12:15:00 | | NO2 (Alphasense) | Mean | 330 |
| 5,960 | VOC_raw | 2020-02-25 12:50:03 | [NULL] | VOC | Mean | 265 |
| 5,962 | PM10_raw | 2020-02-25 12:50:03 | [NULL] | PM10 | Mean | 267 |
| 5,964 | PM2.5_raw | 2020-02-25 12:50:03 | [NULL] | PM2.5 | Mean | 269 |
| 6,012 | NO2-Alphasense_raw | 2020-06-18 12:15:00 | [NULL] | NO2 (Alphasense) | Mean | 326 |
| 6,013 | NO-Alphasense_raw | 2020-06-18 12:15:00 | [NULL] | NO (Alphasense) | Mean | 327 |
| 6,014 | NO2-Orion_raw | 2020-06-18 12:15:00 | [NULL] | NO2 (Orion) | Mean | 328 |
| 5,956 | NO2_raw | 2020-02-25 12:50:03 | [NULL] | NO2 | Mean | 261 |

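The suspicious rows can be flagged programmatically. A minimal sketch, using hypothetical tuples that mirror the first three rows of the dump above (the real data would come from a database query):

```python
# Hypothetical tuples shaped like the dump: (id, period, double_value, type_id)
rows = [
    (842913, 3600, -45.9129207339, 6016),   # NO2-Alphasense_processed
    (842914, 3600, -134.0229445809, 6015),  # NO-Alphasense_processed
    (774492, 3600, 85.5444010606, 6014),    # NO2-Orion_raw
]

# A calibrated concentration below zero is physically meaningless,
# so any such row points at a calibration problem.
negative = [r for r in rows if r[2] < 0]
```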
rcavaliere commented 3 years ago

https://github.com/noi-techpark/bdp-elaborations/blob/2e26b1060d45b187f33fb4cb98928a69d76dba65/Environment-A22-Processing/src/dataprocessor/DataProcessor.py#L60

rcavaliere commented 3 years ago

@bertolla wonderful news. I completed a significant test (see attachment) and in my opinion the processing computations are implemented correctly. The only thing that needs to be added is the following rule, to be applied to all calibrations: if (processed_value < 0) then (processed_value = 0).

Once this improvement is implemented, please run everything again in the testing environment and let's check that we no longer have negative values. Then we can go into production.

My proposal is to delete all historical processed values (keeping only the raw measurements) and re-run the new elaborations from the beginning of the history we have. The only bad thing is that the calibration coefficients have changed over time. I don't know if we are able to apply the correct calibration coefficients depending on the measurement time... let's discuss this if it is not too complicated. With the JSON file you showed me, this should now be easier to manage.

Test.csv
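The requested rule is a simple clamp at zero; a minimal sketch:

```python
def clamp_processed(value):
    # Agreed rule: negative calibrated values are floored to 0.
    # 0 represents a very low concentration, as opposed to a missing
    # value, which means no measurement at all.
    return max(value, 0.0)
```

For example, `clamp_processed(-134.02)` yields `0.0`, while non-negative values pass through unchanged.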

bertolla commented 3 years ago

Well, ok, but I wouldn't save it at all at this point if it makes no sense :confused:

rcavaliere commented 3 years ago

What do you mean?

bertolla commented 3 years ago

If the negative value makes no sense at all, why even bother saving anything else then?

rcavaliere commented 3 years ago

No, they do make sense. No value means no measurement; here, instead, we have a raw measurement whose meaning is a very low concentration. So please save it as requested.

rcavaliere commented 3 years ago

@bertolla I have tested the '0' presence, everything is fine. What is the status of implementing the calculation of the averages as a lambda function?

bertolla commented 3 years ago

Well, the averages already exist as a separate module, a Java app running on Docker. If Lukas M. has time he will try to re-implement it as a lambda function. For now the whole chain is working as expected.

rcavaliere commented 3 years ago

OK, can we go into production in the meantime?

bertolla commented 3 years ago

Yes, sure. Move to user-story and I'll do it