CARMA Analytics is a secure, scalable, cloud-agnostic data architecture that enables research analysis on big data sets by defining repeatable and scalable processes for ingesting data and creating a data repository to fuse, process, and perform quality assurance on it.
Apache License 2.0
Change frequency calculation to take interval average #30
Currently, the scripts that plot message frequency for CARMA Streets messages in simulation have a built-in bias. First, they associate a simulation time with each message by comparing the wall time at which simulation time steps occur with the wall time at which messages are sent. Using the message timestamp alone is not adequate because services can fall behind in consuming/processing time steps, but this will not be evident in the message timestamp data since timestamps are processed sequentially by the service.
The scripts then calculate the instantaneous message interval (the difference between consecutive message simulation timestamps) and convert it to an instantaneous frequency. Finally, they plot a rolling average of this instantaneous frequency.
The issue arises when individual services process a simulation time step faster or slower than the simulation itself. Consider the following scenario: a service produces a message late, after the simulation has already moved on to the next time step, but is then able to produce the message for the current time step as well, effectively catching up. This will be reflected in the data as one simulation time step without a message followed by one with two messages.
With the current scripts we will get instantaneous intervals of 0.2 s and 0 s, respectively. When converting the 0 s interval to a frequency, that sample is lost, since 1/0 is infinite and treated as NA in the pandas data frame. If we then average the frequencies over this period we will have 5 Hz and NA, which average to 5 Hz, even though we sent 3 messages in 0.3 seconds, which should be 10 Hz. Example log showing the effect (note that the system times are equal):
[2024-03-12 18:30:27.576] [debug] [streets_service.cpp:96] Consumed: {"timestep":18600,"seq":187}
[2024-03-12 18:30:27.576] [info] [streets_service.cpp:103] Simulation Time: 18600 where current system time is: 1710282627576
[2024-03-12 18:30:27.576] [debug] [streets_service.cpp:96] Consumed: {"timestep":18700,"seq":188}
[2024-03-12 18:30:27.576] [info] [streets_service.cpp:103] Simulation Time: 18700 where current system time is: 1710282627576
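To make the bias concrete, here is a minimal pandas sketch of the scenario above. The timestamps are hypothetical, chosen to reproduce the 0.2 s and 0 s intervals described; the inf-to-NA masking is written out explicitly to mirror the behavior described:

```python
import numpy as np
import pandas as pd

# Hypothetical simulation timestamps (ms) for three messages: the second
# message was produced one step late, so messages 2 and 3 land on the
# same simulation time step.
ts_ms = pd.Series([18500.0, 18700.0, 18700.0])

interval_ms = ts_ms.diff()                                # [NaN, 200.0, 0.0]
freq_hz = (1000.0 / interval_ms).replace(np.inf, np.nan)  # [NaN, 5.0, NaN]

# Averaging instantaneous frequencies silently drops the 0 ms sample,
# so the mean reports 5 Hz even though the service caught back up.
print(freq_hz.mean())
```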
The way to correct for this is to calculate a rolling average of the interval and convert that to a frequency instead. This guarantees that we incorporate the higher-frequency data of two messages sent in a single time step.
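A minimal sketch of the corrected order of operations (variable names are illustrative, not the actual script's): average the intervals over a rolling window first, then invert the averaged interval.

```python
import pandas as pd

# Hypothetical simulation timestamps (ms); the last two messages share a
# time step because the service caught up after falling one step behind.
ts_ms = pd.Series([18100.0, 18300.0, 18500.0, 18700.0, 18700.0])

interval_ms = ts_ms.diff()                        # [NaN, 200, 200, 200, 0]
rolling_interval = interval_ms.rolling(window=2).mean()
rolling_freq_hz = 1000.0 / rolling_interval

# The final window averages the 200 ms and 0 ms intervals to 100 ms,
# i.e. 10 Hz: the catch-up message is no longer dropped.
print(rolling_freq_hz.iloc[-1])
```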
Additionally, the difference between window size and min period when calculating the rolling average was skewing the data lower at the start and end. To understand this, consider the beginning of a data set. Window size is the size of the moving window over which the rolling average is computed. Min period is an optional parameter that allows averages to be computed for portions of the data where a full window is not available. Having a large window size relative to the min period means that values that fall into the minimum period contribute more to the resulting data than values that do not. Consider the following data:
0,4,5,4,4,6,4,5,4,8
If we take a rolling average with a window size of 10 but a min period of 1, we get the following:
0,2,3,3.25,3.4,3.83,3.85,4,4,4.4
This averaged data is far lower than the original data. That is because the first entry was low and contributed 100% to the first value, 50% to the second value, 33% to the third value, and so on. In comparison, once the window size is reached, each value contributes only 10% to each of 10 averages. By removing the min period parameter, so that averages are only computed over full windows, we can guarantee an equal contribution from all values to our rolling average values.
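As a sketch with pandas (assuming the scripts use `Series.rolling`), the skew disappears once `min_periods` is left at its default, which equals the window size:

```python
import pandas as pd

data = pd.Series([0, 4, 5, 4, 4, 6, 4, 5, 4, 8], dtype=float)

# min_periods=1 lets partial windows through, so the early values dominate:
skewed = data.rolling(window=10, min_periods=1).mean()
print(skewed.tolist())   # starts at 0.0, 2.0, 3.0, ... and ends at 4.4

# Default min_periods equals the window size, so only full windows count;
# every value contributes equally (10%) to each average it appears in.
full = data.rolling(window=10).mean()
print(full.tolist())     # NaN until a full window is available, then 4.4
```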
Also added a plot legend.
Related GitHub Issue
Related Jira Key
CDAR-847
Motivation and Context
Fix data analysis scripts
How Has This Been Tested?
Using data collected from CDASim deployment
Types of changes
[x] Defect fix (non-breaking change that fixes an issue)
[ ] New feature (non-breaking change that adds functionality)
[ ] Breaking change (fix or feature that causes existing functionality to change)
Checklist:
[ ] I have added any new packages to the sonar-scanner.properties file
[x] My change requires a change to the documentation.