The realtime analysis of the data from the CTA arrays poses interesting challenges to the design of a proper execution environment, one that is capable of performing the required steps of data calibration, image cleaning, and alert detection.
With the streams-cta project we investigate the use of highly scalable platforms such as Apache Storm and Apache Kafka, which the computer science community has recently released to tackle Big Data analysis requirements. Based on the intermediate streams framework, which serves as a middle-layer design tool, we test different implementations and platforms for their performance and scalability.
The input to the realtime analysis (RTA) consists of calibrated CTA array events.
These are the static images (given as the estimated number of photons per camera pixel) from each telescope that triggered during the event.
Each telescope's data is stored as a simple array of doubles.
Each telescope has a unique ID, starting at 1 and counting upwards.
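Conceptually, a calibrated event is therefore a mapping from telescope ID to camera image. The following Java sketch illustrates this data model; the class and method names are purely illustrative and not the actual streams-cta types:

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; not the actual streams-cta event type.
public class CalibratedEvent {

    // Maps each triggering telescope's unique ID (starting at 1)
    // to its camera image: estimated number of photons per pixel.
    private final Map<Integer, double[]> images = new HashMap<>();

    public void addImage(int telescopeId, double[] photonsPerPixel) {
        images.put(telescopeId, photonsPerPixel);
    }

    public double[] getImage(int telescopeId) {
        return images.get(telescopeId);
    }

    public int numTriggeredTelescopes() {
        return images.size();
    }
}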
Image features are calculated for each camera separately.
To introduce hierarchical semantics, the data is stored according to the following naming scheme proposal:
For per-camera/telescope-specific image features:
telescope:<id>:<feature-group-name>:<feature-name>:*
Some examples for the well-known Hillas parameters:
telescope:<id>:shower:width
telescope:<id>:shower:cog:x
telescope:<id>:shower:cog:y
Monte Carlo information that is array-wide:
mc:<mc-value-name>
So, for example, the true energy could be stored as:
mc:primary_energy
Array-wide information. This should eventually contain things like the event timestamp; for now it is really just trigger information:
array:triggered_telescope_ids
array:num_triggered_telescopes
More to come soon. Keep in mind that these are currently only proposals; this might change quickly.
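To make the scheme concrete, here is a minimal Java sketch of how such keys could be assembled and stored in a flat key-value item. The helper method and the plain HashMap are assumptions for illustration, not part of the proposal:

import java.util.HashMap;
import java.util.Map;

public class NamingSchemeExample {

    // Builds a per-telescope key such as "telescope:3:shower:width".
    static String telescopeKey(int id, String... parts) {
        return "telescope:" + id + ":" + String.join(":", parts);
    }

    public static void main(String[] args) {
        Map<String, Object> item = new HashMap<>();

        // Per-telescope Hillas features
        item.put(telescopeKey(3, "shower", "width"), 0.042);
        item.put(telescopeKey(3, "shower", "cog", "x"), -0.12);
        item.put(telescopeKey(3, "shower", "cog", "y"), 0.31);

        // Array-wide Monte Carlo truth and trigger information
        item.put("mc:primary_energy", 1.7);
        item.put("array:triggered_telescope_ids", new int[] {1, 3, 7});
        item.put("array:num_triggered_telescopes", 3);

        System.out.println(item.keySet());
    }
}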
The input to this program is not EventIO files but already calibrated events. These can be produced from any EventIO file using the convert_raw_data.py script in the python folder. It will create a gzipped JSON file containing the calibrated images for all the events in the EventIO file. Input files can be found on Konrad Bernlöhr's website.
The URLs of the files look like this for the super arrays:
https://www.mpi-hd.mpg.de/personalhomes/bernlohr/cta-raw/Prod-3/Paranal/electron_20deg_0deg_run732___cta-prod3-merged_desert-2150m-Paranal-subarray-1.simtel.gz
https://www.mpi-hd.mpg.de/personalhomes/bernlohr/cta-raw/Prod-3/Paranal/proton_20deg_0deg_run8050___cta-prod3-merged_desert-2150m-Paranal-subarray-1.simtel.gz
https://www.mpi-hd.mpg.de/personalhomes/bernlohr/cta-raw/Prod-3/Paranal/gamma_20deg_0deg_run8142___cta-prod3-merged_desert-2150m-Paranal-subarray-1.simtel.gz
There are 5 sub-arrays to pick from; I don't know what the differences are. The password is the well-known CTA password; contact me if you don't know it. These URLs point to some sub-arrays, I believe:
https://www.mpi-hd.mpg.de/personalhomes/bernlohr/cta-raw/Prod-3/Paranal/Remerged-3HB8/proton_20deg_0deg_run7866___cta-prod3-merged_desert-2150m-Paranal-3HB8-NG.simtel.gz
https://www.mpi-hd.mpg.de/personalhomes/bernlohr/cta-raw/Prod-3/Paranal/Remerged-3HB8/gamma_20deg_0deg_run5706___cta-prod3-merged_desert-2150m-Paranal-3HB8-NG.simtel.gz
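Once convert_raw_data.py has produced a gzipped JSON file, it can be inspected from Java. The sketch below decompresses the file and parses it into a generic JSON tree; the use of Gson is just an assumption (any JSON library works), and the file's exact layout is not fixed here:

import com.google.gson.JsonElement;
import com.google.gson.JsonParser;

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class ReadCalibratedEvents {

    public static void main(String[] args) throws Exception {
        // Path to a file produced by convert_raw_data.py,
        // e.g. "events.json.gz" (hypothetical name).
        String path = args[0];

        try (Reader reader = new InputStreamReader(
                new GZIPInputStream(new FileInputStream(path)),
                StandardCharsets.UTF_8)) {
            // Parse the whole decompressed stream into a generic JSON tree.
            JsonElement root = JsonParser.parseReader(reader);
            System.out.println("Top-level JSON element is an array: " + root.isJsonArray());
        }
    }
}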
For now, check out the XML files in the streams-processes folder.
As CTA is going to produce a huge stream of data, one needs to ensure that enough machines are available to process it. Apache Storm is an approach for simple handling of a cluster and for deploying processing tasks (a.k.a. topologies).
Using Maven profiles it is possible to package cta-streams for various distributed processing frameworks. At the moment the following support is enabled:
For Flink and Spark, the packaging is done in a single step:
mvn -P deploy,{flink,spark} package
Deploying to a Storm cluster requires two jar files: one for the deployment itself, without the Storm runtime (the cluster already provides it), and one for running locally. Thus, we need two commands to produce those packages:
# package for deployment
mvn -P deploy,storm package
# package for local start and transformation step
mvn -P standalone,storm package
As a result, the following jar files are produced (one for Flink and Spark, two for Storm):
# run locally
streams-cta-0.0.3-SNAPSHOT-{platform}-compiled.jar
# does not contain storm, will be deployed
streams-cta-0.0.3-SNAPSHOT-storm-provided.jar
First of all, we need to build a standalone package with Flink:
mvn -P standalone,flink package
Using this jar we are then able to run the example speed.xml as follows:
java -jar target/streams-cta-0.0.3-SNAPSHOT-flink-compiled.jar streams-processes/speed.xml
More details about starting a standalone Flink cluster or deploying jobs to an existing YARN cluster can be found in the streams-flink repository.
We intend to use the Java code style suggested by Google.
If you're using a Java IDE such as IntelliJ or Eclipse, you can import this style guide by following the simple instructions.
For Mac users, the path to the codestyles folder is: ~/Library/Preferences/IdeaICxx/codestyles
Afterwards, your IDE can e.g. reformat your code to the code style suggested there (in IntelliJ: Code -> Reformat Code...).
Here's a bunch of other proposals and design overviews for the CTA project: