A couple of design decisions in FLAViz lead to performance issues.
The HDF5 file has to be read in completely before data filtering, selection, etc. can begin. This is particularly an issue for large-scale data sets, where the read-in phase can take a long time. If the user only wishes to plot a specific subset of the data, reading in the entire file is unnecessary. A solution would be to query the HDF5 file for the subset of data and retrieve only that part. This would probably require moving away from HDF5 as a data format and using a different database format.
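That said, if the frames were written in PyTables 'table' format, pandas can already evaluate a row query on disk and retrieve only the matching part. A minimal sketch, assuming a table-format store; the key 'agent_data' and the column and index names are hypothetical:

    import pandas as pd

    # Minimal sketch, assuming the data was stored in PyTables 'table' format
    # (format='table'); fixed-format stores cannot be queried on disk.
    # The key 'agent_data' and the column/index names are hypothetical.
    with pd.HDFStore('Agent.h5', mode='r') as store:
        subset = store.select('agent_data',
                              where='round > 100',          # row filter evaluated on disk
                              columns=['price', 'output'])  # read only these columns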
No parallelization. All processing in the Python scripts is serial. A couple of points are parallelizable, however:
1 Data conversion from db to h5. This works on a file-per-file basis and is highly parallelizable by simply launching multiple sub-processes that run the same db_hdf5_v2.py script on a subset of the files (a minimal sketch follows this list). Currently only one core is used.
1b Processing of the per-set-and-run files set_*_run_*.h5
Since these files are generated at an intermediate stage (translated from the set_run.db files), the FLAViz routines could work on these files instead of on the more monolithic Agent.h5 files. If only a subset of the data is required for some specific task, it is clearly more efficient to use only the files needed, rather than loading the entire data set into memory.
2 Plotting. Multiple plots are currently processed one by one. Each plot could be a separate sub-process that retrieves data from the main data set once it has been read into main memory (see the second sketch below).
3 Transformations. If there are multiple tasks, each task could run in its own sub-process, allocated to a different core.
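A minimal sketch of point 1, assuming db_hdf5_v2.py accepts a single database file as a command-line argument (its actual interface may differ):

    import glob
    import subprocess
    from multiprocessing import Pool

    def convert(db_file):
        # Each worker runs the existing conversion script on one file.
        subprocess.run(['python', 'db_hdf5_v2.py', db_file], check=True)

    if __name__ == '__main__':
        db_files = glob.glob('set_*_run_*.db')  # assumed naming pattern
        with Pool() as pool:                    # one worker per core by default
            pool.map(convert, db_files)

The same Pool pattern carries over to point 3: map each transformation task to its own worker.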
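For point 2, each plot can likewise be handed off to a worker once the parent process holds the data. A sketch, with hypothetical key and column names; note that each worker receives only the slice it needs:

    import pandas as pd
    import matplotlib
    matplotlib.use('Agg')  # headless backend, safe in worker processes
    import matplotlib.pyplot as plt
    from multiprocessing import Pool

    def make_plot(args):
        frame, variable = args
        # Each worker plots one variable and writes its own file,
        # so the plots are fully independent of each other.
        ax = frame.plot(y=variable)
        ax.get_figure().savefig(variable + '.png')
        plt.close('all')

    if __name__ == '__main__':
        data = pd.read_hdf('Agent.h5', 'agent_data')  # read once in the parent
        variables = ['price', 'output', 'wage']       # hypothetical column names
        with Pool() as pool:
            pool.map(make_plot, [(data[[v]], v) for v in variables])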
Testing performance
There have been some preliminary attempts to test the performance of the scripts; these are documented in the manual.
Slicing and indexing the h5 file as NumPy arrays with pandas for a specific subset of the data will allow the user to plot a number of subsets instead of the whole file; failing that, chunked reading is an option.
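A minimal sketch of the chunked fallback, again assuming a table-format store; the key 'agent_data' and the 'run' index level are hypothetical:

    import pandas as pd

    pieces = []
    # Iterate over the store in fixed-size chunks instead of loading it whole.
    for chunk in pd.read_hdf('Agent.h5', 'agent_data', chunksize=100000):
        # Keep only the rows of interest from each chunk, e.g. a single run.
        pieces.append(chunk[chunk.index.get_level_values('run') == 1])
    subset = pd.concat(pieces)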