Released under The MIT License (refer to LICENSE file)
Software | Version | Test Version | Purpose |
---|---|---|---|
Python | 2.7 | 2.7 | profiler and post-processing |
gnuplot | >= 4.6 | 4.6.3 | plots for post-processing |
Perl | 5.10.1 | 5.10.1 | generating workflow script |
collect_stats.ksh | 0.1 | 0.1 | data collection |
kill_scripts/ | 0.1 | 0.1 | halting data collection |
sysstat | 9.0.4** | 9.0.4 | sar and iostat tools |
** We have tested with sysstat 9.0.4 and provide built-in support for older versions via sar's --legacy option. However, behavior with older versions is not guaranteed.
User-defined workflows or pipelines can be defined as an ordered sequence of one or more processing stages. Workflow Profiler provides a fast, automated means to profile and understand coarse-grain resource utilization for user-defined workflows using freely available Linux profiling tools. It allows for easy definition of workflows and maintains an explicit partition of the individual stages, giving the user a clear view of system utilization across the various stages while still offering a unified view of the overall workflow. It is a complete package that automates execution of the workflow and collection of statistics, then post-processes the profiled data to generate CSVs and plots illustrating resource utilization.
The overall workflow profiling can be broken down into two main stages:
1. Workflow execution and data collection
2. Post-processing
1) Create a workflow based on data_collection_workflow_template.pl, where each stage of the workflow is preceded by a call to start profiling and followed by a call to stop profiling. For each stage, specify a stage tag that identifies the stage. If the profiling option is turned on, system-level data is gathered via Linux profiling tools such as 'sar' and 'iostat'. The collect_stats.ksh script automates data collection for the various profiling tools. Alternatively, modify your existing workflow script by adding calls to start and stop profiling for each stage via collect_stats.ksh, using the template as an example (see the sketch after these steps).
2) Add a dictionary to workflow_stats_parser/workflow_dictionaries.py that specifies the order of the various stages in your workflow (corresponding to the stage tags used in 1)).
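As a rough sketch of step 1), each stage in the workflow script is bracketed by a start call and a stop call to collect_stats.ksh. The option values below are taken from the collect_stats.ksh examples at the end of this document; the paths, names, and stage command are placeholders:

    # start sar data collection for this stage
    ./collect_stats.ksh --sar -td /foo/stats -n test -tag stage1 -l 5 -u 1 -s 600
    # ... run the stage1 command here ...
    # stop all data collection for this stage
    ./collect_stats.ksh --kill-all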
For sar data files, the full path name is restricted to 254 characters. To avoid exceeding this limit, keep the profiler's output directory path relatively short and use short but meaningful stage names in the dictionary above.
Archiving: output directories are timestamped, so the user can run the profiler multiple times and get automatic archiving of the resulting data.
The workflow profiler script can be used in two ways:
Post-processing only mode:
workflow_profiler.py: collects data for the workflow and post-processes the profiled data to generate CSVs and plots.
Usage: workflow_profiler.py workflow_script workflow_name sample_name no_of_threads input_directory output_directory [flags]
positional arguments [Need to be provided in the following order]:
optional arguments:
statistics: statistics options
Examples:
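For instance, a run over the sample inputs used elsewhere in this document might look like the following (the workflow script, sample name, and paths are illustrative placeholders; the flags above list the available options):

    ./workflow_profiler.py data_collection_workflow.pl sample simulated 16 /data/simulated/ /foo/test/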
The goal of the Workflow Stats Parser is to parse the raw data gathered using sar and iostat and generate charts which illustrate resource utilization for workflows. The parser supports post-processing both a single stage workflow as well as a multi-stage workflow.
This section describes the use of parser as a stand-alone tool.
The parser package assumes data is collected using collect_stats.ksh; its usage is described below.
Software Requirements
Software | Version | Test Version | Purpose |
---|---|---|---|
Python | 2.7 | 2.7 | post-processing |
gnuplot | >= 4.6 | 4.6.3 | plots for post-processing |
HOW-TOs
a. Usage and Argument/Options Description
Usage: workflow_stats_parser.py root [arguments]
where arguments = [-N workflow_name] [-i | -s | -A] [-h] [-S substring] [-o output_folder] [-p] [-w size] [-t tag] [-l level]
a.1 Positional Arguments
root path of directory containing workflow's profile data
a.2 Required Arguments
-N, --workflow_name workflow_name name of your workflow. Default is 'sample'.
Statistics:
a.3 Optional Arguments
-h, --help
show help message and exit
-S, --single_step stepSearchString
Process a single stage of a known workflow. Specify a substring that is present in the stage output directory name; this corresponds to the 2nd item in the two-tuple in the ordered dictionary for the workflow. See the 'How to Add a New Workflow' section below.
-o, --output outputDir
Specify path to the directory in which to save post-processed data and
plots. The parser creates the directory if it does not exist.
Default is post_processed_stats/ in current directory.
-p, --plot plot all data
-w, --sliding_window window
Window size in seconds to use for smoothing graphs.
Default is 100.
We recommend starting with the default; if the graphs are not
smooth, then reprocess with a lower or higher window until the
graphs look good.
-t, --tag tag
Supply an optional identifier for the plot files.
Defaults to the name of the root directory for the profile data.
-l, --log level
Set the log level.
Default level is 'info'.
See 'Output Logger' section below for details.
b. Usage Examples
We show several examples of running the parser. Sample output data that ships in the parser's directory is marked with an '*'.
b.1 Full workflow using sample data
Using sample_multistage_input as the input data, run:
./workflow_stats_parser.py sample_multistage_input -N sample -o testing/multistage -isp
*Output Sample Data: sample_multistage_output/
b.2 Single stage using sample data
Using sample_onestage_input as the input data, run:
./workflow_stats_parser.py sample_onestage_input/ -N sample -S stage1 -o testing/stage1 -isp
*Output Sample: sample_onestage_output/
To process one stage from sample_multistage_input, run:
./workflow_stats_parser.py sample_multistage_input -N sample -S stage2 -o testing/stage2 -isp
c. How to Add a New Workflow
c.1 Python Ordered Dictionaries in the Parser
For a new workflow, edit the workflow_dictionaries.py script by adding a new ordered dictionary that defines the order of the workflow steps and the search substrings that must exist in the directory names of each step's output.
We use ordered dictionaries to represent workflows because we need to
know the order of the pipeline steps when summarizing profiled data in our
plots, and we require a mapping of each step to the location of its
corresponding profiled data.
c.2 Structure of a Workflow Ordered Dictionary
The format for a two-tuple in a workflow ordered dictionary is:
('stepName', 'dirSearchSubstring')
where:
stepName - the name of the workflow step (stage name)
dirSearchSubstring - a substring, unique among all steps, that is
                     present in the step's output directory name.
We use the user-supplied 'root' argument and the dirSearchSubstring
to locate the correct data directory for each step.
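Conceptually (an illustrative sketch only, not the parser's actual code), the lookup amounts to scanning 'root' for a directory whose name contains the substring:

    import os

    def find_step_dir(root, dir_search_substring):
        # Return the first directory under 'root' whose name contains
        # 'dir_search_substring', or None if no directory matches.
        for name in sorted(os.listdir(root)):
            path = os.path.join(root, name)
            if os.path.isdir(path) and dir_search_substring in name:
                return path
        return None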
c.3 Ordered Dictionary Examples
The dictionary for the sample data set in this release is:
sample_dict = OrderedDict([('Stage1','stage1'),
('Stage2','stage2'),
('Stage3','stage3')])
The directory name for each step contains the substring specified
for the step (the second item in the two-tuple):
run.test..stage1.1u
run.test..stage2.1u
run.test..stage3.1u
Users can add their own dictionaries to the workflow_dictionaries.py file.
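For example, a hypothetical two-stage workflow whose stage output directories contain the substrings 'align' and 'sort' could be added as:

    from collections import OrderedDict

    # hypothetical user-defined workflow dictionary
    my_workflow_dict = OrderedDict([('Alignment', 'align'),
                                    ('Sorting',   'sort')])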
d. Output Logger
The logging level can be set to one of the levels listed below via the command-line option. Only messages at least as severe as the chosen level are printed.
Possible values (there are five Python-defined levels):
    info  - for routine events that might be of interest
    debug - for messages useful when debugging, such as dumping variables
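For example, to rerun the parser on the sample data with debug-level output (using the same options as the usage examples above):

    ./workflow_stats_parser.py sample_multistage_input -N sample -l debug -isp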
Additional information about Python's logging module can be found at: https://docs.python.org/2.6/library/logging.html
#######################################################################################
#######################################################################################
If you are interested in the usage model for the components themselves, they are described below:
The current template and support are for a Perl script.
Usage: data_collection_workflow_template.pl SampleName NumThreads InputDirectory OutputDirectory profiling [optional]: interval stats
Mandatory Options:
Optional:
Example: data_collection_workflow.pl simulated 16 /data/simulated/ /foo/test/ 1 30 "--sar --iostat"
Usage: collect_stats.ksh <--sar || --iostat || --kill-all>
Mandatory Options:
Note: the -l, -u, and -s options are not used when collecting sar data, although they are still required on the command line.
Examples:
1) Start sar data collection: "./collect_stats.ksh --sar -td /foo/stats -n test -tag stage -l 5 -u 1 -s 600"
2) Stop data collection (stops both sar and iostat): "collect_stats.ksh --kill-all" (uses the scripts under kill_scripts/)
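An iostat collection run is presumably started the same way, substituting --iostat for --sar (an illustrative assumption based on the usage line above, not verified here):

    ./collect_stats.ksh --iostat -td /foo/stats -n test -tag stage -l 5 -u 1 -s 600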