iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Add generated performance counters to the compiler and support for extracting them. #8015

Open benvanik opened 2 years ago

benvanik commented 2 years ago

The idea is to emit IR containing a metadata header followed by operations that manipulate buffers containing the counter data. Incrementing counters can use a fixed buffer while event counters can use a ringbuffer. We can partition this into two types of counters: dispatch counters that are manipulated within dispatch regions and host counters that we can touch from the host. Things like printf from dispatches could be implemented using this too.
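The event-counter ringbuffer might behave roughly like this hedged Python sketch (the record layout, names, and overflow policy are my invention for illustration, not from this issue):

```python
import struct

class EventRing:
    """Fixed-capacity ring of 16-byte event records (hypothetical layout:
    u32 counter id, u32 reserved, u64 value). On overflow the oldest
    records are overwritten; `head` keeps the total write count so a
    reader can tell how many records were dropped."""

    RECORD = struct.Struct("<IIQ")  # id, reserved, value

    def __init__(self, capacity):
        self.capacity = capacity
        self.head = 0  # monotonically increasing write index
        self.data = bytearray(capacity * self.RECORD.size)

    def emit(self, counter_id, value):
        slot = self.head % self.capacity
        self.RECORD.pack_into(self.data, slot * self.RECORD.size,
                              counter_id, 0, value)
        self.head += 1  # on device this would be an atomic add

    def snapshot(self):
        """Return valid records in oldest-to-newest order."""
        count = min(self.head, self.capacity)
        start = self.head - count
        return [self.RECORD.unpack_from(
                    self.data, ((start + i) % self.capacity) * self.RECORD.size)
                for i in range(count)]

ring = EventRing(4)
for i in range(6):  # 6 writes into 4 slots -> 2 oldest records dropped
    ring.emit(7, i)
print([v for (_, _, v) in ring.snapshot()])  # -> [2, 3, 4, 5]
```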

We could have counters declared early with something like util.counter @name : i32 and util.counter.increment @name that we then pack/distribute/etc lower down. This would allow frontends to insert their own counters or add them during flow/stream transformation. By the time we get to HAL we need all the counter buffers materialized and for anything we are passing into dispatch regions the interfaces have to be defined.

To get the data out at runtime we can export a @__query_counters function or something. This could spit out a !vm.buffer with the contents encoded in a binary form that we could dump to a file from the tool (iree-run-module/etc) and then convert to whatever we wanted (json/csv/etc) - we don't want to be performing string manipulation and conversion inside of dispatch regions or the user VM code.

This same approach can get us code coverage and such by inserting the counters at block boundaries - and some python that converts the binary output into a gcov file.

The big thing to figure out is the phase ordering between when we need to reserve resources for the counters (end of stream/beginning of hal) and when we may want to add additional counters (possibly at the end of hal). We could reserve space, keep the counter buffers dynamic until the very end, etc.

benvanik commented 2 years ago

For dispatch-side counters/events we'd need atomic operations - not sure if those are modeled all the way through MLIR yet. We could make such things a HAL op (hal.interface.counter.increment etc) such that we do the final lowering in the target backend.

benvanik commented 1 year ago

Doing some pre-work on this. The idea is to have a chunked file format designed for concatenation of rodata/rwdata blobs and mixing of sources such that we can do both host/device gathers of iovecs and compiler/runtime sources.

Each VM module can have an optional exported function like func.func @__query_instrumentation_data(%iovecs: !util.list<!util.buffer>) that appends zero or more byte buffers containing chunks or partial chunk payloads. The compiler will generate these functions when instrumentation is enabled and do the work to populate both host data and fetch data from devices as required. Runtime modules like the HAL or custom user modules can also define the function to include their own data.

The only tool change required is a helper that runs through all modules in a context and calls the function to get the buffers to fwrite to the final output file (or use for other transports), allowing for command line tool, embedded application usage, and python/etc binding queries.
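The gather-and-write helper described above could be as simple as this hedged Python sketch (the module shape and function names are illustrative stand-ins, not the actual VM API):

```python
import io

def gather_instrumentation_data(modules):
    """Hypothetical helper mirroring the described tool change: ask each
    module for its iovec-like byte buffers and collect them in order."""
    iovecs = []
    for module in modules:
        query = getattr(module, "__query_instrumentation_data", None)
        if query is not None:
            query(iovecs)  # module appends zero or more byte buffers
    return iovecs

def write_idb(stream, iovecs):
    # Plain concatenation: the format is designed so segments from
    # rodata, rwdata, and device readbacks stitch together directly.
    for buf in iovecs:
        stream.write(buf)

class FakeModule:
    # Stand-in for a VM module optionally exporting the query function.
    def __init__(self, blobs):
        # setattr avoids Python's name mangling for the dunder-ish name.
        setattr(self, "__query_instrumentation_data",
                lambda iovecs: iovecs.extend(blobs))

mods = [FakeModule([b"chunk-a"]), FakeModule([]), FakeModule([b"chunk-b"])]
out = io.BytesIO()
write_idb(out, gather_instrumentation_data(mods))
print(out.getvalue())  # -> b'chunk-achunk-b'
```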

File format:

// Copyright 2022 The IREE Authors
//
// Licensed under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

#ifndef IREE_SCHEMAS_IDB_H_
#define IREE_SCHEMAS_IDB_H_

#include <assert.h>  // static_assert (C11; a keyword in C23)
#include <stdint.h>

//===----------------------------------------------------------------------===//
// IREE Instrumentation Database File Format
//===----------------------------------------------------------------------===//
// Extensible chunked file format for runtime instrumentation storage and
// compiler feedback ingestion. Each file is composed of zero or more chunks
// with a prefix of iree_idb_chunk_header_t and varying length. Each chunk is of
// a particular well-known type or user-defined and the file is designed to be
// partially parsable even if certain chunks are unknown or have unsupported
// versions.
//
// The goal of the file format is to allow for cheap concatenation of read-only
// and read-write data that may exist in independent buffers. This allows for
// things like metadata and string-tables to be stored in .rdata of application
// binaries or vm.rodata in VMFB files while the dynamically defined data like
// instrument counters and logs can come from either host or device buffers
// stitched together via concatenation.
//
// The file format is semi-stateful but single-pass parsable with chunks such
// as string tables being used for all subsequent string table references until
// the next string table is defined. This allows for modules or host/device
// chunks to be concatenated even after the data structures have been defined.
// This allows for many stages of the compiler lowering to emit their own
// independent string tables and instrument sets without needing a global
// planning step which would otherwise be impossible when combining
// compiler-defined instrumentation emitted during compilation with
// runtime-defined instrumentation the compiler may not have visibility into.
//
// The primary representation of this format is binary so that values can be
// updated on both host and device at runtime without the need for string
// formatting and while being able to use atomic operations. Converters and
// filters can be written as CLI tools to extract specific information in
// textual form where required.
//
// Example file composition:
//  [file header]         <- from vm.rodata
//  [string table A]      <- from vm.rodata
//  [instrument table 0]  <- from vm.rodata
//    [instrument data 0] <- from vm.rwdata
//  [instrument table 1]  <- from vm.rodata
//    [instrument data 1] <- from device buffer readback
//  [string table B]      <- from .rdata in custom module executable
//  [instrument table 2]  <- from .rdata in custom module executable
//    [instrument data 2] <- from custom module state
//
// The `instrument` dialect in the compiler is designed to model instruments
// that eventually lower into this format. Modules produced by the compiler with
// instrumentation enabled will have an exported function that can be used to
// query a set of buffers from the module to concatenate into the final file.
// Custom modules provided at runtime can also export the same function to have
// their own buffers included in the final file. Since the format is designed
// with chunk-relative offsets each query function populates a set of iovec-like
// buffer segments that get stitched together with no padding.

//===----------------------------------------------------------------------===//
// iree_idb_chunk_t
//===----------------------------------------------------------------------===//

// Chunk magic identifier.
// "IREE Instrumentation Database"
// "IIDB" = 0x49 0x49 0x44 0x42
typedef uint32_t iree_idb_chunk_magic_t;

// Chunk type.
enum iree_idb_chunk_type_e {
  // iree_idb_file_header_t
  IREE_IDB_CHUNK_TYPE_FILE_HEADER = 0x0000u,
  // iree_idb_string_table_t
  IREE_IDB_CHUNK_TYPE_STRING_TABLE = 0x0001u,
  // iree_idb_instrument_table_t
  IREE_IDB_CHUNK_TYPE_INSTRUMENT_TABLE = 0x0010u,
  // iree_idb_user_data_t
  IREE_IDB_CHUNK_TYPE_USER_DATA = 0xFFFFu,
};
typedef uint16_t iree_idb_chunk_type_t;

// IDB chunk format version.
// Instruments and other embedded chunks may version themselves independently to
// prevent entire files from being invalidated on compiler bumps.
typedef uint16_t iree_idb_chunk_version_t;

// Header at the prefix of each chunk in the file.
// Always aligned to 16-bytes in the file such that the trailing chunk contents
// are 16-byte aligned.
typedef struct {
  // Magic header bytes; must be `IIDB`.
  iree_idb_chunk_magic_t magic;
  // Type of the chunk used to interpret the payload.
  iree_idb_chunk_type_t type;
  // Type-specific version identifier. Usually 0.
  iree_idb_chunk_version_t version;
  // Total byte length of the chunk content excluding this header.
  uint64_t content_length;
} iree_idb_chunk_header_t;
static_assert(sizeof(iree_idb_chunk_header_t) % 16 == 0,
              "chunk header must be 16-byte aligned");

//===----------------------------------------------------------------------===//
// iree_idb_file_header_t / IREE_IDB_CHUNK_TYPE_FILE_HEADER
//===----------------------------------------------------------------------===//

// Reserved file header.
typedef struct {
  // Chunk header with type = IREE_IDB_CHUNK_TYPE_FILE_HEADER.
  iree_idb_chunk_header_t header;
  // TODO(benvanik): compiler/runtime versioning.
  // TODO(benvanik): user-supplied build ID/etc.
  // TODO(benvanik): source model reference for correlating source maps.
} iree_idb_file_header_t;

//===----------------------------------------------------------------------===//
// iree_idb_string_table_t / IREE_IDB_CHUNK_TYPE_STRING_TABLE
//===----------------------------------------------------------------------===//

// String within the string table as an ordinal mapping into the view list.
// An ID of 0 indicates an empty string.
typedef uint32_t iree_idb_string_id_t;

// References a UTF-8 string character range in the string table payload.
typedef struct {
  // Offset of the string data within the string table payload.
  uint64_t offset;
  // Length in bytes (chars) of the string data excluding the NUL terminator.
  uint64_t length;
} iree_idb_string_view_t;

// String table providing character ranges within a data payload.
// Strings are encoded as UTF-8 and all have NUL terminators.
typedef struct {
  // Chunk header with type = IREE_IDB_CHUNK_TYPE_STRING_TABLE.
  iree_idb_chunk_header_t header;
  // (do not use, leave 0)
  uint32_t reserved;
  // Total number of strings in the table.
  uint32_t count;
  // String data ranges within the string table payload.
  iree_idb_string_view_t views[];
  // + data follows, view offsets relative to end of table struct
} iree_idb_string_table_t;

//===----------------------------------------------------------------------===//
// iree_idb_instrument_table_t / IREE_IDB_CHUNK_TYPE_INSTRUMENT_TABLE
//===----------------------------------------------------------------------===//

// Defines an instrument by specifying its metadata and payload data range.
typedef struct {
  // String table reference to an instrument type identifier.
  // Unknown identifiers are allowed to support user-defined instruments.
  // Examples: `iree.instrument.counter`, `iree.instrument.range`.
  iree_idb_string_id_t type_str;

  // String table reference to a compilation stage the instrument applies to.
  // When instruments are used for feedback the particular compiler passes that
  // use the instruments will limit their search to their matching stage.
  // Examples: `iree.flow`, `iree.stream`.
  iree_idb_string_id_t stage_str;

  // A compiler-defined hash of the IR within the isolated scope the instrument
  // is defined within such that any changes to the IR will invalidate the
  // instrument if used for feedback.
  iree_idb_string_id_t scope_hash_str;

  // Locally-unique identifier for the instrument within its parent scope.
  // This is used to disambiguate instruments within a scope when there are
  // multiple instruments of the same type.
  // Examples: `#54`, `some_useful_probe`.
  iree_idb_string_id_t scope_key_str;

  // Optional JSON metadata used by analysis tools or the compiler to interpret
  // the instrument.
  iree_idb_string_id_t metadata_str;

  // Relative offset of the instrument data payload within the subsequent
  // iree_idb_instrument_data_t contents.
  uint64_t payload_offset;

  // Total reserved length of the instrument data payload within the subsequent
  // iree_idb_instrument_data_t contents. Some instruments may have dynamic
  // valid data lengths in an instrument-specific format (such as ringbuffers).
  uint64_t payload_length;
} iree_idb_instrument_def_t;

// Instrument table defining a set of instruments and their payload data.
// Each instrument specifies a range of bytes in the table payload that contain
// the collected data. Depending on the instrument the payload length specified
// may be a maximum of the entire allocated buffer with instrument-specific
// fields specifying valid ranges within that range.
typedef struct {
  // Chunk header with type = IREE_IDB_CHUNK_TYPE_INSTRUMENT_TABLE.
  iree_idb_chunk_header_t header;
  // (do not use, leave 0)
  uint32_t reserved[3];
  // Total number of instrument definitions.
  uint32_t count;
  // Instrument definitions.
  iree_idb_instrument_def_t defs[];
  // + data follows, instrument offsets relative to end of table struct
} iree_idb_instrument_table_t;
static_assert(sizeof(iree_idb_instrument_table_t) % 16 == 0,
              "instrument table header must be 16-byte aligned");

//===----------------------------------------------------------------------===//
// iree_idb_user_data_t / IREE_IDB_CHUNK_TYPE_USER_DATA
//===----------------------------------------------------------------------===//

// Opaque user data chunk.
// The type and metadata strings can be used to define what the chunk contains.
typedef struct {
  // Chunk header with type = IREE_IDB_CHUNK_TYPE_USER_DATA.
  iree_idb_chunk_header_t header;
  // String table reference to the user-defined type.
  // This should be scoped/namespaced to avoid conflicts.
  // Examples: `my.user.data`.
  iree_idb_string_id_t type_str;
  // Optional metadata in a user-defined format used by analysis tools.
  iree_idb_string_id_t metadata_str;
  // (do not use, leave 0)
  uint64_t reserved;
  // + data follows
} iree_idb_user_data_t;
static_assert(sizeof(iree_idb_user_data_t) % 16 == 0,
              "user data header must be 16-byte aligned");

#endif  // IREE_SCHEMAS_IDB_H_
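To illustrate the single-pass, partially-parsable properties the header comments describe, here is a hedged Python reader/writer sketch. The padding convention (content_length excludes alignment padding; readers round up to the next 16-byte boundary) and the reading of "view offsets relative to end of table struct" as relative to the end of the views array are my assumptions and may not match the final format:

```python
import struct

# Mirrors iree_idb_chunk_header_t / iree_idb_string_view_t above.
CHUNK_HEADER = struct.Struct("<IHHQ")  # magic, type, version, content_length
STRING_VIEW = struct.Struct("<QQ")     # offset, length
IIDB_MAGIC = 0x42444949                # "IIDB" read as a little-endian u32

def make_chunk(ctype, payload, version=0):
    # Assumed alignment convention: content_length excludes padding and
    # each chunk is zero-padded so the next header lands on 16 bytes.
    chunk = CHUNK_HEADER.pack(IIDB_MAGIC, ctype, version, len(payload)) + payload
    return chunk + b"\x00" * (-len(chunk) % 16)

def walk_chunks(data):
    # Single-pass walk yielding every chunk; callers simply skip unknown
    # types, which is what makes the file partially parsable.
    offset = 0
    while offset + CHUNK_HEADER.size <= len(data):
        magic, ctype, version, length = CHUNK_HEADER.unpack_from(data, offset)
        assert magic == IIDB_MAGIC, "corrupt chunk header"
        start = offset + CHUNK_HEADER.size
        yield ctype, version, data[start:start + length]
        offset = start + (length + 15) // 16 * 16

def parse_string_table(body):
    # body: u32 reserved, u32 count, `count` views, then character data.
    _, count = struct.unpack_from("<II", body, 0)
    base = 8 + count * STRING_VIEW.size  # data assumed to follow the views
    strings = []
    for i in range(count):
        off, length = STRING_VIEW.unpack_from(body, 8 + i * STRING_VIEW.size)
        strings.append(body[base + off:base + off + length].decode("utf-8"))
    return strings

# Build a tiny file: one string table plus one unknown user chunk.
texts = [b"", b"iree.instrument.counter"]
data, views, off = b"", b"", 0
for s in texts:
    views += STRING_VIEW.pack(off, len(s))
    data += s + b"\x00"  # NUL-terminated per the format comment
    off += len(s) + 1
table = struct.pack("<II", 0, len(texts)) + views + data
blob = make_chunk(0x0001, table) + make_chunk(0x7777, b"mystery")

for ctype, _, body in walk_chunks(blob):
    if ctype == 0x0001:
        print(parse_string_table(body))  # -> ['', 'iree.instrument.counter']
    # unknown chunk types fall through without failing the parse
```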

In the compiler a new instrument dialect will be used to define instruments that are added at various compiler stages for either pure data acquisition or compiler feedback. Similar to LLVM's PGO data, these instruments are keyed on an IR hash and identifier that allow the compiler to consume an instrumentation data file and optimistically query values from it. There can be any number of instrument types that we support, and we can also allow users to add their own data for either their own feedback or analysis tools.

Some examples:

// Source IDs used to annotate IR and tag each op:
#instrument.source.id<"stage", "unique id? op index?">
#instrument.source.id<"stage", "unique id? op index?", "parent hash">
#stage_unique_id[_shorthash]?

// Logs a value with the given tag to a ringbuffer:
%1 = instrument.log.value<"foo", "hash", "id"> "hello", %0 : i32

// Emits the value at a specific program point to a ringbuffer:
instrument.probe.emit<...> %0 : i32
// SSA-friendly non-side-effecting probe:
%1 = instrument.probe.value<...> %0 : i32

// Add or set a counter value (side-effecting):
instrument.counter.add<"foo", "hash", "id"> %v : index
instrument.counter.set<"foo", "hash", "id"> %v : index
// SSA-friendly non-side-effecting counter:
%1 = instrument.counter.hit<"foo", "hash", "id"> %v : any

// SSA-friendly non-side-effecting value range:
%1 = instrument.range.value<"foo", "hash", "id"> %0 : i32

// Import instrument storage data from another source, such as a device staging buffer:
instrument.storage.import<"foo", "hash", "id"> %dev0 : !util.buffer

The SSA-friendly ops can be used for feedback. The compiler pass that inserts them would derive their hash/ID from IR structure such as the hash of the isolated ancestor op and the block/op ordinal (or whatever) as well as attaching the stage at which the value is expected to be used (input/flow/stream/hal/etc). When executed, the values will be recorded in the data file and saved off for later use. Subsequent compiler runs can then load the data files, and a pass running immediately after the instruments were added to the IR would walk each instrument point and query the data file to see if a usable value is present. Based on the instrument type the existing instrument op would be replaced with one or more new ops; for example:

%1, %hits = instrument.counter.hit<"foo", "hash", "id"> %v : index
%alot = arith.cmpi sge, %hits, %c100 : index
scf.if %alot {
  // specialize
} else {
  // generic
}

->

%1 = %v  // replace
%hits = arith.constant 192038 : index
%alot = arith.cmpi sge, %hits, %c100 : index
scf.if %alot {
  // specialize
} else {
  // generic
}

->

// specialize
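The lookup driving that replacement could be sketched in Python like this (field names are hypothetical; the 192038 value just echoes the constant in the example above). A miss, e.g. because the scope hash no longer matches the current IR, leaves the instrument op in place:

```python
# Hypothetical parsed instrument defs; in the real flow these would come
# from an instrument table chunk and its payload in the data file.
defs = [
    {"type": "iree.instrument.counter", "stage": "iree.stream",
     "scope_hash": "d41d8c", "scope_key": "#54", "value": 192038},
]

def query_counter(defs, stage, scope_hash, scope_key):
    """Optimistic lookup: return a recorded counter value only if stage,
    scope hash, and scope key all match; otherwise None, in which case
    the feedback pass keeps the instrument op as-is."""
    for d in defs:
        if (d["type"] == "iree.instrument.counter"
                and d["stage"] == stage
                and d["scope_hash"] == scope_hash
                and d["scope_key"] == scope_key):
            return d["value"]
    return None

print(query_counter(defs, "iree.stream", "d41d8c", "#54"))  # -> 192038
print(query_counter(defs, "iree.stream", "STALE!", "#54"))  # -> None (IR changed)
```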

Improved data-flow analysis will allow things like range instruments to provide min/avg/max bounds and probes can track potential value sets. User passes can do whatever they want with the instruments including embedding opaque data.

The instrument dialect can have different lowerings based on where the instruments are defined. The host side may prefer to have globals for each instrument facet and then scribble them into buffers during acquisition, while device executables will want to add new bindings to shared device buffers and perform atomic operations on them. This allows all instruments to work even when running concurrently, and to the various codegen backends they'll appear no different than normal user code (read/write bindings with atomic memref operations/etc). Since each instrument is tagged with the stage it originates in, any compiler stage can emit its own instruments without impacting other stages.

Work plan: