PROJECT NOT UNDER ACTIVE MANAGEMENT
This project will no longer be maintained by Intel.
Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.
Intel no longer accepts patches to this project.
If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.
Contact: webadmin@linux.intel.com
HDK is a low-level execution library for data analytics processing.
HDK is used as a fast execution backend in Modin. The HDK library provides a set of components for federating analytic queries to an execution backend based on OmniSciDB. Currently, HDK targets OLAP-style queries expressed as relational algebra or SQL. The APIs required for Modin support are exposed in a library installed from this repository, pyhdk.
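For instance, Modin can be pointed at HDK through Modin's own configuration API (a minimal sketch; StorageFormat belongs to Modin, not pyhdk, and the exact configuration knobs are version-dependent):

import modin.config as cfg

# Select HDK as Modin's storage format before importing modin.pandas.
cfg.StorageFormat.put("hdk")

import modin.pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})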
We are committed to supporting a baseline set of functionality on all x86 CPUs, later-generation NVIDIA GPUs (supporting CUDA 11+), and Intel GPUs. The x86 backend uses LLVM ORCJIT for x86 machine code generation. The NVIDIA backend uses NVPTX extensions in LLVM to generate PTX, which is JIT-compiled by the CUDA runtime compiler. The Intel GPU backend leverages the LLVM SPIR-V translator to produce SPIR-V. Device code is generated using the Intel Graphics Compiler (IGC) via the oneAPI L0 driver.
Config controls library-wide properties and must be passed to Executor and DataMgr. Default config objects should suffice for most installations. Instantiate a config first as part of library setup.
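A minimal sketch, assuming the pyhdk.buildConfig helper used in the project's Python tests (keyword arguments override individual options):

import pyhdk

# Create a default, library-wide config to pass to Executor and DataMgr below.
config = pyhdk.buildConfig()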
ArrowStorage is currently the default (and only available) HDK storage layer. ArrowStorage provides storage support for Apache Arrow format data. The storage layer must be explicitly initialized:
import pyhdk
storage = pyhdk.storage.ArrowStorage(1)
The parameter passed to the ArrowStorage constructor is the database ID, which allows storage instances to be kept logically separate.
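For instance, two storage instances created with different database IDs hold their tables independently (a minimal sketch using the constructor shown above):

# Separate database IDs keep these storage instances logically distinct.
storage_db1 = pyhdk.storage.ArrowStorage(1)
storage_db2 = pyhdk.storage.ArrowStorage(2)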
ArrowStorage automatically converts Arrow format datatypes to omniscidb datatypes. Some variable-length types are not yet supported, but scalar types are available. pyarrow can be used to convert Pandas DataFrames to Arrow:
import pandas
import pyarrow

at = pyarrow.Table.from_pandas(
    pandas.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
)
The Arrow table can then be imported using the Arrow storage interface:
# Import the Arrow table under the name "test" with the given table options.
opt = pyhdk.storage.TableOptions(2)
storage.importArrowTable(at, "test", opt)
The Data Manager controls the storage and in-memory buffer pools for all queries. Storage engines must be registered with the data manager:
# The config created above must be passed to the DataMgr.
data_mgr = pyhdk.storage.DataMgr(config)
data_mgr.registerDataProvider(storage)
Three high-level components are required to execute a query: Calcite, which parses SQL into relational algebra; an Executor, which holds a reference to the Data Manager for buffer and storage access; and a RelAlgExecutor, which drives execution of the relational algebra tree. The complete flow is as follows:
calcite = pyhdk.sql.Calcite(storage)
executor = pyhdk.Executor(data_mgr, config)

# Parse the SQL query into a relational algebra tree using the schema in storage.
ra = calcite.process("SELECT * FROM test;")

# Execute the relational algebra tree against the registered storage.
rel_alg_executor = pyhdk.sql.RelAlgExecutor(
    executor, storage, data_mgr, ra
)
res = rel_alg_executor.execute()
Calcite reads the schema information from storage, and the Executor stores a reference to the Data Manager for buffer/storage access during a query. RelAlgExecutor.execute() returns a ResultSet object, which can be converted to Arrow and then to pandas:
df = res.to_arrow().to_pandas()
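For the table imported above, the round trip returns the original frame (a sketch; row order is assumed to follow storage order):

print(df)
#    a   b
# 0  1  10
# 1  2  20
# 2  3  30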
Standalone examples are available in the examples directory. Most examples run via Jupyter notebooks.
A Miniconda installation is required (Anaconda may produce build issues); use one of the official Miniconda installers.
Conda environments are used for HDK development. Use the YAML file in omniscidb/scripts/:
conda env create -f omniscidb/scripts/mapd-deps-conda-dev-env.yml
conda activate omnisci-dev
If using a Conda environment, run the following to build and install HDK:
mkdir build && cd build
cmake ..
make -j
make install
GPU support is disabled by default. To verify the installation, check that python -c 'import pyhdk' executes without an error.
To enable Intel GPU (Level Zero) support, install extra dependencies into the existing environment:
conda install -c conda-forge level-zero-devel pkg-config
mkdir build && cd build
cmake -DENABLE_L0=on ..
make -j
make install
To enable CUDA support, install extra dependencies into an existing environment or a new one:
conda install -c conda-forge cudatoolkit-dev arrow-cpp-proc=3.0.0=cuda arrow-cpp=11.0=*cuda
mkdir build && cd build
cmake -DENABLE_CUDA=on ..
make -j
make install
If you encounter issues during the build, refer to .github/workflows/build.yml, which describes the compilation steps used for the CI build. If you are still facing issues, please create a GitHub issue.
Python tests can be run from the repository root using pytest:

# Run the top-level Python test modules:
pytest python/tests/*.py
# Run the Modin integration tests:
pytest python/tests/modin
# Run the full Python test suite:
pytest python/tests/
To run the Modin tests, install Modin from source into the conda environment. Clone Modin, then run:

cd modin && pip install -e .
To enable logging, call the following in the setup_class(..) body:

pyhdk.initLogger(debug_logs=True)
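For example, in a pytest suite (TestQueries is a hypothetical class name; only the initLogger call comes from pyhdk):

import pyhdk

class TestQueries:
    @classmethod
    def setup_class(cls):
        # Enable pyhdk debug logging for every test in this class.
        pyhdk.initLogger(debug_logs=True)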
Logs are located in the hdk_log/ folder by default.