saschamcdonald / ch_06_kuzudb_tests

MIT License

NOTE: This repository has been archived as it addresses an issue that has since been resolved by the KuzuDB team.

Additionally, a minor adjustment is needed to prevent the creation of duplicate edges in the synthetic data. To avoid impacting the progress of KuzuDB and its team, the repository is archived, though it may still provide useful insights. Feel free to explore and use it, keeping in mind that the repo is now in an archived state.

This repository facilitates test data generation and bulk loading into KuzuDB, enabling users to compare different database versions and diagnose errors, such as a known bulk loading issue between two KuzuDB versions. The test data includes dynamically generated Company Nodes, Person Nodes, and WorksAt Relationships based on user configurations.

Project Overview

This repository is designed to facilitate the creation and loading of test data into KuzuDB, highlighting the flexibility and ease of managing data across different database versions. It offers a comprehensive setup for generating test data and a simplistic model tailored for bulk loading into various versions of KuzuDB within isolated virtual environments. Additionally, this setup aims to replicate and analyze a known bulk loading error that occurs between two specified versions of KuzuDB, providing insights and potential workarounds for such issues.

Through detailed logging and observation, users can compare the behavior of different KuzuDB versions under identical conditions, shedding light on the error's nature and its impact on bulk loading. The repository serves as a tool for troubleshooting, optimization, and checking the reliability of database operations when bulk loading a simple property graph. The test data consists of Company nodes, Person nodes, and WorksAt relationships, created dynamically based on user configuration. Keep the defaults in the .env file to replicate the working and failing versions for this specific issue.


Test Data Creation and Loading

Note: the defaults in the .env file replicate the error, but you can configure them for your own requirements.

The project includes scripts for generating test data, which can then be loaded into KuzuDB. This process is automated and configurable, allowing for a seamless transition between different database versions or testing scenarios. The objective is to create a robust and repeatable process for data handling, which is crucial for testing database performance and error diagnostics.
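As a rough illustration of the generation step, the sketch below writes Company, Person, and WorksAt CSV files with a configurable number of STRING property columns. It is not the repository's own generator: the file names, the `prop_N` column names, and every function name here are hypothetical. It also samples distinct (person, company) pairs, which is one possible way to avoid the duplicate edges mentioned in the archive note.

```python
import csv
import random
from pathlib import Path


def property_names(num_dynamic_columns):
    """Hypothetical naming scheme for the dynamic STRING columns."""
    return [f"prop_{i}" for i in range(1, num_dynamic_columns + 1)]


def write_nodes(path, label, count, num_dynamic_columns):
    """Write `count` node rows: id, name, then the dynamic STRING columns."""
    names = property_names(num_dynamic_columns)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "name"] + names)
        for node_id in range(count):
            props = [f"{label}_{node_id}_{name}" for name in names]
            writer.writerow([node_id, f"{label}_{node_id}"] + props)


def unique_edges(num_persons, num_companies, num_relationships, seed=0):
    """Sample distinct (person_id, company_id) pairs so no edge repeats."""
    if num_relationships > num_persons * num_companies:
        raise ValueError("more relationships requested than distinct pairs exist")
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < num_relationships:
        pairs.add((rng.randrange(num_persons), rng.randrange(num_companies)))
    return sorted(pairs)


def generate(out_dir, num_companies, num_persons, num_relationships,
             num_dynamic_columns):
    """Write the three CSV files and return their paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {name: out / f"{name}.csv" for name in ("company", "person", "works_at")}
    write_nodes(paths["company"], "company", num_companies, num_dynamic_columns)
    write_nodes(paths["person"], "person", num_persons, num_dynamic_columns)
    names = property_names(num_dynamic_columns)
    with open(paths["works_at"], "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["from", "to"] + names)
        for person_id, company_id in unique_edges(
                num_persons, num_companies, num_relationships):
            props = [f"rel_{person_id}_{company_id}_{name}" for name in names]
            writer.writerow([person_id, company_id] + props)
    return paths
```

The set-based sampling keeps every pair in memory, which is fine for small runs but would be heavy at the default of 45,000,000 relationships; a generator for that scale would need a different deduplication strategy.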

Configuration via .env file

Note: the defaults in the .env file reproduce the error with kuzu version 0.2.0 and run successfully with kuzu version 0.0.11, but you can configure them for your own requirements.

To offer maximum flexibility and ease of use, this project uses a .env file located in the src directory for all configuration settings. Users can specify various parameters, including paths, database connection details, and version-specific settings, without altering the core scripts. This keeps the environment clean and makes changes easy to manage and replicate.

.env Settings Explained

The .env file allows you to customize the behavior of the test data generation and loading process. Below are the settings you can configure:

# Base path for test data storage
TEST_DATA_PATH=./data
# User-defined settings for the number of records
NUM_COMPANIES=4000000 # Total number of companies
NUM_PERSONS=10000000 # Total number of persons
NUM_RELATIONSHIPS=45000000 # Total number of relationships

# Settings for the number of dynamic property columns (STRING type by default)
NUM_DYNAMIC_COMPANY_COLUMNS=5 # Number of dynamic STRING property columns for companies
NUM_DYNAMIC_PERSON_COLUMNS=5 # Number of dynamic STRING property columns for persons
NUM_DYNAMIC_RELATIONSHIP_COLUMNS=5 # Number of dynamic STRING property columns for relationships
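To show how settings like these might be consumed, here is a minimal sketch that parses a .env file with the standard library only and falls back to the defaults listed above. The function names and the `src/.env` default path are assumptions for illustration; the repository's scripts may read the file differently (for example with a dotenv library).

```python
from pathlib import Path

# Defaults copied from the settings listed above.
DEFAULTS = {
    "TEST_DATA_PATH": "./data",
    "NUM_COMPANIES": "4000000",
    "NUM_PERSONS": "10000000",
    "NUM_RELATIONSHIPS": "45000000",
    "NUM_DYNAMIC_COMPANY_COLUMNS": "5",
    "NUM_DYNAMIC_PERSON_COLUMNS": "5",
    "NUM_DYNAMIC_RELATIONSHIP_COLUMNS": "5",
}


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blanks, full-line and inline '#' comments.

    Simplification: everything after the first '#' is dropped, so values
    that contain '#' are not supported by this sketch.
    """
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def load_settings(env_path="src/.env"):
    """Merge the file's values over DEFAULTS; NUM_* values become ints."""
    merged = dict(DEFAULTS)
    path = Path(env_path)
    if path.exists():
        merged.update(parse_env_text(path.read_text()))
    return {
        key: int(value) if key.startswith("NUM_") else value
        for key, value in merged.items()
    }
```

Calling `load_settings()` with no file present simply returns the defaults, so a smaller test run only needs the `NUM_*` values it wants to override.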

Getting Started

This section covers setting up your Python environment, managing virtual environments, and handling project dependencies.

Prerequisites

Before you start, you need to have pyenv installed on your machine. For a detailed guide on installing pyenv and setting up Python environments, refer to the Fathomtech blog post on Python environments with pyenv and virtualenv.

Installation and Setup

Installing Python

Start by installing the required Python version using pyenv:

pyenv install 3.11.6

Install pyenv-virtualenv

After installing the desired Python version, you can create a virtual environment to isolate your project dependencies. Follow these steps if you haven't already installed the pyenv-virtualenv plugin:

  1. Install the pyenv-virtualenv Plugin:

    Clone the plugin into the pyenv plugins directory:

    git clone https://github.com/pyenv/pyenv-virtualenv.git $(pyenv root)/plugins/pyenv-virtualenv
  2. Configure Your Shell:

    Add pyenv virtualenv-init to your shell's startup script:

    echo 'eval "$(pyenv virtualenv-init -)"' >> ~/.bashrc

    Restart your shell or source your profile to apply the changes:

    source ~/.bashrc

NOTE: As a shortcut, rather than running each version's test manually, you can run this script, configuring the versions to test within it, to automate the tests for all versions:

Create and Activate Virtual Environment and Run the First Test for Kuzu Version 0.1.1:

NOTE: This version successfully loads all the data (PASSES using defaults provided in this repository).

Create and Activate Virtual Environment and Run the Second Test for Kuzu Version 0.2.0:

NOTE: This version successfully loads the Company and Person tables but errors when loading the relationship data (FAILS using defaults provided in this repository). The logged error is:

    ERROR - Failed to import data into WorksAt Node Table. Error details: Buffer manager exception: Failed to claim a frame.
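For orientation, the load sequence that both tests exercise can be sketched roughly as follows: create the two node tables and the relationship table, then bulk load each CSV with `COPY ... FROM`. This is a sketch under assumptions, not the repository's loader: the column names, CSV file names, and function names are hypothetical. Only `kuzu.Database`, `kuzu.Connection`, and `Connection.execute` are taken from the kuzu Python package, which is imported lazily so the statement builders work without it installed.

```python
def schema_statements(num_dynamic_columns=5):
    """Build CREATE statements; prop_N column names are hypothetical."""
    props = "".join(
        f", prop_{i} STRING" for i in range(1, num_dynamic_columns + 1)
    )
    return [
        f"CREATE NODE TABLE Company(id INT64, name STRING{props}, PRIMARY KEY (id))",
        f"CREATE NODE TABLE Person(id INT64, name STRING{props}, PRIMARY KEY (id))",
        f"CREATE REL TABLE WorksAt(FROM Person TO Company{props})",
    ]


def copy_statements(data_dir):
    """Build COPY statements; the CSV file names are hypothetical."""
    return [
        f'COPY Company FROM "{data_dir}/company.csv" (HEADER=true)',
        f'COPY Person FROM "{data_dir}/person.csv" (HEADER=true)',
        f'COPY WorksAt FROM "{data_dir}/works_at.csv" (HEADER=true)',
    ]


def bulk_load(db_path, data_dir, num_dynamic_columns=5):
    """Run schema creation and bulk loading, logging each statement."""
    import kuzu  # third-party package; must match the version under test

    database = kuzu.Database(db_path)
    connection = kuzu.Connection(database)
    statements = schema_statements(num_dynamic_columns) + copy_statements(data_dir)
    for statement in statements:
        try:
            connection.execute(statement)
        except Exception as error:
            # With the defaults, the failure described above would be
            # expected to surface on the WorksAt COPY statement.
            print(f"FAILED: {statement}\n  {error}")
            raise
        print(f"OK: {statement}")
```

Logging each statement separately makes it easy to see that the node tables load before the relationship load fails, which matches the behavior described in the note above.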

Cleaning Up

After testing with a virtual environment, deactivate it and remove it.
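As a hedged sketch of the whole create, test, and clean-up cycle, the code below plans and runs one isolated environment per Kuzu version. It swaps in Python's standard `venv` module in place of pyenv-virtualenv, and the test script name `run_test.py` and the `.venvs` directory are placeholders, not names taken from this repository.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def version_plan(kuzu_version, test_script="run_test.py", env_root=".venvs"):
    """Return the environment directory and the commands to run in order."""
    env_dir = Path(env_root) / f"kuzu-{kuzu_version}"
    bin_dir = "Scripts" if sys.platform == "win32" else "bin"
    env_python = str(env_dir / bin_dir / "python")
    commands = [
        [sys.executable, "-m", "venv", str(env_dir)],            # create env
        [env_python, "-m", "pip", "install", f"kuzu=={kuzu_version}"],
        [env_python, test_script],                               # run the test
    ]
    return env_dir, commands


def run_version(kuzu_version, **plan_options):
    """Run the plan for one version; always remove the environment after."""
    env_dir, commands = version_plan(kuzu_version, **plan_options)
    try:
        for command in commands:
            subprocess.run(command, check=True)
        return True
    except subprocess.CalledProcessError:
        return False
    finally:
        shutil.rmtree(env_dir, ignore_errors=True)


# Example usage (not executed here):
#   for version in ["0.1.1", "0.2.0"]:
#       print(version, "PASS" if run_version(version) else "FAIL")
```

Removing the environment in a `finally` block means a failed load still leaves the machine clean, which is useful when the failing version is the case under test.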


Tools

Auto Install CLI

The get_cli.sh script is designed for automating the download, decompression, and setup processes of command-line interface (CLI) tools. It enables the use of custom download URLs, efficiently manages version extraction, and organizes files into a neatly structured directory named after the tool and its version. For streamlined operations, it includes a non-interactive mode with default configurations requiring no user input.

For detailed information on usage, customization, and features, see the README for the get_cli.sh script.


License

This project is licensed under the MIT License.

Copyright (c) 2024 Datacue Limited

Author: Sascha McDonald

For more details, see the LICENSE file included in this repository.