janhq / cortex.cpp

Local AI API Platform
https://cortex.so
Apache License 2.0
2.13k stars 125 forks

epic: Cortex Updater can migrate data structure changes #1184

Open dan-homebrew opened 2 months ago

dan-homebrew commented 2 months ago

Goal

Scope

Discussion

Success Criteria:

Tasklist

vansangpfiev commented 1 week ago

Some thoughts on data migration:

  1. Database Migration Strategy
    • Versioning: create a schema_version table within the SQLite database to track the current schema version. This table holds a single row containing the version number:

      CREATE TABLE IF NOT EXISTS schema_version (
          version INTEGER NOT NULL
      );

    • Migration Scripts:
      • Create SQL scripts for each migration, keyed by version. These scripts can be executed after running cortex update. Not sure if we can run SQL scripts via a postscript cc: @hiento09
      • Pros: writing SQL scripts directly is straightforward and gives us full control over the SQL commands executed during migrations.
      • Cons: we must manually organize, track, and apply the scripts, which can become cumbersome as the number of migrations grows.
    • Migration Framework:
      • Develop a C++ component that reads the current schema version from the database and applies the necessary migration scripts to reach the target version.
      • Pros: provides a consistent approach to managing migrations across different environments.
      • Cons: harder to implement.
    • Data Preservation Techniques:
      • Data Migration: before dropping any columns or tables, migrate the existing data to a safe location (e.g., a backup table or external storage) so it can be restored if needed.
      • Column Preservation: instead of dropping columns, consider marking them as deprecated or renaming them.

Question: should we use migration scripts or implement a C++ database migration component?

  2. Cortex data structure: similar to database migration, we can use a script to rearrange the data into the new structure. Note: we store the yml file path in the database, so if the model structure changes, we also need to update the models' database.

What do you think? @dan-homebrew @louis-jan @janhq/cortex

dan-homebrew commented 1 week ago

@vansangpfiev Overall, I think we should do what allows us to ship something quickly this Friday.

Up/Down Migrations

/migrations
    1.0.2.cpp
    1.0.3.cpp
# 1.0.2.cpp
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

void up() {
    // Run this version's SQL migration script against the SQLite database.
    // (Paths are placeholders; the original sketch used psql, but the
    // database discussed in this thread is SQLite.)
    std::string sqlScript = "migrations/1.0.2.sql";
    std::string runSqlCmd = "sqlite3 cortex.db < " + sqlScript;
    int runSqlResult = system(runSqlCmd.c_str());

    if (runSqlResult == 0) {
        std::cout << "SQL script executed successfully." << std::endl;
    } else {
        std::cerr << "Error: Failed to execute the SQL script." << std::endl;
    }

    // Define the directories
    std::string rootDir = "/";
    std::string targetDir = "/path/to/target/directory";

    // Define the file to be moved
    std::string fileToMove = rootDir + ".cortexrc";

    // Check if the file exists
    std::ifstream file(fileToMove);
    if (!file) {
        std::cerr << "Error: The file .cortexrc does not exist in the root directory." << std::endl;
        return;
    }

    // Move the file to the target directory
    std::string moveCmd = "mv " + fileToMove + " " + targetDir;
    int moveResult = system(moveCmd.c_str());

    if (moveResult == 0) {
        std::cout << "The .cortexrc file has been moved to the target directory." << std::endl;
    } else {
        std::cerr << "Error: Failed to move the .cortexrc file." << std::endl;
    }
}

void down() {
    ...
}

Please push back on this @janhq/cortex - as you can tell from the Ruby on Rails (RoR) reference, my approach may be very outdated.

Database migration strategy

  1. I think SQL migration scripts make the most sense
  2. Data Preservation: I think marking columns as deprecated is ok (for now)

Cortex Data Structure

  1. See up/down methods above

louis-jan commented 1 week ago

We likely need to design for both local running and cloud deployment. Constraining migrations to application code would block running them from the deployment layer, which would then have to wait for an instance to spin up and pass a health check, which is inefficient. There could also be a scenario where we want to share the database instance across services like cortex.py and cortex.cpp.

However, one advantage is that they can embed cortex.cpp as a standalone binary without having to worry about other services, which is particularly useful when embedding into Jan.

I’d prefer built-in migrations as suggested above for cortex.cpp to ensure embedding in Jan won’t cause issues.

vansangpfiev commented 1 week ago

We will use SQL scripts for up/down migrations and try to minimize changes between versions to ensure smooth transitions. For complex down migrations, we will create a PR to address any issues.

Structure

cortexcpp
    |__ models/
    |__ engines/
    |__ migrations/
           |__ v1/
           |    |__ v1.sql
           |    |__ data_structure_v1.sh
           |__ v2/
                |__ v2.sql
                |__ data_structure_v2.sh

Question: How do we ship the migration scripts?

cc: @dan-homebrew @hiento09 @janhq/cortex

dan-homebrew commented 1 week ago

@vansangpfiev Our discussion:

cortexcpp
    |__ models/
    |__ engines/
    |__ migrations/
           |__ v1.0.2.cpp
           |__ v1.0.3.cpp

louis-jan commented 1 week ago

Oops. This wouldn’t work for nightly migration support or switching between commits.

dan-homebrew commented 1 week ago

Hmm, can we solve this via running migrations manually for Nightly?

Nightly

Switching between commits

vansangpfiev commented 5 days ago

@dan-homebrew IMO, it is necessary to distinguish between the data version and the software version. Since all migration actions occur within the software, the application may not know its current version during modifications (stable, beta, nightly, etc.).

To ensure clarity and consistency, we can increment the data version each time a change is made. In the event of a conflict, we will revert to the previous version of the data. This separation allows for more effective management of both software and data changes, enhancing stability and reliability throughout the migration process.

luke-nguyen990 commented 5 days ago

> @dan-homebrew IMO, it is necessary to distinguish between the data version and the software version. Since all migration actions occur within the software, the application may not know its current version during modifications (stable, beta, nightly, etc.).
>
> To ensure clarity and consistency, we can increment the data version each time a change is made. In the event of a conflict, we will revert to the previous version of the data. This separation allows for more effective management of both software and data changes, enhancing stability and reliability throughout the migration process.

TL;DR: if we roll back to a specific commit abc123, we must know exactly, and deterministically, what the data schema looks like.