shamb0 commented 1 month ago

Ticket(s) Closed

Closes #57

This PR is part of a pair; please consider both for review and merge.

https://github.com/paradedb/paradedb/pull/1703 https://github.com/paradedb/pg_analytics/pull/131

What

This PR implements benchmarking functionality to analyze query performance under different caching conditions across various data sources supported by pg_analytics.

Why

To evaluate how different cache configurations impact query performance, ensuring that the system optimally handles various data sources and caching scenarios.

How

The test function follows a structured flow:

Parquet File Check: Verifies the existence of a Parquet file at a specified path; if absent, generates the file.
Data Loading: Loads the Parquet data into a DataFrame using a DataFusion session context.
S3 Setup: Configures an S3 bucket for storing partitioned data.
Data Partitioning: Partitions the data and uploads it to the S3 bucket.
Database Setup: Sets up PostgreSQL tables using the data from S3.
Cache Configuration: Configures caching options such as disk or memory cache.
Benchmark Execution: Executes benchmark iterations with different cache configurations using the criterion framework.
Benchmark Analysis: Analyzes the results using the default metrics from criterion.

The SQL command below is used to toggle Parquet metadata caching (In-memory):

SELECT duckdb_execute($$SET enable_object_cache={cache_setting}$$)

Where cache_setting can be either "true" or "false", depending on the test scenario.

Benchmarking

To run the benchmarking, use the following command:

cd ./cargo-paradedb
RUST_LOG=info cargo run -- paradedb pga-bench parquet-run-all

Integration Notes

The diagram below outlines key components and their interactions, providing a high-level overview of the prototype design:

%% Top-to-Bottom Layout
flowchart TB
    subgraph paradedb["paradedb"]
        direction TB
        package11["cargo-paradedb<br>(rs)<br>Common Benchmarking<br>Orchestrator"]
    end
    subgraph pg_analytics["pg_analytics"]
        direction TB
        package21["tests<br>(rs)<br>Integration Test<br>"]
        package22["tests<br>(rs)<br>fixtures<br>& Tables"]
    end
    subgraph Postgres["Postgres"]
        direction TB
        package31["pg_search<br>(rs)<br>Extension<br>"]
        package32["pg_analytics<br>(rs)<br>Extension<br>"]
    end
    package21 -->|Uses| package22
    package11 -->|Uses<br>As git submodule| package22
    package11 -->|Query| package31
    package22 -->|Query| package32

shamb0 commented 1 month ago

Hi @philippemnoel,

Haha, I definitely understand the challenge of reviewing large PRs! You're right—this PR is a combination of changes from multiple sources, which has made it more complex, breaking these down will significantly improve the review process.

Based on your suggestions, here's my proposed approach to restructure the work:

Component	Action Plan
Benchmarking code	Move to `cargo-paradedb`, use PG DB URL instead of direct Postgres dependency
New `pga_fixtures` crate	Create a separate PR with rationale for new test fixtures
Changes from PR #30	Split into a separate PR for easier review
Remaining changes (PR #115)	Keep in current PR, but streamline for focused review

This breakdown should make each PR more manageable and easier to review. I'll start implementing these changes right away.

Do you think this approach addresses your concerns?

philippemnoel commented 1 month ago

Hi @philippemnoel,

Haha, I definitely understand the challenge of reviewing large PRs! You're right—this PR is a combination of changes from multiple sources, which has made it more complex, breaking these down will significantly improve the review process.

Based on your suggestions, here's my proposed approach to restructure the work:

Component Action Plan Benchmarking code Move to cargo-paradedb, use PG DB URL instead of direct Postgres dependency New pga_fixtures crate Create a separate PR with rationale for new test fixtures Changes from PR #30 Split into a separate PR for easier review Remaining changes (PR #115) Keep in current PR, but streamline for focused review This breakdown should make each PR more manageable and easier to review. I'll start implementing these changes right away.

Do you think this approach addresses your concerns?

Sounds promising!

philippemnoel commented 1 month ago

Going to close this until a more scoped PR is raised.

paradedb / pg_analytics

test: Add query benchmarking suite for pg_analytics #131

Ticket(s) Closed

What

Why

How

Benchmarking

Integration Notes