pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

GPU accelerated Polars taking 4 times longer to SUM a column in 100m record CSV than regular CPU. Running in Jupyter Notebook #18832

Open taupirho opened 1 month ago

taupirho commented 1 month ago

Reproducible example


import polars as pl

df = pl.scan_csv("/mnt/d/sales_data/sales_data_100m.csv", has_header=True)

%%time

result1 = df.select(pl.col("total").sum()).collect()

print(result1)

shape: (1, 1)
┌───────────┐
│ total     │
│ ---       │
│ f64       │
╞═══════════╡
│ 2.4836e11 │
└───────────┘
CPU times: user 4.69 s, sys: 14.8 s, total: 19.5 s
Wall time: 31.2 s

%%time

result2 = df.select(pl.col("total").sum()).collect(engine="gpu")

print(result2)

shape: (1, 1)
┌───────────┐
│ total     │
│ ---       │
│ f64       │
╞═══════════╡
│ 2.4836e11 │
└───────────┘
CPU times: user 52.2 s, sys: 2.7 s, total: 54.9 s
Wall time: 2min 45s

Log output

No response

Issue description

I created a 100-million-record CSV data set. The schema looks like this:

Schema([('order_id', Int64),
        ('order_date', Date),
        ('customer_id', Int64),
        ('customer_name', String),
        ('product_id', Int64),
        ('product_name', String),
        ('category', String),
        ('quantity', Int64),
        ('price', Float64),
        ('total', Float64)])
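
A schema repr in this form can be reproduced lazily without reading the whole file; a minimal sketch, assuming the same path as in the example above:

```python
import polars as pl

# Schema inference only looks at the header and a sample of rows,
# so this does not scan the full 100M-record file.
schema = pl.scan_csv("/mnt/d/sales_data/sales_data_100m.csv").collect_schema()
print(schema)
```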

First few records,

$ sed 10q /mnt/d/sales_data/sales_data_100m.csv

order_id,order_date,customer_id,customer_name,product_id,product_name,category,quantity,price,total
0,2023-04-23,696,Customer_10153,203,Chair,Office,4,880.96,3523.85
1,2023-12-23,302,Customer_3258,209,Coffee Maker,Electronics,6,74.51,447.06
2,2023-02-03,731,Customer_8625,205,Printer,Electronics,4,343.72,1374.89
3,2023-06-04,305,Customer_861,204,Monitor,Electronics,1,957.56,957.56
4,2023-08-26,460,Customer_12402,205,Printer,Electronics,1,46.43,46.43
5,2023-08-20,510,Customer_17197,208,Notebook,Stationery,7,501.68,3511.73
6,2023-07-06,295,Customer_4097,206,Paper,Stationery,7,527.41,3691.88
7,2023-06-14,403,Customer_18866,205,Printer,Electronics,7,947.66,6633.65
8,2023-01-30,230,Customer_1395,206,Paper,Stationery,6,671.9,4031.41

Here is the Visual Studio C program I used to produce the CSV file

#define _CRT_SECURE_NO_WARNINGS // Disable deprecation warnings for Visual Studio

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

// Function to generate a random date within the given range
void generate_random_date(char* buffer, int buffer_size) {
    struct tm start_date = { 0 };
    struct tm end_date = { 0 };

    // Set the start and end dates manually
    start_date.tm_year = 2023 - 1900;
    start_date.tm_mon = 0;
    start_date.tm_mday = 1;

    end_date.tm_year = 2023 - 1900;
    end_date.tm_mon = 11;
    end_date.tm_mday = 31;

    time_t start_time = mktime(&start_date);
    time_t end_time = mktime(&end_date);

    // Calculate the difference in seconds
    double diff_in_seconds = difftime(end_time, start_time);

    // Generate a random number of seconds to add to the start time
    double random_seconds = ((double)rand() / RAND_MAX) * diff_in_seconds;
    time_t random_time = start_time + (time_t)random_seconds;

    struct tm random_date;
    localtime_s(&random_date, &random_time); // Use localtime_s for security

    strftime(buffer, buffer_size, "%Y-%m-%d", &random_date);
}

// Function to generate a random float between min and max
float generate_random_float(float min, float max) {
    return min + ((float)rand() / RAND_MAX) * (max - min);
}

int main() {
    // Seed the random number generator
    srand((unsigned)time(NULL));

    // Number of records
    int num_records = 100000000;

    // Open the CSV file for writing
    FILE* file;
    fopen_s(&file, "d:/sales_data/sales_data_100m.csv", "w"); // Use fopen_s for security
    if (file == NULL) {
        fprintf(stderr, "Error opening file for writing\n");
        return 1;
    }

    // Write the header
    fprintf(file, "order_id,order_date,customer_id,customer_name,product_id,product_name,category,quantity,price,total\n");

    // Product names and categories
    const char* product_names[] = { "Laptop", "Smartphone", "Desk", "Chair", "Monitor", "Printer",
                                    "Paper", "Pen", "Notebook", "Coffee Maker", "Cabinet", "Plastic Cups" };
    const char* categories[] = { "Electronics", "Electronics", "Office", "Office", "Electronics", "Electronics",
                                 "Stationery", "Stationery", "Stationery", "Electronics", "Office", "Sundry" };
    // Generate fake data
    for (int i = 0; i < num_records; i++) {
        char order_date[11];
        generate_random_date(order_date, sizeof(order_date));

        int customer_id = rand() % 900 + 100; // Random customer ID between 100 and 999
        int product_index = rand() % 10; // Random index into the first 10 of the 12 products
        int quantity = rand() % 10 + 1; // Random quantity between 1 and 10
        float price = generate_random_float(1.99, 999.99); // Random price between 1.99 and 999.99
        float total = price * quantity;

        // Generate a random customer name
        char customer_name[50];
        snprintf(customer_name, sizeof(customer_name), "Customer_%d", rand());

        // Write the record to the CSV file
        fprintf(file, "%d,%s,%d,%s,%d,%s,%s,%d,%.2f,%.2f\n",
            i,
            order_date,
            customer_id,
            customer_name,
            product_index + 200, // Product ID between 200 and 209
            product_names[product_index],
            categories[product_index],
            quantity,
            price,
            total);
    }

    // Close the CSV file
    fclose(file);

    printf("CSV file with fake sales data has been created.\n");

    return 0;
}

Expected behavior

Expected the GPU code to be faster than, or about the same as, the CPU code.
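
One thing worth checking alongside the timings (a sketch, not something verified in this report): the GPU engine can fall back to the CPU engine for queries it does not support, so it is useful to confirm that the query really ran on the GPU by asking the engine to raise instead of falling back:

```python
import polars as pl

df = pl.scan_csv("/mnt/d/sales_data/sales_data_100m.csv", has_header=True)
q = df.select(pl.col("total").sum())

# raise_on_fail=True turns a silent CPU fallback into an exception,
# so a successful collect confirms the GPU engine handled the query.
result = q.collect(engine=pl.GPUEngine(device=0, raise_on_fail=True))
print(result)
```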

Installed versions

```
--------Version info---------
Polars:              1.7.1
Index type:          UInt32
Platform:            Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:              3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager
altair               5.3.0
cloudpickle          3.0.0
connectorx
deltalake
fastexcel
fsspec               2024.2.0
gevent
great_tables
matplotlib           3.9.1
nest_asyncio         1.5.8
numpy                1.24.4
openpyxl             3.1.2
pandas               2.2.2
pyarrow              16.1.0
pydantic             2.6.1
pyiceberg
sqlalchemy           2.0.25
torch                2.3.1+cu121
xlsx2csv
xlsxwriter           3.1.9
```
wence- commented 1 month ago

Hmm, I wrote a python script to generate data like so:

import polars as pl
import numpy as np

def generate(nrows: int, filename: str):
    names = np.asarray(
        [
            "Laptop",
            "Smartphone",
            "Desk",
            "Chair",
            "Monitor",
            "Printer",
            "Paper",
            "Pen",
            "Notebook",
            "Coffee Maker",
            "Cabinet",
            "Plastic Cups",
        ]
    )
    categories = np.asarray(
        [
            "Electronics",
            "Electronics",
            "Office",
            "Office",
            "Electronics",
            "Electronics",
            "Stationery",
            "Stationery",
            "Stationery",
            "Electronics",
            "Office",
            "Sundry",
        ]
    )

    product_id = np.random.randint(len(names), size=nrows)
    quantity = np.random.randint(1, 11, size=nrows)
    price = np.random.randint(199, 10000, size=nrows) / 100
    columns = {
        "order_id": np.arange(nrows),
        "customer_id": np.random.randint(100, 1000, size=nrows),
        "customer_name": [
            f"Customer_{i}" for i in np.random.randint(2**15, size=nrows)
        ],
        "product_id": product_id + 200,
        "product_names": names[product_id],
        "categories": categories[product_id],
        "quantity": quantity,
        "price": price,
        "total": price * quantity,
    }
    df = pl.DataFrame(columns)
    df.write_csv(filename)

generate(100_000_000, "large.csv")

This makes a CSV file that's about 6GiB on disk.

I am not running with WSL (I'm on Linux), but my timings look very different for the query you show (granted, the hardware is different too):

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A6000               Off |   00000000:17:00.0 Off |                  Off |
| 30%   39C    P8             21W /  300W |   24477MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

But:

import polars as pl

df = pl.scan_csv("large.csv", has_header=True)

q = df.select(pl.col("total").sum())

%time q.collect(engine="cpu")
CPU times: user 12.2 s, sys: 1 s, total: 13.2 s
Wall time: 615 ms

%time q.collect(engine="gpu")
CPU times: user 958 ms, sys: 335 ms, total: 1.29 s
Wall time: 1.29 s

So, ballpark, both engines take about the same amount of time here. Note also how much faster both are in absolute terms than your timings.

taupirho commented 1 month ago

So, the GPU is half as fast as the CPU reading CSV. Is that what you'd expect?

wence- commented 1 month ago

> So, the GPU is half as fast as the CPU reading CSV. Is that what you'd expect?

On my hardware, yes: especially with column projection, the Polars CPU CSV reader is faster than the GPU reader. The performance difference varies a bit, but a factor of about 2 is what I observe.

deanm0000 commented 1 month ago

My first inclination is that the overhead of moving the data to the GPU is large relative to how cheap the summation is on the CPU. On top of that, I'm sure there's also some conflation from using the CSV scanner rather than loading into memory first. Also, a question: is there actually a CSV reader implementation for the GPU? I thought it was just for compute, but I haven't looked into it too much.
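
If one wanted to separate the scan cost from the reduction cost, a rough way (just a sketch, assuming the GPU engine accepts an in-memory frame as its source) would be to read only the needed column into host memory once and then time the sum on each engine:

```python
import time

import polars as pl

# Parse the CSV once on the CPU, keeping only the column we need,
# so the per-engine timings below exclude CSV parsing entirely.
df = pl.read_csv("/mnt/d/sales_data/sales_data_100m.csv", columns=["total"])
q = df.lazy().select(pl.col("total").sum())

for engine in ("cpu", "gpu"):
    t0 = time.perf_counter()
    out = q.collect(engine=engine)
    print(engine, out.item(), f"{time.perf_counter() - t0:.3f}s")
```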

wence- commented 1 month ago

> My first inclination is that the overhead of moving the data to the GPU is large relative to how cheap the summation is on the CPU. On top of that, I'm sure there's also some conflation from using the CSV scanner rather than loading into memory first. Also, a question: is there actually a CSV reader implementation for the GPU? I thought it was just for compute, but I haven't looked into it too much.

The scan runs on the GPU, and the only direct GPU->CPU transfer is to produce the result (a tiny one-row frame). However, I think the Polars CPU CSV reader is simply a bit faster.

What I don't understand is why the WSL version is so much slower than my Linux-based test.

ritchie46 commented 1 month ago

The Polars CSV parser is projection-pushdown aware. We skip deserializing fields based on the projection that is pushed down.
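
For what it's worth, the projection that gets pushed into the CSV scan is visible in the optimised plan; a small sketch (the exact plan text depends on the Polars version):

```python
import polars as pl

q = pl.scan_csv("large.csv").select(pl.col("total").sum())

# The optimised plan should show the CSV scan carrying a one-column
# projection, i.e. only "total" needs to be parsed from each record.
print(q.explain())
```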