taupirho opened this issue 1 month ago
Hmm, I wrote a Python script to generate data like so:
```python
import polars as pl
import numpy as np


def generate(nrows: int, filename: str):
    names = np.asarray(
        [
            "Laptop",
            "Smartphone",
            "Desk",
            "Chair",
            "Monitor",
            "Printer",
            "Paper",
            "Pen",
            "Notebook",
            "Coffee Maker",
            "Cabinet",
            "Plastic Cups",
        ]
    )
    categories = np.asarray(
        [
            "Electronics",
            "Electronics",
            "Office",
            "Office",
            "Electronics",
            "Electronics",
            "Stationery",
            "Stationery",
            "Stationery",
            "Electronics",
            "Office",
            "Sundry",
        ]
    )
    product_id = np.random.randint(len(names), size=nrows)
    quantity = np.random.randint(1, 11, size=nrows)
    price = np.random.randint(199, 10000, size=nrows) / 100
    columns = {
        "order_id": np.arange(nrows),
        "customer_id": np.random.randint(100, 1000, size=nrows),
        "customer_name": [
            f"Customer_{i}" for i in np.random.randint(2**15, size=nrows)
        ],
        "product_id": product_id + 200,
        "product_names": names[product_id],
        "categories": categories[product_id],
        "quantity": quantity,
        "price": price,
        "total": price * quantity,
    }
    df = pl.DataFrame(columns)
    df.write_csv(filename)


generate(100_000_000, "large.csv")
```
This makes a CSV file that's about 6 GiB on disk.

I am not running under WSL (I'm on Linux), but my timings look very different for the query you show (OK, the hardware is different too):

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A6000                Off |  00000000:17:00.0 Off |                  Off |
| 30%   39C    P8             21W /  300W |  24477MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```
But:
```
import polars as pl

df = pl.scan_csv("large.csv", has_header=True)
q = df.select(pl.col("total").sum())

%time q.collect(engine="cpu")
CPU times: user 12.2 s, sys: 1 s, total: 13.2 s
Wall time: 615 ms

%time q.collect(engine="gpu")
CPU times: user 958 ms, sys: 335 ms, total: 1.29 s
Wall time: 1.29 s
```
So, ballpark, they take about the same amount of time. And note the absolute speed differences.
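To put those absolute numbers in perspective, here is a rough back-of-the-envelope throughput calculation using only the figures quoted above (file size and wall times), so treat the exact values as approximate:

```python
file_size_gib = 6.0   # "about 6 GiB on disk", from the comment above
cpu_wall_s = 0.615    # CPU engine wall time reported above
gpu_wall_s = 1.29     # GPU engine wall time reported above

print(f"CPU engine: ~{file_size_gib / cpu_wall_s:.1f} GiB/s of CSV scanned")
print(f"GPU engine: ~{file_size_gib / gpu_wall_s:.1f} GiB/s of CSV scanned")
# Roughly 10 GiB/s vs 5 GiB/s -- both are fast in absolute terms.
```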
So, the GPU is half as fast as the CPU reading CSV. Is that what you'd expect?
> So, the GPU is half as fast as the CPU reading CSV. Is that what you'd expect?
On my hardware, it seems that yes: especially with column projection, the Polars CPU CSV reader is faster than the GPU reader. The performance difference varies a bit, but a factor of about 2 is what I observe.
My first inclination is that the overhead of moving the data to the GPU is relatively large compared to how cheap the summation is on the CPU. On top of that, I'm sure there's also some conflation with using the CSV scanner rather than loading into memory first. Also, a question: is there actually a CSV reader implementation for the GPU? I thought it was just for compute, but I haven't looked into it too much.
> My first inclination is that the overhead of moving the data to the GPU is relatively large compared to how cheap the summation is on the CPU. On top of that, I'm sure there's also some conflation with using the CSV scanner rather than loading into memory first. Also, a question: is there actually a CSV reader implementation for the GPU? I thought it was just for compute, but I haven't looked into it too much.
The scan runs on the GPU, and the only direct GPU-to-CPU transfer is to produce the result (a tiny one-row frame). However, I think it is simply that the Polars CPU CSV reader is a bit faster.
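As an aside (not from the thread): one way to take the CSV scanner out of the comparison is to parse the file into memory once and then time only the reduction on each engine. A minimal sketch, assuming a recent Polars build with the GPU engine installed; `pl.GPUEngine(raise_on_fail=True)` is used so the query errors out instead of silently falling back to the CPU engine:

```python
import time

import polars as pl

# Parse the CSV once up front so neither engine pays for it in the timing
# (needs enough RAM to hold the ~100M-row frame).
df = pl.read_csv("large.csv")
q = df.lazy().select(pl.col("total").sum())

for label, engine in [("cpu", "cpu"), ("gpu", pl.GPUEngine(raise_on_fail=True))]:
    start = time.perf_counter()
    result = q.collect(engine=engine)
    print(f"{label}: total={result.item():.2f} in {time.perf_counter() - start:.3f}s")
```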
What I don't understand is why the WSL version is so much slower than my Linux-based test.
The Polars CSV parser is projection-pushdown aware: we skip serializing fields based on the projection that is pushed down.
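Purely as an illustration of that pushdown (not from the thread): the optimized plan for the query above should show only the `total` column being read from the CSV scan. The exact plan text varies between Polars versions, so treat the comment below as approximate:

```python
import polars as pl

q = pl.scan_csv("large.csv").select(pl.col("total").sum())

# The optimized plan should show the projection pushed into the scan node,
# e.g. something like "Csv SCAN large.csv / PROJECT 1/9 COLUMNS".
print(q.explain())
```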
### Checks

### Reproducible example

### Log output

No response

### Issue description

I created a 100 million CSV record set. The schema looks like this,

First few records,

Here is the Visual Studio C program I used to produce the CSV file

### Expected behavior

Expected the GPU code to be faster than or the same as the CPU code.

### Installed versions