rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Performance difference between cudf and dask_cudf when reading jsonl files #10867

Open miguelusque opened 2 years ago

miguelusque commented 2 years ago

Describe the bug

Hi, I have noticed a performance difference when reading a jsonl file with cudf and with dask_cudf.

In both cases, I am using only one GPU.

I have the following files (see details below):

Please find below the execution times (in seconds) when I run them on a DGX-1 V100 (16 GB):

(rapids) root@6ccf9a94ad0e:/rapids/notebooks/host# python jsonl_cudf.py 
4.183666706085205
(rapids) root@6ccf9a94ad0e:/rapids/notebooks/host# python jsonl_dask_cudf.py
6.8754589557647705

The contents of the scripts are as follows. jsonl_cudf.py:

import cudf
import time

start = time.time()
df = cudf.read_json("x00_002GB.jsonl", lines=True)
end = time.time()
print(end - start)

and jsonl_dask_cudf.py:

import dask_cudf
import time

start = time.time()
df = dask_cudf.read_json("x00_002GB.jsonl", lines=True)
end = time.time()
print(end - start)
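
For anyone trying to reproduce the comparison, here is a minimal sketch of a slightly more controlled timing harness. It rests on assumptions rather than a diagnosis: Dask collections are generally lazy, so the dask_cudf call above may not be timing the same work as cudf.read_json, and a single cold run may also include one-time start-up costs. The helper names time_call and compare_readers are hypothetical, and the file name is the one from the report.

```python
import time
from statistics import median


def time_call(fn, repeats=3, warmup=1):
    """Return the median wall-clock time in seconds of fn() over `repeats` runs."""
    for _ in range(warmup):
        fn()  # untimed warm-up run, meant to absorb one-time start-up costs
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def compare_readers(path="x00_002GB.jsonl"):
    """Time cudf and dask_cudf reading the same JSON Lines file."""
    # Imported here so time_call can be used without a GPU environment.
    import cudf
    import dask_cudf

    timings = {
        "cudf": time_call(lambda: cudf.read_json(path, lines=True)),
        # Assumption: .compute() is needed to force the full read,
        # because the dask_cudf result is a lazy collection.
        "dask_cudf": time_call(
            lambda: dask_cudf.read_json(path, lines=True).compute()
        ),
    }
    for name, seconds in timings.items():
        print(f"{name}: {seconds:.3f} s")
    return timings
```

Calling compare_readers("x00_002GB.jsonl") in the same environment would give median timings for both readers. The numbers in the report above come from the original single-run scripts, not from this harness.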

Steps/Code to reproduce bug

Hi @shwina, as discussed in the Slack channel, I will send you an email with the link to the dataset used. Thanks!

Expected behavior

Not such a large difference in performance.

Environment overview (please complete the following information)

DGX-A100, CUDA 11.5, RAPIDS 22.04

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

quasiben commented 1 year ago

If this is still an issue, can you, @miguelusque, post an example file here?
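
If the original dataset cannot be shared, a synthetic stand-in might be enough to check whether the gap reproduces. The sketch below writes a JSON Lines file using only the Python standard library. The function name write_sample_jsonl, the three-field schema, and the row count are all hypothetical and do not reflect the file used in the report.

```python
import json
import random


def write_sample_jsonl(path, n_rows=1000, seed=0):
    """Write n_rows JSON objects to path, one object per line."""
    rng = random.Random(seed)  # seeded so the file is reproducible
    with open(path, "w", encoding="utf-8") as f:
        for i in range(n_rows):
            # Hypothetical schema: an integer, a float, and a short string.
            record = {"id": i, "value": rng.random(), "label": f"row_{i}"}
            f.write(json.dumps(record) + "\n")
    return path
```

Increasing n_rows until the file approaches the size of the original would make the comparison closer to the reported case, though a different schema may well behave differently.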