rapidsai-community / notebooks-contrib

RAPIDS Community Notebooks
Apache License 2.0

[QST] Really slow csv process #220

Open BlueFelix opened 4 years ago

BlueFelix commented 4 years ago

What is your question? I'm trying to run NYCTaxi-E2E and noticed very slow CSV processing.

The part below takes 2min 34s on 16V100. Is that normal?

%%time
X_train = taxi_df.query('day < 25').persist()

# create a Y_train ddf with just the target variable
Y_train = X_train[['fare_amount']].persist()
# drop the target variable from the training ddf
X_train = X_train[X_train.columns.difference(['fare_amount'])]

# this won't return until all data is in GPU memory
done = wait([X_train, Y_train])

taureandyernv commented 4 years ago

Hey @BlueFelix, this part may be slow because it's downloading ~300GB of data into GPU memory, and bandwidth/speed can vary. It takes a while for me too. I've found that getting the data is sometimes the longest part of a notebook. :) Does this help?
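
As a rough sanity check, the two figures mentioned in this thread (~300GB of data and a 2min 34s wall time) can be turned into an effective throughput. This is only back-of-the-envelope arithmetic, assuming the full ~300GB really is moved inside the timed cell; the function name is made up for illustration.

```python
def effective_throughput_gb_per_s(size_gb: float, seconds: float) -> float:
    """Average data rate implied by moving size_gb in the given wall time."""
    return size_gb / seconds


# Figures taken from the thread; the 300 GB value is approximate.
data_gb = 300
elapsed_s = 2 * 60 + 34  # 2min 34s expressed in seconds

rate = effective_throughput_gb_per_s(data_gb, elapsed_s)
print(f"{elapsed_s} s for ~{data_gb} GB is about {rate:.2f} GB/s")
```

If that rate is close to what your network or storage can deliver, the cell is probably I/O-bound rather than limited by the GPUs.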