sarahyurick closed this issue 2 months ago
@sarahyurick, some clarifications:
What I think we should do is something similar to what we do in cuML.
Everything else should remain the same.
This is where we can change the code: https://github.com/rapidsai/crossfit/blob/384f3cf3e12df04678711a4c52cd385bf1ea36e0/crossfit/op/base.py#L49-L54
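The thread does not spell out the cuML pattern, so the following is only one plausible reading: inspect the type of the incoming data, pick a CPU or GPU code path from it, and leave everything else unchanged. The sketch below uses only the standard library. `backend_of` and `BackendAwareOp` are hypothetical names for illustration, not CrossFit's or cuML's actual API.

```python
def backend_of(obj):
    """Return the top-level package that defines type(obj), e.g. 'pandas' or 'cudf'."""
    return type(obj).__module__.split(".")[0]


class BackendAwareOp:
    """Hypothetical stand-in for an op that dispatches on its input's backend."""

    def __init__(self):
        # Maps a backend name such as 'pandas' or 'cudf' to a callable.
        self._impls = {}

    def register(self, backend):
        """Decorator that registers an implementation for one backend."""
        def decorator(fn):
            self._impls[backend] = fn
            return fn
        return decorator

    def __call__(self, data, *args, **kwargs):
        backend = backend_of(data)
        impl = self._impls.get(backend)
        if impl is None:
            raise TypeError(f"no implementation registered for backend {backend!r}")
        return impl(data, *args, **kwargs)


# Usage: plain lists stand in for DataFrame partitions here.
upper = BackendAwareOp()


@upper.register("builtins")
def _upper_list(data):
    return [s.upper() for s in data]
```

In a real op, the registered keys would presumably be "pandas" and "cudf", so the same call works on partitions of either a Dask or a Dask-cuDF DataFrame.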
@sarahyurick, do you want to take a stab at fixing this?
Yes, thanks. I can work on it.
See https://github.com/NVIDIA/NeMo-Curator/issues/194 for context.
Currently, when trying out this notebook with a CPU Dask DataFrame, it fails with:
TypeError: batch_text_or_text_pairs has to be a list or a tuple (got <class 'pandas.core.series.Series'>)
I have traced this back to several spots in CrossFit where we depend on GPU libraries. Here are the quick-fix changes I made while chasing the errors:
Add import pandas as pd and import numpy as np to https://github.com/rapidsai/crossfit/blob/main/crossfit/op/tokenize.py#L24
Add import pandas as pd to https://github.com/rapidsai/crossfit/blob/main/crossfit/backend/cudf/series.py#L18
Change https://github.com/rapidsai/crossfit/blob/main/crossfit/backend/cudf/series.py#L31-L44 to output = pd.DataFrame()
But after change 6, the error message is difficult to trace:
One option to fix https://github.com/NVIDIA/NeMo-Curator/issues/194 is to implement a non-CrossFit solution for the CPU case, but that is only a temporary solution. For the long-term, I would like to use CrossFit interchangeably with Dask and Dask-cuDF DataFrames.
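The TypeError quoted above says the tokenizer received a pandas Series where it requires a list or tuple. A minimal sketch of a backend-agnostic conversion follows. `to_text_list` is a hypothetical helper, not something in CrossFit or transformers. It relies on duck typing so it needs neither pandas nor cuDF installed, and it assumes a cuDF Series exposes `to_arrow()` and a pandas Series exposes `tolist()`.

```python
from collections.abc import Iterable


def to_text_list(texts):
    """Return `texts` as a plain Python list (hypothetical helper)."""
    if isinstance(texts, str):
        # A single string becomes a one-element batch.
        return [texts]
    if isinstance(texts, (list, tuple)):
        return list(texts)
    # Assumed GPU path: cuDF Series convert to host memory through Arrow.
    if hasattr(texts, "to_arrow"):
        return texts.to_arrow().to_pylist()
    # Assumed CPU path: pandas Series and NumPy arrays provide tolist().
    if hasattr(texts, "tolist"):
        return list(texts.tolist())
    if isinstance(texts, Iterable):
        return list(texts)
    raise TypeError(f"cannot convert {type(texts).__name__} to a list of texts")
```

Calling a helper like this on each partition before the tokenizer call would let the same code accept both CPU and GPU inputs, at the cost of a device-to-host copy on the GPU path.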
cc @VibhuJawa