I'm currently working on https://github.com/NVIDIA/NeMo-Curator/pull/173 for NeMo Curator, which uses a multifold quality classifier to generate text quality predictions and their probabilities. The goal is to generate different probabilities per model fold and average them to generate a final prediction. However, I'm finding that only the results for the first quality model used in the pipeline are being saved, despite ensuring that the column names are different. See this notebook for an example.
@VibhuJawa suggested that the bug might be caused by CrossFit modifying the same internal flag in the Dask DataFrame. Also, calling persist() on the Dask DataFrames produces the correct results, but from my understanding this isn't desirable: persist() materializes the DataFrame in memory, and the intended use is to read, modify, and write very large JSONL files.
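To make the expected outcome concrete, here is a minimal sketch of the per-fold averaging described above, written in plain Python rather than Dask. It is not the NeMo Curator or CrossFit implementation; the label set, the function names (`average_folds`, `predict`), and the sample probabilities are all hypothetical and only illustrate what the final columns should contain once every fold's output survives.

```python
from statistics import fmean

# Hypothetical label set; the real classifier's labels may differ.
LABELS = ["High", "Medium", "Low"]


def average_folds(fold_probs):
    """Average class probabilities across folds.

    fold_probs[f][r] is the list of class probabilities that
    fold f assigned to row r.
    """
    n_rows = len(fold_probs[0])
    averaged = []
    for row in range(n_rows):
        # Group the probabilities of each class across all folds.
        per_class = zip(*(fold[row] for fold in fold_probs))
        averaged.append([fmean(values) for values in per_class])
    return averaged


def predict(avg_probs, labels=LABELS):
    """Pick the label with the highest averaged probability per row."""
    return [labels[max(range(len(p)), key=p.__getitem__)] for p in avg_probs]


# Made-up probabilities for two documents and three folds.
fold_probs = [
    [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],  # fold 0
    [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]],  # fold 1
    [[0.6, 0.3, 0.1], [0.3, 0.4, 0.3]],  # fold 2
]

avg = average_folds(fold_probs)
preds = predict(avg)
```

With the bug described above, only fold 0's probabilities would be present, so the averaged values would not reflect the other folds.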