rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] OOM when invoking normalize_characters on a relatively small dataframe #10858

Open miguelusque opened 2 years ago

miguelusque commented 2 years ago

Describe the bug

Hi, I am facing the following OOM when invoking normalize_characters() on a relatively small dataset, which uses less than 2 GB of VRAM on a 16 GB V100.

In case it helps, normalize_spaces() works without issues in the same dataframe.

Please, find below how to reproduce it.


Steps/Code to reproduce bug

import cudf
import pandas as pd

df = cudf.DataFrame({"text": pd.util.testing.rands_array(256, 5000000)})

df["text"] = df["text"].str.normalize_spaces()

df["text"] = df["text"].str.normalize_characters(do_lower=False)

Expected behavior

No OOM.

Environment overview

DGX V100, cudf 22.04

shwina commented 2 years ago

cc: @davidwendt is this the currently expected memory overhead?

davidwendt commented 2 years ago

Yes, this is the expected memory overhead. From what I can tell, the working memory size math looks like the following:

(bytes * 3 * sizeof(uint32)) + (bytes * sizeof(uint32)) = (bytes * 12) + (bytes * 4) = bytes * 16
where bytes = number of bytes in the chars child column
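As a back-of-envelope check (my own arithmetic, not stated in the thread), applying this 16x estimate to the reproducer above shows why it OOMs on a 16 GiB card:

```python
# Rough working-memory estimate for the reproducer:
# 5,000,000 strings of 256 characters each.
n_rows = 5_000_000
chars_per_row = 256

chars_bytes = n_rows * chars_per_row      # bytes in the chars child column
working_bytes = chars_bytes * 16          # (3 + 1) * sizeof(uint32) per byte

print(chars_bytes / 2**30)    # ~1.19 GiB of string data
print(working_bytes / 2**30)  # ~19.07 GiB working memory, over the 16 GiB V100
```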

So the working memory required is ~16x the number of bytes in the strings column. This does not include any working memory required by various thrust calls or the size of the output column. My recommendation would be to slice the column and then call normalize on the slices, using the 16x math to help determine the slice sizes.
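The slicing suggestion could be sketched roughly as follows. This is not an official cudf recipe: the helper `num_slices` is hypothetical, and only `str.byte_count()`, `str.normalize_characters()`, and `cudf.concat` are real cudf APIs.

```python
def num_slices(total_chars_bytes: int, budget_bytes: int) -> int:
    """Number of slices so that ~16x working memory fits within budget_bytes.

    Uses the 16x rule of thumb from above; ceil division via -(-a // b).
    """
    needed = total_chars_bytes * 16
    return max(1, -(-needed // budget_bytes))

# Hypothetical usage with cudf (requires a GPU), normalizing slice by slice:
#
# import cudf
# total = int(df["text"].str.byte_count().sum())
# n = num_slices(total, budget_bytes=8 * 2**30)  # leave headroom for thrust, output
# step = -(-len(df) // n)
# parts = [df["text"].iloc[i:i + step].str.normalize_characters(do_lower=False)
#          for i in range(0, len(df), step)]
# df["text"] = cudf.concat(parts, ignore_index=True)
```

The budget here should be well under total VRAM, since the 16x figure excludes thrust scratch space and the output column.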

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

miguelusque commented 2 years ago

Hi!

I think this issue is still relevant. IMHO, a 16x memory requirement for a normalize characters operation is a bit high.

Could anyone please consider an alternative implementation? Thanks!


vyasr commented 6 months ago

@davidwendt I see that you've assigned yourself, did you spend any time considering alternative implementations here? Is this an issue for which you think we could use memory more sparingly with a suitable alternative implementation, or is there some fundamental blocker that would make this unfixable?

davidwendt commented 6 months ago

Yes, I believe there are alternative implementations that could reduce memory usage. There is no fundamental blocker -- just priority and time.

vyasr commented 6 months ago

Cool. Marking as backlogged.