
[FEA] Story - Improve performance with long strings #13048

Open · GregoryKimball opened 1 year ago

GregoryKimball commented 1 year ago

Many strings APIs in libcudf use thread-per-string parallelism in their implementation. This approach works great for processing smaller strings of relatively consistent length. However, for long strings (roughly 256 bytes and above) the performance of thread-per-string algorithms begins to degrade. Some strings APIs are compatible with data-parallel algorithms and can be refactored to improve performance for long strings, while other strings APIs are difficult to refactor with data-parallel algorithms.
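
For intuition, here is a minimal, hypothetical sketch of the thread-per-string pattern (not libcudf's actual kernel; the kernel name and the space-counting task are invented for illustration):

```cpp
// Hypothetical sketch of the thread-per-string pattern (not libcudf code).
// `chars` holds all rows concatenated; row i occupies [offsets[i], offsets[i+1]).
__global__ void count_spaces_per_string(char const* chars,
                                        int const* offsets,
                                        int num_strings,
                                        int* counts)
{
  int const idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= num_strings) { return; }
  int count = 0;
  // One thread scans its entire row: an 8KB row means ~8K sequential byte
  // reads on this one thread, while warp-mates with short rows finish and idle.
  for (int pos = offsets[idx]; pos < offsets[idx + 1]; ++pos) {
    if (chars[pos] == ' ') { ++count; }
  }
  counts[idx] = count;
}
```

This is why performance degrades as rows grow: work per thread scales with row length and warp divergence increases. A data-parallel refactor instead assigns threads to bytes (or a warp/block per row) and combines partial results, keeping throughput roughly independent of row length.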

Let's use this issue to track the progression:

- ✅ this API works well with long strings
- 🟢 we think this API will be straightforward to refactor
- 🟡 we have some ideas on how to refactor this API, and we'll need to experiment
- 🔴 we think this will be very difficult to refactor!
- ⚪ long string support is not a priority for this API

| Module | Function | Status | Notes |
| --- | --- | --- | --- |
| Case | capitalize | 🟡 | |
| | title | 🟡 | |
| | is_title | 🟡 | |
| | to_lower | ✅ #13142 | |
| | to_upper | ✅ #13142 | |
| | swapcase | ✅ #13142 | |
| Character Types | all_characters_of_type | ✅ #13259 | |
| | filter_characters_of_type | 🔴 | |
| Combining | join_strings | ✅ #13283 | |
| | concatenate | 🟡 | |
| | join_list_elements | 🔴 | |
| Searching | contains_re | 🟡 | |
| | matches_re | | |
| | count_re | 🔴 | |
| | like | 🟢 #13594 | |
| | find_all | 🔴 | |
| Converting | to_XXXX | | these are rarely long strings |
| | from_XXXX | | |
| Copying | repeat_string | | One overload is an exception |
| | repeat_strings | | |
| Slicing | slice_strings | ✅ #13057 | One overload allows for skipping characters; long string support is not a priority for `step > 1` or `step < 0` |
| Finding | find | ✅ #13226 | |
| | rfind | ✅ #13226 | |
| | contains | ✅ #10739 | |
| | starts_with | | |
| | ends_with | | |
| | find_multiple | 🟢 | |
| Modifying | pad | 🟡 | |
| | zfill | | |
| | reverse | 🟡 | |
| | strip | | |
| | translate | 🟡 | |
| | filter_characters | 🔴 | |
| | wrap | 🔴 | |
| Replacing | replace | ✅ #12858 | |
| | replace_slice | 🟡 | |
| | replace_re | 🔴 | |
| | replace_with_backrefs | 🔴 | |
| Splitting | partition | 🟡 | |
| | split | ✅ #4922 #13680 | |
| | split_record | ✅ #12729 | |
| | split_re | 🔴 | |
| | split_record_re | 🔴 | |
| Other | count_characters | ✅ #12779 | |
| | count_bytes | 🟢 | |

Libcudf also includes NVText APIs that will benefit from improved performance on long strings. Long-string performance is generally even more important for the text APIs, where each row may represent a sentence, paragraph, or document.

| Module | Function | Status | Notes |
| --- | --- | --- | --- |
| NGrams | generate_ngrams | | these are generally not long strings |
| | generate_character_ngrams | 🟢 | |
| | ngrams_tokenize | 🟢 #13480 | |
| Normalizing | normalize_characters | 🟢 | |
| | normalize_spaces | 🟢 #13480 | |
| Stemming | is_letter | 🟢 | |
| | porter_stemmer_measure | 🟢 | |
| Edit Distance | edit_distance | | these are generally not long strings |
| | edit_distance_matrix | | |
| Tokenizing | byte_pair_encoding | 🟡 | |
| | subword_tokenize | 🟡 | |
| | tokenize | 🟢 #13480 | |
| | count_tokens | 🟢 #13480 | |
| | character_tokenize | 🟡 | |
| | detokenize | 🟡 | |
| Replacing | replace_tokens | 🟢 #13480 | |
| | filter_tokens | 🟢 #13480 | |
| MinHashing | minhash | ✅ #13333 | |

bdice commented 1 year ago

This is a great use of a "Story" issue for documenting this work. I have some very high-level comments/ideas:

GregoryKimball commented 1 year ago

After studying the results of the strings microbenchmarks in libcudf, I'd like to share some results and early thoughts on the performance we should target with long strings.

First, here is data throughput versus data size for the split_record API. It shows about 15-35 GB/s throughput at the ~100 MB data size for string lengths between 32 and 8192 bytes.

*(figure: data throughput vs. data size for `split_record`)*

Next, here is the same plot for filter_characters_of_type. It shows 20 GB/s throughput at the ~100 MB data size for string lengths under 100 bytes. For longer strings the throughput falls into the 1-6 GB/s range.

*(figure: data throughput vs. data size for `filter_characters_of_type`)*

So perhaps we should target data throughput for long strings to be better than 80% of the data throughput for short strings. For example, if short strings reach 20 GB/s, long strings should stay above 16 GB/s.

GregoryKimball commented 1 year ago

Hello @davidwendt, I was discussing this issue with @elstehle. Do you think the "combining" or "copying" strings algorithms could benefit from DeviceBatchMemcpy?

davidwendt commented 1 year ago

> Hello @davidwendt, I was discussing this issue with @elstehle. Do you think the "combining" or "copying" strings algorithms could benefit from DeviceBatchMemcpy?

Several parts of the strings (and nvtext) API implementations essentially resolve into a fragmented set of (pointer, size) pairs, which are passed to a factory that performs an optimized gather to build the contiguous byte array for the output strings column. So a significant impact here may come from improving that gather function by taking advantage of DeviceBatchMemcpy.
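
For illustration, here is a minimal sketch of how such a gather could be expressed with CUB's batched device memcpy (`cub::DeviceMemcpy::Batched`, available in recent CUB releases, which grew out of the DeviceBatchMemcpy work). This is an assumption about the wiring, not cuDF's implementation; `gather_fragments` and its parameters are hypothetical, and the destination pointers are presumed to come from an exclusive scan of the fragment sizes:

```cpp
// Sketch only: gather fragmented (pointer, size) string pieces into one
// contiguous chars buffer using CUB's batched memcpy. Assumes CUB >= 2.0
// and CUDA >= 11.2 (for cudaMallocAsync); error checking elided.
#include <cub/device/device_memcpy.cuh>
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

cudaError_t gather_fragments(void const** d_src_ptrs,       // one pointer per fragment
                             void** d_dst_ptrs,             // pointers into the output
                                                            // buffer, computed from an
                                                            // exclusive scan of d_sizes
                             std::uint32_t const* d_sizes,  // bytes per fragment
                             std::uint32_t num_fragments,
                             cudaStream_t stream)
{
  // First call only computes the required temporary storage size.
  void* d_temp      = nullptr;
  std::size_t bytes = 0;
  cub::DeviceMemcpy::Batched(d_temp, bytes, d_src_ptrs, d_dst_ptrs,
                             d_sizes, num_fragments, stream);
  cudaMallocAsync(&d_temp, bytes, stream);
  // Second call performs all copies in one launch, load-balanced across
  // fragments regardless of their individual sizes.
  auto const err = cub::DeviceMemcpy::Batched(d_temp, bytes, d_src_ptrs,
                                              d_dst_ptrs, d_sizes,
                                              num_fragments, stream);
  cudaFreeAsync(d_temp, stream);
  return err;
}
```

The appeal for long strings is that the batched copy redistributes work by bytes rather than by buffer, so a handful of very large fragments cannot serialize onto a few threads the way they do in a thread-per-string gather.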

The factory function that uses the gather is illustrated in my blog post, which uses the custom kernel here: https://github.com/rapidsai/cudf/blob/cd56cc2b4cc47a1d0c63e56fac945a66905c28df/cpp/examples/strings/custom_prealloc.cu#L118

I'll locate other places that libcudf uses this factory function that also include benchmarks.

GregoryKimball commented 5 months ago

Spark-RAPIDS identified long-string performance issues in `find` (#15405) as well as in `upper`/`lower` (#15406).