
[FEA] Story - Improve performance with long strings #13048

Open · GregoryKimball opened 1 year ago

GregoryKimball commented 1 year ago

Many strings APIs in libcudf use thread-per-string parallelism in their implementation. This approach works great for processing smaller strings of relatively consistent length. However, for long strings (roughly 256 bytes and above) the performance of thread-per-string algorithms begins to degrade. Some strings APIs are compatible with data-parallel algorithms and can be refactored to improve performance for long strings, while other strings APIs are difficult to refactor with data-parallel algorithms.
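
For intuition, here is a minimal, hypothetical sketch of the thread-per-string pattern (not libcudf's actual kernel; the kernel name and the space-counting task are invented for illustration):

```cpp
// Hypothetical sketch of the thread-per-string pattern (not libcudf code).
// `chars` holds all rows concatenated; row i occupies [offsets[i], offsets[i+1]).
__global__ void count_spaces_per_string(char const* chars,
                                        int const* offsets,
                                        int num_strings,
                                        int* counts)
{
  int const idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= num_strings) { return; }
  int count = 0;
  // One thread scans its entire row: an 8KB row means ~8K sequential byte
  // reads on this one thread, while warp-mates with short rows finish and idle.
  for (int pos = offsets[idx]; pos < offsets[idx + 1]; ++pos) {
    if (chars[pos] == ' ') { ++count; }
  }
  counts[idx] = count;
}
```

This is why performance degrades as rows grow: work per thread scales with row length and warp divergence increases. A data-parallel refactor instead assigns threads to bytes (or a warp/block per row) and combines partial results, keeping throughput roughly independent of row length.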

Let's use this issue to track the progression:

- ✅ this API works well with long strings
- 🟢 we think this API will be straightforward to refactor
- 🟡 we have some ideas on how to refactor this API, and we'll need to experiment
- 🔴 we think this will be very difficult to refactor!
- ⚪ long string support is not a priority for this API

| Module | Function | Status | Notes |
| --- | --- | --- | --- |
| Case | capitalize | 🟡 | |
| | title | 🟡 | |
| | is_title | 🟡 | |
| | to_lower | ✅ #13142 | |
| | to_upper | ✅ #13142 | |
| | swapcase | ✅ #13142 | |
| Character Types | all_characters_of_type | ✅ #13259 | |
| | filter_characters_of_type | 🔴 | |
| Combining | join_strings | ✅ #13283 | |
| | concatenate | 🟡 | |
| | join_list_elements | 🔴 | |
| Searching | contains_re | 🟡 | |
| | matches_re | | |
| | count_re | 🔴 | |
| | like | 🟢 #13594 | |
| | find_all | 🔴 | |
| Converting | to_XXXX | | these are rarely long strings |
| | from_XXXX | | |
| Copying | repeat_string | | One overload is an exception |
| | repeat_strings | | |
| Slicing | slice_strings | ✅ #13057 | One overload allows for skipping characters; long string support is not a priority for `step > 1` or `step < 0` |
| Finding | find | ✅ #13226 | |
| | rfind | ✅ #13226 | |
| | contains | ✅ #10739 | |
| | starts_with | | |
| | ends_with | | |
| | find_multiple | 🟢 | |
| Modifying | pad | 🟡 | |
| | zfill | | |
| | reverse | 🟡 | |
| | strip | | |
| | translate | 🟡 | |
| | filter_characters | 🔴 | |
| | wrap | 🔴 | |
| Replacing | replace | ✅ #12858 | |
| | replace_slice | 🟡 | |
| | replace_re | 🔴 | |
| | replace_with_backrefs | 🔴 | |
| Splitting | partition | 🟡 | |
| | split | ✅ #4922 #13680 | |
| | split_record | ✅ #12729 | |
| | split_re | 🔴 | |
| | split_record_re | 🔴 | |
| Other | count_characters | ✅ #12779 | |
| | count_bytes | 🟢 | |

Libcudf also includes NVText APIs that will benefit from improved performance on long strings. Long-string performance is generally even more important for the text APIs, where each row may represent a sentence, paragraph, or document.

| Module | Function | Status | Notes |
| --- | --- | --- | --- |
| NGrams | generate_ngrams | | these are generally not long strings |
| | generate_character_ngrams | 🟢 | |
| | ngrams_tokenize | 🟢 #13480 | |
| Normalizing | normalize_characters | 🟢 | |
| | normalize_spaces | 🟢 #13480 | |
| Stemming | is_letter | 🟢 | |
| | porter_stemmer_measure | 🟢 | |
| Edit Distance | edit_distance | | these are generally not long strings |
| | edit_distance_matrix | | |
| Tokenizing | byte_pair_encoding | 🟡 | |
| | subword_tokenize | 🟡 | |
| | tokenize | 🟢 #13480 | |
| | count_tokens | 🟢 #13480 | |
| | character_tokenize | 🟡 | |
| | detokenize | 🟡 | |
| Replacing | replace_tokens | 🟢 #13480 | |
| | filter_tokens | 🟢 #13480 | |
| MinHashing | minhash | ✅ #13333 | |

bdice commented 1 year ago

This is a great use of a "Story" issue for documenting this work. I have some very high-level comments/ideas:

GregoryKimball commented 1 year ago

After studying the results of the strings microbenchmarks in libcudf, I'd like to share some results and early thoughts on the performance we should target with long strings.

First, here is data throughput versus data size for the split_record API. It shows about 15-35 GB/s throughput at the ~100 MB data size for string lengths between 32 and 8192 bytes.

*(figure: data throughput vs. data size for `split_record`)*

Next, here is the same plot for filter_characters_of_type. It shows 20 GB/s throughput at the ~100 MB data size for string lengths under 100 bytes. For longer strings the throughput falls into the 1-6 GB/s range.

*(figure: data throughput vs. data size for `filter_characters_of_type`)*

So perhaps we should target data throughput for long strings to be better than 80% of the data throughput for short strings. For example, if short strings reach 20 GB/s, long strings should stay above 16 GB/s.

GregoryKimball commented 1 year ago

Hello @davidwendt, I was discussing this issue with @elstehle. Do you think the "combining" or "copying" strings algorithms could benefit from DeviceBatchMemcpy?

davidwendt commented 1 year ago

> Hello @davidwendt, I was discussing this issue with @elstehle. Do you think the "combining" or "copying" strings algorithms could benefit from DeviceBatchMemcpy?

Several parts of the strings (and nvtext) API implementations essentially resolve into a fragmented set of (pointer, size) pairs, which are passed to a factory that performs an optimized gather to build the contiguous byte array for the output strings column. So a significant impact here may come from improving that gather function by taking advantage of DeviceBatchMemcpy.
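
For illustration, here is a minimal sketch of how such a gather could be expressed with CUB's batched device memcpy (`cub::DeviceMemcpy::Batched`, available in recent CUB releases, which grew out of the DeviceBatchMemcpy work). This is an assumption about the wiring, not cuDF's implementation; `gather_fragments` and its parameters are hypothetical, and the destination pointers are presumed to come from an exclusive scan of the fragment sizes:

```cpp
// Sketch only: gather fragmented (pointer, size) string pieces into one
// contiguous chars buffer using CUB's batched memcpy. Assumes CUB >= 2.0
// and CUDA >= 11.2 (for cudaMallocAsync); error checking elided.
#include <cub/device/device_memcpy.cuh>
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

cudaError_t gather_fragments(void const** d_src_ptrs,       // one pointer per fragment
                             void** d_dst_ptrs,             // pointers into the output
                                                            // buffer, computed from an
                                                            // exclusive scan of d_sizes
                             std::uint32_t const* d_sizes,  // bytes per fragment
                             std::uint32_t num_fragments,
                             cudaStream_t stream)
{
  // First call only computes the required temporary storage size.
  void* d_temp      = nullptr;
  std::size_t bytes = 0;
  cub::DeviceMemcpy::Batched(d_temp, bytes, d_src_ptrs, d_dst_ptrs,
                             d_sizes, num_fragments, stream);
  cudaMallocAsync(&d_temp, bytes, stream);
  // Second call performs all copies in one launch, load-balanced across
  // fragments regardless of their individual sizes.
  auto const err = cub::DeviceMemcpy::Batched(d_temp, bytes, d_src_ptrs,
                                              d_dst_ptrs, d_sizes,
                                              num_fragments, stream);
  cudaFreeAsync(d_temp, stream);
  return err;
}
```

The appeal for long strings is that the batched copy redistributes work by bytes rather than by buffer, so a handful of very large fragments cannot serialize onto a few threads the way they do in a thread-per-string gather.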

The factory function that uses the gather is illustrated in my blog post, which uses the custom kernel here: https://github.com/rapidsai/cudf/blob/cd56cc2b4cc47a1d0c63e56fac945a66905c28df/cpp/examples/strings/custom_prealloc.cu#L118

I'll locate other places that libcudf uses this factory function that also include benchmarks.

GregoryKimball commented 5 months ago

Spark-RAPIDS identified long-string performance issues in `find` (#15405) as well as in `upper`/`lower` (#15406).