rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Make cudf::size_type 64-bit #3958

Closed harrism closed 2 years ago

harrism commented 4 years ago

Is your feature request related to a problem? Please describe.

cudf::size_type is currently an int32_t, which limits column size to two billion elements (INT32_MAX). Moreover, it limits child column size to the same. This causes problems, for example, for string columns, where there may be fewer than 2B strings, but the character data to represent them could easily exceed 2B characters.
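For a concrete sense of when this bites (the numbers below are hypothetical, not from a specific dataset), the character total of a strings column can exceed the 32-bit offset range long before GPU memory is exhausted. A minimal sketch:

#include <cstdint>
#include <iostream>

int main()
{
  // Hypothetical column: 300 million strings averaging 10 characters each.
  std::int64_t const num_strings = 300'000'000;
  std::int64_t const avg_chars   = 10;
  std::int64_t const total_chars = num_strings * avg_chars;  // 3,000,000,000

  // The last entry of the offsets child column must hold total_chars,
  // but a 32-bit cudf::size_type tops out at INT32_MAX (2,147,483,647).
  std::cout << "total characters:       " << total_chars << '\n'
            << "fits in 32-bit offsets? " << std::boolalpha
            << (total_chars <= INT32_MAX) << '\n';
}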

A 32-bit size was originally chosen to ensure compatibility with Apache Arrow, which dictates that Arrow arrays have a 32-bit size, and that larger arrays are made by chunking into individual Arrays.

Describe the solution you'd like

Make cudf::size_type 64-bit (int64_t). Handle compatibility with Arrow by creating arrow chunked arrays in the libcudf to_arrow interface (not yet created), and combine arrow chunked arrays in the libcudf from_arrow interface. This can be dealt with when we create these APIs.

Describe alternatives you've considered

Chunked columns. This would be very challenging -- supporting chunked columns in every algorithm would result in complex distributed algorithms and implementations, where libcudf currently aims to be communication agnostic / ignorant. In other words, a higher level library handles distributed algorithms.

Additional context

A potential downside: @felipeblazing called us brave for considering supporting chunked columns. If we implement this feature request, perhaps he will not consider us quite so brave. :(

kkraus14 commented 4 years ago

A potential downside: @felipeblazing called us brave for considering supporting chunked columns. If we implement this feature request, perhaps he will not consider us quite so brave. :(

🤣

kkraus14 commented 4 years ago

Handle compatibility with Arrow by creating arrow chunked arrays in the libcudf to_arrow interface (not yet created), and combine arrow chunked arrays in the libcudf from_arrow interface. This can be dealt with when we create these APIs.

We have these APIs on the Python side already, so we'd ideally need these APIs sooner rather than later, though I do believe the PyArrow CPU side will handle splitting chunks for us.

felipeblazing commented 4 years ago

This seems like a really reasonable way to ensure that we are compatible with larger columns almost for free. If we want to revisit this later, that might make sense, but right now I can imagine that the discussions alone for moving to a chunked representation would be long.

revans2 commented 4 years ago

My biggest concern is the amount of memory this will take and the performance impact it will have. In a lot of the data sets we encounter, the length of the strings is small, often very small. Doing this in many of these cases would effectively double the amount of memory used to store a string, and hence cut the performance of all string operations in half, assuming they are bound by memory access speed. This will also be true for arrays when they show up. Not to mention that we would use a lot more precious GPU memory to store offsets, memory we could have used to store actual data.

I just want to be sure that if we do this, we do some performance and capacity measurements before and after to see what impact it will actually have.

wmalpica commented 4 years ago

@revans2 that is a valid concern. We could handle that by having a string datatype variant that stores offsets in int32_t. Similar to how we have different TIMESTAMP datatypes, we could implement STRING variants. This would be more of a future optimization that we don't have to implement right now, but knowing we could do it later means that adopting a 64-bit size_type does not necessarily paint us into a corner, so to speak.

felipeblazing commented 4 years ago

The reason for switching to 64-bit is more about capability than performance. Right now there are ways to break it with data that is well within the memory limits of the GPU. I agree there will be a perf penalty, which we can evaluate, but the alternatives are pretty bleak: either telling users "don't make your columns too big" or taking on another massive refactor project.

ChuckHastings commented 4 years ago

There are a number of cudf functions that use cudf::size_type to create intermediate data structures (e.g. build an array of row offsets in some kernel and then use gather/scatter to create a new table). Doubling the size of cudf::size_type would double the temporary memory used by these methods - essentially halving the size of the table you could effectively operate on with one GPU.

To get around that you could either change these to use an int32 directly (although as GPU memory sizes increase that would become an artificial limit), or create 32-bit and 64-bit versions of these internal algorithms and call the 32-bit version when the number of rows is < 2 billion. That could make things a bit more complicated.

nvdbaranec commented 4 years ago

Quick scan of the code doesn't turn up much in the way of portability problems. Every *4 or * 4 seems legit. No incorrect usage of 0xffffffff or 0x80000000. The handful of places calling sizeof(size_type) seem legit as well.

Pure mismatches between size_type and int/int32_t are harder to detect without actually compiling and testing, of course, but at least at the surface level there don't appear to be any deep-rooted issues with implicit assumptions that size_type == 4 bytes.

harrism commented 4 years ago

@ChuckHastings do you have a pointer to an example? I think that would be helpful.

harrism commented 4 years ago

To @revans2 's comment, we have actually already run into the case I mentioned of a string column with < 2B strings but > 2B characters. Not saying it's a common case, but it's an ugly situation.

jrhemstad commented 4 years ago

@ChuckHastings do you have a pointer to an example? I think that would be helpful.

Groupby/Join are an example of this. The hash map uses size_type for both the keys and values (since they are row offsets into tables). This would double the size of the hash map for both join and groupby.
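For a rough sense of scale (a standalone sketch, not the actual libcudf hash-map code), each map entry holds a pair of row offsets, so widening size_type doubles the per-entry footprint:

#include <cstdint>
#include <utility>

// Each hash-map entry stores a pair of row offsets (build row, probe/result row).
using entry_32 = std::pair<std::int32_t, std::int32_t>;
using entry_64 = std::pair<std::int64_t, std::int64_t>;

static_assert(sizeof(entry_32) == 8,  "8 bytes per entry with 32-bit offsets");
static_assert(sizeof(entry_64) == 16, "16 bytes per entry with 64-bit offsets");

// With a hypothetical 1-billion-row build table at ~50% map occupancy,
// that is roughly 16 GB vs. 32 GB of temporary GPU memory for the map alone.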

harrism commented 4 years ago

How big is the hash map?

ChuckHastings commented 4 years ago

Join creates the following intermediate products:

harrism commented 4 years ago

For the intermediates, could we detect that the column size is < 2B and use int32 for the hash map and VectorPair?

ChuckHastings commented 4 years ago

That's what I was intending by my suggestion above. We could create a template parameter that identifies whether to use int32 or int64 for the hash map and vector pair and then call the correct version depending on the size of the tables.
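A minimal sketch of that kind of dispatch (function names are hypothetical, not the actual libcudf join code): template the internal implementation on the offset type and select it from the row count at runtime.

#include <cstdint>
#include <cstdio>
#include <limits>

// Hypothetical internal join implementation, templated on the type used for
// row offsets in the hash map and the intermediate gather maps.
template <typename OffsetType>
void inner_join_impl(std::int64_t build_num_rows)
{
  std::printf("joining %lld rows using %zu-byte row offsets\n",
              static_cast<long long>(build_num_rows), sizeof(OffsetType));
  // ... build hash map of OffsetType pairs, probe, gather ...
}

// Use the narrow offsets whenever the build table fits in 32 bits; fall back
// to 64-bit offsets only for very large tables.
void inner_join_dispatch(std::int64_t build_num_rows)
{
  if (build_num_rows <= std::numeric_limits<std::int32_t>::max()) {
    inner_join_impl<std::int32_t>(build_num_rows);
  } else {
    inner_join_impl<std::int64_t>(build_num_rows);
  }
}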

jakirkham commented 4 years ago

Just as a note, we will want to update any lines like this one that mirror this type for higher-level code (Python and/or Java).

revans2 commented 4 years ago

Yes, java code would have to be updated to handle this. It is not a big deal though and I am happy to do it.

I have also run into situations where a string column cannot be stored because there are too many characters in it, so I know that it is a real issue.

But, just to play devil's advocate... When we first started work on Spark and mentioned that there were algorithms in Spark that would spill to disk so they could operate on data larger than what fits in host memory, we were told that was not really a priority because we could always partition the data into smaller pieces. Why is that not true here too? Why is this a question of capability when you have the option to partition the data into smaller pieces? Yes, I know that cudf itself is not going to be partitioning the data, a la chunked Arrow vectors, but Spark chunks the data itself without any cudf knowledge, and I know Dask can as well.

jakirkham commented 4 years ago

Yep agreed. Just making sure that point doesn't get lost in the shuffle (pun not intended 😉).

Yep that's true of both frameworks.

There's a tradeoff between using algorithms that operate on local data vs. distributed data. The increased size would let users tune this more appropriately for their workflows to get the desired performance.

kkraus14 commented 4 years ago

I have also run into situations where a string column cannot be stored because there are too many characters in it, so I know that it is a real issue.

I think this is the core issue. If we could have max(int32_t) strings I think everyone would be a happy camper, but a max(int32_t) byte limit severely restricts the number of elements you can have in a table before needing to partition it, which can adversely impact performance.

harrism commented 4 years ago

The main thing is that libcudf does not want to have to write algorithms that operate on chunked columns. We prefer dask/spark to do the distributed algorithms, and libcudf to provide the primitives on which to build them.

If we can solve the strings column problem by having a different size_type for the size of character arrays, that might reduce the current pain. Note that this problem probably affects other nested types too (e.g. lists). @jrhemstad @davidwendt this suggestion goes against the grain of the current design of strings columns / nested columns, where the data array (characters, in the case of strings) is just a child column.

jrhemstad commented 4 years ago

If we can solve the strings column problem by having a different size_type for the size of character arrays, that might reduce the current pain.

It's not really possible to change the size_type for only character columns. It's an all or nothing deal.

davidwendt commented 4 years ago

I noticed a few places in the libcudf code where columns are created with INT32 type but then referenced using data<size_type>(). These, of course, will break if size_type is changed from int32_t to int64_t and would have to be fixed.

Regardless, it got me thinking that there is no reason the chars column needs to be created as INT8. There is nothing preventing creating the column as INT64 and then still referencing the data as data<char>(). This would require the offsets column to also be changed from INT32 to INT64 in order to address the larger chars buffer.

With size_type=32-bits, the strings column would still max out at 2B elements, but the maximum total byte size of the column would go from 2GB to 16GB. Considering that strings are immutable and most strings operations require making a copy, I think a 16GB max strings column size is reasonable.
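The capacity arithmetic behind the 2GB-to-16GB figure, as a standalone sketch (assuming size_type stays 32-bit and the chars child is simply typed INT64 while still being read as raw bytes):

#include <cstdint>
#include <iostream>

int main()
{
  // With a 32-bit size_type, any single column is capped at INT32_MAX elements.
  std::int64_t const max_elements = INT32_MAX;  // ~2.1 billion

  // Chars child typed INT8: one byte of character data per element.
  std::int64_t const max_bytes_int8 = max_elements * 1;   // ~2 GB of characters

  // Chars child typed INT64 but still read via data<char>(): eight bytes per element.
  std::int64_t const max_bytes_int64 = max_elements * 8;  // ~16 GB of characters

  std::cout << max_bytes_int8 << " vs " << max_bytes_int64 << " bytes\n";
}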

This change would only affect strings code. Also, I think the change would be minimal -- affecting mostly the factories used to create strings columns and any code that accesses offsets and chars directly (e.g. create_offsets, concatenate, contiguous_split). The majority of the strings operations should be unaffected since they generally access the column's data using the element<string_view>(idx) method (which requires a minor change). The child columns' offset and size members are not used by design.

I cannot speak to how this would affect the Cython code interface though.

jrhemstad commented 4 years ago

I noticed a few places in the libcudf code where columns are created with INT32 type but then referenced using data<size_type>(). These, of course, will break if size_type is changed from int32_t to int64_t and would have to be fixed.

Regardless, it got me thinking that there is no reason the chars column needs to be created as INT8. There is nothing preventing creating the column as INT64 and then still referencing the data as data<char>(). This would require the offsets column to also be changed from INT32 to INT64 in order to address the larger chars buffer.

With size_type=32-bits, the strings column would still max out at 2B elements, but the maximum total byte size of the column would go from 2GB to 16GB. Considering that strings are immutable and most strings operations require making a copy, I think a 16GB max strings column size is reasonable.

This change would only affect strings code. Also, I think the change would be minimal -- affecting mostly the factories used to create strings columns and any code that accesses offsets and chars directly (e.g. create_offsets, concatenate, contiguous_split). The majority of the strings operations should be unaffected since they generally access the column's data using the element<string_view>(idx) method (which requires a minor change). The child columns' offset and size members are not used by design.

I cannot speak to how this would affect the Cython code interface though.

I've been mulling over the same idea. Some of the concerns I've had:

davidwendt commented 4 years ago

I had the same thoughts but here is my rationalization:

  • I think it is no more dishonest to say a char column is of type INT8 where the integers here are interpreted as char bytes.
  • If the column is released to the wild on its own, it is not much use without the offsets, nulls, etc. from the other components. And there is no way to distinguish an INT8 integer column from an INT8 char column.
  • I believe a device_buffer is always allocated on an 8-byte boundary so there are already extra bits on every column today whose total byte size is not a multiple of 8.

jrhemstad commented 4 years ago

I had the same thoughts but here is my rationalization:

  • I think it is no more dishonest to say a char column is of type INT8 where the integers here are interpreted as char bytes.
  • If the column is released to the wild on its own, it is not much use without the offsets, nulls, etc. from the other components. And there is no way to distinguish an INT8 integer column from an INT8 char column.
  • I believe a device_buffer is always allocated on an 8-byte boundary so there are already extra bits on every column today whose total byte size is not a multiple of 8.

Those are all good counterpoints. I think it's worth pursuing making the chars child a wider type if that is the predominant reason we're asking to move to a 64-bit size type.

harrism commented 4 years ago

It's definitely not a bad idea. But I suspect at some point we will run into cases where we want more than 2B elements in other types of columns. So this feels like kicking the can down the road (sometimes kicking the can is the right decision). So I'm wondering if anyone foresees the need for >2B total elements in children of other column types we may want to support.

Specifically, this would have to be cases where GPU memory is not the limit. So, for example, one could imagine a column of lists of type INT8, where you could have <2B lists but the total number of INT8s might need to be >2B. The type punning above would technically work here, but it would be more of an ugly hack.

trxcllnt commented 4 years ago

@harrism to your point, Arrow recently added LargeList, LargeUtf8 and LargeBinary types. If we do up size_type to 64-bit, we should be able to consume/produce these types zero-copy in hypothetical from_arrow/to_arrow C++ implementations.

pretzelpy commented 3 years ago

I am trying to read a 170M row parquet file into cudf, but get the error below. My data is mostly numeric, but I do have 2 columns, addresses and names, which contain long strings. I don't have an issue with a smaller parquet file of the same dimensions.

Is there any workaround for getting this data loaded into cudf? Or will I have to reduce the total number of characters to fewer than 2B?

RuntimeError: cuDF failure at: /opt/conda/envs/rapids/conda-bld/libcudf_1601280538305/work/cpp/src/strings/strings_column_factories.cu:56: total size of strings is too large for cudf column

harrism commented 3 years ago

Hi @pretzelpy would you be able to file a new issue with repro instructions? Thanks!

antonyscerri commented 3 years ago

Hi

I’ve just started looking at RAPIDS and the various components today. I was looking at PageRank and applying it to a large directed graph of 1.7B edges. With DiGraph this was runnable and got good performance for calculating PageRank as I was hoping to see. I did want to try it using an undirected variant of the same graph (which the default behaviour for Graph does by creating inverted copies of the edges). However this ran into an issue with what I believe is a limit on the size of a DataFrame (as defined by the maximum size of a column).

I couldn’t find anything explicitly stating the limit in the documentation (I’ve not been through it comprehensively, so happy to be pointed at anything). It seems the current limit might be for legacy or compatibility reasons. I was able to find this issue talking about the maximum column size, but most of the discussion seemed to be about sizes for data types. This other issue I found also talked about concatenation of large dataframes but is closed: https://github.com/rapidsai/cudf/issues/2474 . I wasn’t sure why the number of rows is limited to INT32 rather than UINT32, unless it’s to easily support basic math comparing dataframe sizes etc. where negative numbers may be required; otherwise it would seem that leveraging a UINT32 datatype would give double the current limit without necessarily using more storage (part of the discussion around switching to INT64 in this issue). There was mention of compatibility with Apache Arrow, and the best I could find that seems to tie into this is https://arrow.apache.org/docs/cpp/arrays.html#size-limitations-and-recommendations .

I also came across this post, https://medium.com/rapids-ai/tackling-large-graphs-with-rapids-cugraph-and-unified-virtual-memory-b5b69a065d4 , which talks about the 32-bit indexing and future plans to move to 64-bit; however, I couldn’t find any other related issues on the topic.

So I just wanted to check whether this is a current limitation and not a bug in the Python wrapper (though I think it is simply mirroring the definition in the C++ header for libcudf), and whether there are any current workarounds (which I don’t believe there are). If that is the current situation, is there a pointer to plans for supporting larger data set sizes?

Thanks

Tony

kkranen commented 3 years ago

I'm similarly encountering this error when working with the Outbrains dataset. The dataset has a large file (page_views.csv) with a string-based secondary key (user id). One of the operations I'm working on compares the latest timestamp of a user/document combo to the timestamp of a given sample to generate an extremely salient feature. Currently, due to this issue, the way I have to go about this is:

  1. Load in the file partitioned using NVTabular
  2. Filter down the file to only user ids that occur in samples.
  3. Pass the resulting table to dask_cudf to do groupby+aggregation (with sufficient split-out)
  4. Pass the results of that to NVTabular to do a final set of external joins of the split-out table onto the table of samples

Currently, this takes ~40 lines of moderately arcane code to accomplish. I anticipate that with a change to the offset length, the entire process will be cut down to ~5 lines of easily understandable code (just loads and joins in cudf or dask_cudf)

EvenOldridge commented 3 years ago

We've got a similar issue with another dataset where there's a text column. We're pretty sure that the offset column used in the text representation is overflowing. It's breaking the workflow as soon as we try to load the column into cudf. @gabrielspmoreira can you post the error message?

pretzelpy commented 3 years ago

Hi @pretzelpy would you be able to file a new issue with repro instructions? Thanks!

I ended up using encoding, converting all of the strings to integers, plus a lookup table that associates each string with its proxy integer.

I would be happy to help you chase the issue, but my data is confidential customer data. :(

ChuckHastings commented 3 years ago

@antonyscerri - What you are describing is a cugraph limitation. It is a known issue; I believe there is a current timeline for addressing it. I recommend creating a new issue in the cugraph GitHub repo (https://github.com/rapidsai/cugraph/issues) so it can be discussed and tracked there.

davidwendt commented 3 years ago

From the discussions above, I think supporting large columns (strings as well as other types) will require converting cudf::size_type to int64_t. This (almost) automatically makes it possible to support large fixed-width columns (e.g. ints, floats, timestamps, fixed-point) but may require some special logic for groupby/join (as mentioned above) and perhaps partitioning. Overall, this conversion will not be a trivial task and affects not just Python cudf and Spark but upstream repos like clx and cuml. I noticed that the Apache Arrow 2.0 docs mention they support int64 for column sizes and null counts, so we would still be Arrow compatible.

I started experimenting with changing cudf::size_type to int64_t in libcudf to first see what compile issues it created. So far, I've had to modify over 70 source files (including gtests and gbenchmarks) where we used size_type and int32_t (and int) interchangeably. Many of these are trivial casting changes. Non-trivial changes are mostly in the strings code, but many exist in other places. Along with the hash/groupby issue mentioned previously, I suspect there is a non-trivial impact on cuIO to create and store larger columns as well.

Finally, I agree with the concern about this change affecting column types that have offsets child columns. Changing the offsets columns to INT64 would create larger memory requirements for the same column data we support today. However, we do not have to fix the offsets column type to either INT32 or INT64. The dictionary column type currently supports indices columns of any integer type. It relies on the libcudf indexalator iterator to normalize access to the integer data in this child column. This is an approach we could take with the offsets column as well -- allowing it to be either INT32 or INT64 as appropriate, and without needing to create any new types. This would also be a non-trivial change but would be mostly isolated to the specialization logic for strings, lists, and structs. Changes could be minimal in places where iterators are used instead of raw pointers as well.
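For illustration, a conceptual sketch of such a normalizing reader (illustrative names only; the real libcudf indexalator is more general and device-side): it hides whether the underlying offsets are stored as 32-bit or 64-bit integers and always yields a 64-bit value.

#include <cstdint>

// Reads an offsets buffer whose storage width (INT32 or INT64) is chosen at
// runtime; callers always receive std::int64_t values.
class offset_reader {
 public:
  offset_reader(void const* data, bool offsets_are_64bit)
    : data_{data}, is_64bit_{offsets_are_64bit} {}

  std::int64_t operator[](std::int64_t i) const
  {
    return is_64bit_ ? static_cast<std::int64_t const*>(data_)[i]
                     : static_cast<std::int32_t const*>(data_)[i];
  }

 private:
  void const* data_;
  bool is_64bit_;
};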

jrhemstad commented 3 years ago

This is an approach we could take with the offsets column as well -- allowing it to be either INT32 or INT64 as appropriate, and without needing to create any new types. This would also be a non-trivial change but would be mostly isolated to the specialization logic for strings, lists, and structs. Changes could be minimal in places where iterators are used instead of raw pointers as well.

This seems like it would be very non-trivial as you'd have to somewhere make the decision when making a new dictionary column about whether you're going to use 32 or 64 bit indices. I don't know how you make that decision without just assuming you always use the largest size, or first introspecting the data to determine what size you need.

The same applies for nested types. At some point you need to make a decision about what size indices you need to use. In general, I don't know how you do that without just always using the biggest size, or running the computation twice and instantiating code paths for all possible required sizes.

nvdbaranec commented 3 years ago

Fixed-point types have a similar issue. Knowing it's "fixed point" isn't enough. You need to know the scale value as well, so you can't generalize fixed-point types as arithmetic, and every place that uses fixed-point types needs to be specialized.

I wonder if it would make sense to have an OFFSET type_id that also comes with additional info in the data_type indicating its size. It would be a big undertaking to get everything that uses offsets to work this way, but once you do, you've insulated yourself well.

davidwendt commented 3 years ago

For strings, I've found that creating the offsets array by first building the array of sizes and then performing a scan over the sizes to get the offset values is faster than doing a single transform-scan with the sizes-calc functor (mainly because of the aggressive inlining in thrust). This means that if we limit individual string sizes in a column to int32, we could first build the sizes column with int32 values and then add a reduce using an int64 accumulator to calculate the total size. So the process introduces an extra reduce step. If the size is small enough, we can scan into the existing int32 array to build the offsets output column. Otherwise, we need to scan into a new int64 array and the original int32 array is then discarded.
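A host-side sketch of that sequence (using std::accumulate and a plain loop as stand-ins for the Thrust device algorithms libcudf would actually use; the function name is illustrative):

#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Build offsets from per-string sizes: reduce with an int64 accumulator first,
// then scan. Only the int64 output path is shown for brevity.
std::vector<std::int64_t> build_offsets(std::vector<std::int32_t> const& sizes)
{
  // 1. Reduce the int32 sizes into an int64 total so the sum cannot overflow.
  std::int64_t const total_bytes =
    std::accumulate(sizes.begin(), sizes.end(), std::int64_t{0});

  // 2. In the scheme described above, total_bytes <= INT32_MAX means the scan
  //    could reuse the existing int32 array; otherwise scan into a new int64 array.
  bool const needs_64bit_offsets = total_bytes > INT32_MAX;
  (void)needs_64bit_offsets;  // drives the int32-vs-int64 choice; int64 path shown

  // 3. Exclusive scan of the sizes produces the offsets.
  std::vector<std::int64_t> offsets(sizes.size() + 1, 0);
  for (std::size_t i = 0; i < sizes.size(); ++i) {
    offsets[i + 1] = offsets[i] + sizes[i];
  }
  return offsets;
}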

Limiting an individual string's size to the int32 max seems reasonable since we could not store many such strings anyway. But I don't think we could make the same restriction for other types like nested lists.

jlowe commented 3 years ago

But I don't think we could make the same restriction for other types like nested lists.

A list column's offsets are tracking the row offset of the immediate child column, and I think it's quite reasonable to have a limit of 2^31 child rows per parent list row, similar to the 2^31 limit on bytes for a single row of a string column. That's a huge improvement from where we are now.

This would enable a list-of-strings column where the list column offsets are 32-bit but the string column offsets are 64-bit. I don't know enough about the libcudf nested type offsets-of-offsets handling to see if that's going to be problematic in practice.

nvdbaranec commented 3 years ago

I don't know enough about the libcudf nested type offsets-of-offsets handling to see if that's going to be problematic in practice.

I wouldn't expect it to be too bad. Most things that process lists end up just fobbing off the final leaf work to the appropriate type handler (strings, etc.). So as long as strings inherently handle 32/64-bit offsets, it should be ok. There are oddball edge cases like contiguous_split, which gloms string and list offsets together as the same type, but that's relatively easy to fix.

titericz commented 3 years ago

I hit the wall with this string limit (2^31) today when trying to load a large dataset with 185M rows and 2 string features. The dataset comes from the Kaggle TalkingData competition. In real life we can expect much larger datasets, so I believe it's very important to expand that limit in cudf.

kkraus14 commented 3 years ago

Hi @titericz, I would highly recommend using a framework built on top of cudf in this situation instead of cudf directly, i.e. dask-cudf, which will automatically handle partitioning your data to alleviate this.

harrism commented 3 years ago

To update this issue, we've tentatively decided that we will NOT be changing cudf::size_type in the near term because of the excessive memory use cost this would create for a lot of cases (e.g. every offsets column would be twice as large).

ttnghia commented 3 years ago

Is it possible to have a new big_column column type that uses big_offset_type and big_size_type (int64_t) for storing very large data sets? The typical column could then be converted into a big_column whenever needed. Of course there would be many technical difficulties with this, but at least there would be something (a workaround) that could serve as a solution for handling large datasets...

harrism commented 3 years ago

So we double our library size and compile time?

EvenOldridge commented 3 years ago

If that's the only option then yes, we do. This problem isn't going away and it's becoming a more common problem in the RecSys space where datasets are massive. This is a huge blocker for adoption of NVTabular by the bigger companies unless we find a solution.


jrhemstad commented 3 years ago

where datasets are massive. This is a huge blocker for adoption of NVTabular by the bigger companies unless we find a solution.

Overall dataset size doesn't really matter so long as you partition it into small enough pieces.

ttnghia commented 3 years ago

Overall dataset size doesn't really matter so long as you partition it into small enough pieces.

Yeah, this is a good (and general) solution. Given this idea, I think it would be a huge benefit to users to add a utility function that outputs a list of column_views from (const) raw buffers, similar to the column_view constructor that constructs a column_view from const pointers to data and bitmask:

column_view_base(data_type type,
                   size_type size,
                   void const* data,
                   bitmask_type const* null_mask = nullptr,
                   size_type null_count          = UNKNOWN_NULL_COUNT,
                   size_type offset              = 0);

The utility API (let's call it std::vector<column_view> column_views_from_large_buffer() or whatever) would output a list of column_views by segmenting the given buffers and attaching a column_view to each segment. Then for any operation, we can use:

auto const column_views = column_views_from_large_buffer(/* large raw buffers */);
for (auto const& col : column_views) {
    do_whatever(col);
}

Of course, the users can implement this utility function very easily, but it's very helpful if we provide it.
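A minimal host-side sketch of such a helper (the name comes from the comment above; this assumes a fixed-width type with no null mask, and takes the element width as a parameter rather than deriving it from the data_type):

#include <cudf/column/column_view.hpp>
#include <cudf/types.hpp>

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Segments one large fixed-width device buffer into column_views that each
// stay under the size_type element limit. element_size is the width in bytes
// of a single element of `type`.
std::vector<cudf::column_view> column_views_from_large_buffer(
  cudf::data_type type,
  void const* data,
  std::int64_t num_elements,
  std::size_t element_size,
  cudf::size_type max_chunk_rows = std::numeric_limits<cudf::size_type>::max())
{
  std::vector<cudf::column_view> views;
  auto const* bytes = static_cast<std::uint8_t const*>(data);
  for (std::int64_t start = 0; start < num_elements; start += max_chunk_rows) {
    auto const rows = static_cast<cudf::size_type>(
      std::min<std::int64_t>(max_chunk_rows, num_elements - start));
    // No null mask, zero null count; each view points into the shared buffer.
    views.emplace_back(type, rows, bytes + start * element_size, nullptr, 0);
  }
  return views;
}

Each returned view can then be passed to regular libcudf operations, as in the loop above.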

harrism commented 3 years ago

@EvenOldridge We have discussed this pretty thoroughly and have pretty much decided that making cudf::size_type 64-bit is not the solution. The reason is that it would double memory usage for columns of offsets (which also use cudf::size_type), which are common in nested data types (strings, lists, structs). Since anyone processing massive datasets with cuDF is likely using a distributed computing framework (e.g. Dask or Spark), we feel the solution is a) to partition large datasets appropriately, and b) for cuDF to provide the tools necessary for those frameworks to partition when necessary (e.g. compute the output size of a join before materializing the result).

We should probably open a new issue. I made the error of making the title of this issue a proposed solution rather than a description of the problem.

EvenOldridge commented 3 years ago

If this can be solved through partitioning that's awesome. We're working on examples that show the difference in perf and the errors that occur on data where the partitions are too large and that will hopefully prevent a lot of errors on the customer side.

But we commonly run into issues with text data, and also occasionally with output sets that are larger than the size_type limits, that I'd like to have a clear solution for. It doesn't have to be this one. @rjzamora was suggesting that it should be possible to overcome that on our end. Still, it might be worth thinking about at the cudf level too for those cases.