Closed harrism closed 2 years ago
A potential downside: @felipeblazing called us brave for considering supporting chunked columns. If we implement this feature request, perhaps he will not consider us quite so brave. :(
🤣
Handle compatibility with Arrow by creating arrow chunked arrays in the libcudf to_arrow interface (not yet created), and combine arrow chunked arrays in the libcudf from_arrow interface. This can be dealt with when we create these APIs.
We have these APIs on the Python side already, so we'd ideally need these APIs sooner rather than later, though I do believe the PyArrow CPU side will handle splitting chunks for us.
This seems like a really reasonable way to ensure that we are compatible with larger columns almost for free. If we want to revisit later that might make sense but right now I can imagine that the discussions alone for going to chunked representation would be long.
My biggest concern is the amount of memory this will take and the performance impact it will have. In a lot of the data sets we encounter, the length of the strings is small, often very small. Doing this in many of these cases will effectively double the amount of memory used to store a string, and hence cut the performance of all string operations in half, assuming they are bound by memory access speed. This will also be true for arrays when they show up. Not to mention that we would use a lot more precious GPU memory to store offsets that we could have used to store actual data.
I just want to be sure that if we do this for performance reasons, we do some performance and capacity measurements before and after to see what impact this will actually have.
@revans2 that is a valid concern. We could handle it by having a string datatype variant that stores offsets in int32_t. Similar to how we have different TIMESTAMP datatypes, we could implement STRING variants. This would be more of a future optimization that we don't have to implement right now, but knowing we could do it in the future means we would not necessarily paint ourselves into a corner, so to speak, by adopting a 64-bit size_type.
The reason for switching to 64-bit is more about capability than performance. Right now there are ways to break it with data that is well within the memory limits of the GPU. I agree there will be a perf penalty, which we can evaluate, but the alternatives are pretty bleak: either telling users "don't make your columns too big" or taking on another massive refactor project.
There are a number of cudf functions that use cudf::size_type to create intermediate data structures (e.g. build an array of row offsets in some kernel and then use gather/scatter to create a new table). Doubling the size of cudf::size_type would double the temporary memory used by these methods - essentially halving the size table you could effectively operate on for one GPU.
To get around that you could either change these to use an int32 directly (although as GPU memory sizes increase that would be an artificial limit), or create 32 and 64 bit versions of these internal algorithms and call the 32 bit version if the number of rows < 2 billion. Could make things a bit more complicated.
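The second option above can be sketched on the host side (this is not libcudf code; gather_rows and gather are hypothetical names): compile the internal algorithm for both index widths, and dispatch to the narrow 32-bit instantiation whenever the row count allows it, so small tables don't pay the 2x temporary-memory cost.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Internal algorithm templated on the index type used for row offsets.
template <typename IndexType>
std::vector<int> gather_rows(std::vector<int> const& input,
                             std::vector<IndexType> const& gather_map) {
  std::vector<int> out;
  out.reserve(gather_map.size());
  for (IndexType idx : gather_map) {
    out.push_back(input[static_cast<std::size_t>(idx)]);
  }
  return out;
}

// Public entry point: pick the 32-bit instantiation when the table is
// small enough to be indexed with int32, otherwise fall back to 64-bit.
std::vector<int> gather(std::vector<int> const& input,
                        std::vector<std::int64_t> const& map) {
  if (input.size() <=
      static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max())) {
    std::vector<std::int32_t> narrow(map.begin(), map.end());  // fits in 32 bits
    return gather_rows<std::int32_t>(input, narrow);
  }
  return gather_rows<std::int64_t>(input, map);
}
```

The obvious cost, as noted, is that every internal algorithm is instantiated twice, which grows library size and compile time.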
Quick scan of the code doesn't turn up much in the way of portability problems. Every *4 or * 4 seems legit. No incorrect usage of 0xffffffff or 0x80000000. The handful of places calling sizeof(size_type) seem legit as well.
Pure mismatches between size_type and int/int32_t are harder to detect without actually compiling and testing, of course, but at least on the surface level there doesn't appear to be any deep-rooted issue with implicit assumptions that size_type == 4 bytes.
@ChuckHastings do you have a pointer to an example? I think that would be helpful.
To @revans2's comment: we have actually already run into the case I mentioned of a string column with < 2B strings but > 2B characters. Not saying it's a common case, but it's an ugly situation.
@ChuckHastings do you have a pointer to an example? I think that would be helpful.
Groupby/Join are an example of this. The hash map uses size_type for both the keys and values (since they are row offsets into tables). This would double the size of the hash map for both join and groupby.
How big is the hash map?
Join creates the following intermediate products:
- A hash map of pair<cudf::size_type, cudf::size_type> with 2 * (number of rows in the smaller table) elements. The 2x size is the default, which is not currently overridden by the implementation.
- cudf::size_type elements for each row of output in the resulting join (technically we estimate the size, so it could be slightly larger than this).

For the intermediates, could we detect that the column size is < 2B and use int32 for the hash map and VectorPair?
That's what I was intending by my suggestion above. We could create a template parameter that identifies whether to use int32 or int64 for the hash map and vector pair and then call the correct version depending on the size of the tables.
Just as a note, we will want to update any lines like this one that hard-code this type for higher level code (Python and/or Java).
Yes, java code would have to be updated to handle this. It is not a big deal though and I am happy to do it.
I have also run into situations where a string column cannot be stored because there are too many characters in it, so I know that it is a real issue.
But, just to play devil's advocate... When we first started work on Spark and mentioned that there were algorithms in Spark that would spill to disk so they could operate on data larger than fits in host memory, we were told that was not really a priority because we could always partition the data into smaller pieces. Why is that not true here too? Why is this a question of capability when you have the option to partition the data into smaller pieces? Yes, I know that cudf itself is not going to be partitioning the data, a la chunked Arrow vectors, but Spark chunks the data itself without any cudf knowledge, and I know Dask can as well.
Yep agreed. Just making sure that point doesn't get lost in the shuffle (pun not intended 😉).
Yep that's true of both frameworks.
There's a tradeoff between using algorithms that operate on local data vs. distributed data. The increased size would let users tune this more appropriately for their workflows to get the desired performance.
I have also run into situations where a string column cannot be stored because there are too many characters in it, so I know that it is a real issue.
I think this is the core issue. If we could have max(int32_t) strings I think everyone would be a happy camper, but having max(int32_t) bytes severely limits the number of elements you can have in a table before needing to partition it where it can adversely impact performance.
The main thing is that libcudf does not want to have to write algorithms that operate on chunked columns. We prefer dask/spark to do the distributed algorithms, and libcudf provide the primitives on which to build them.
If we can solve the strings column problem by having a different size_type for the size of character arrays, that might reduce the current pain. Note that this problem probably affects other nested types too (e.g. lists). @jrhemstad @davidwendt this suggestion goes against the grain of the current design of strings columns / nested columns, where the data array (characters, in the case of strings) is just a child column.
If we can solve the strings column problem by having a different size_type for the size of character arrays, that might reduce the current pain.
It's not really possible to change the size_type for only character columns. It's an all-or-nothing deal.
I noticed a few places in the libcudf code where columns are created with INT32 type but then referenced using data<size_type>(). These, of course, will break if size_type is changed from int32_t to int64_t and would have to be fixed.
Regardless, it got me thinking that there is no reason the chars column needs to be created as INT8. There is nothing preventing creating the column as INT64 and then still referencing the data as data<char>(). This would require the offsets column to also be changed from INT32 to INT64 in order to address the larger chars buffer.
With size_type=32-bits, the strings column would still max out at 2B elements, but the maximum total byte size of the column would go from 2GB to 16GB. Considering that strings are immutable and most strings operations require making a copy, I think a 16GB max strings column size is reasonable.
This change would only affect strings code. Also, I think the change would be minimal -- affecting mostly the factories used to create strings columns and any code that accesses offsets and chars directly (e.g. create_offsets, concatenate, contiguous_split). The majority of the strings operations should be unaffected since they generally access the column's data using the element<string_view>(idx) method (which requires a minor change). The child columns' offset and size members are not used by design.
I cannot speak to how this would affect the Cython code interface though.
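The chars-as-a-wider-type idea above can be illustrated with a small host-side sketch (plain C++, not libcudf's actual factories; the helper names are made up): the storage is typed as 8-byte integers, but the bytes are read back through a char pointer, which is one of the few well-defined forms of type punning. With 32-bit element counts, 2^31 elements of 8 bytes each gives the 16GB figure mentioned above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Back the character data with 8-byte elements (stand-in for an INT64
// column's device buffer), rounding the byte count up to whole elements.
std::vector<std::int64_t> make_chars_storage(char const* s, std::size_t n) {
  std::vector<std::int64_t> storage((n + 7) / 8, 0);
  std::memcpy(storage.data(), s, n);  // copy the raw character bytes in
  return storage;
}

// Read the data back as chars (stand-in for data<char>() on the column).
// Punning through char is well-defined in C++.
char read_char(std::vector<std::int64_t> const& storage,
               std::size_t byte_index) {
  return reinterpret_cast<char const*>(storage.data())[byte_index];
}
```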
I've been mulling over the same idea. Some of the concerns I've had:
- If the chars column is typed INT64, one would expect its elements to correspond to 8-byte integer values. But instead, by convention, we're saying this column's elements are actually 1-byte integers.
- It would be confusing if the chars column was ever released from its strings column owner. There would no longer be any way to differentiate a "real" INT64 column from the type-punned chars child.
- char is one of the few cases where type punning is well-defined, but my hackles rise any time type punning is going to be used like this.

I had the same thoughts but here is my rationalization:
- I think it is no more dishonest to say a char column is of type INT8, where the integers are interpreted as char bytes.
- If the column is released to the wild on its own, it is not much use without the offsets, nulls, etc. from the other components. And there is no way to distinguish an INT8 integer column from an INT8 char column.
- I believe a device_buffer is always allocated on an 8-byte boundary, so there are already extra bits on every column today whose total byte size is not a multiple of 8.
Those are all good counterpoints. I think it's worth pursuing making the chars child be a wider type if that is the predominant reason we're asking to move to a 64bit size type.
It's definitely not a bad idea. But I suspect at some point we will run into cases where we want more than 2B elements in other types of columns. So this feels like kicking the can down the road (sometimes kicking the can is the right decision). So I'm wondering if anyone foresees the need for >2B total elements in children of other column types we may want to support.
Specifically, this would have to be cases where GPU memory is not the limit. So, for example, one could imagine a column of lists of type INT8, and you could have <2B lists, but total number of INT8s might need to be >2B. The type punning above would technically work here, but it would be more of an ugly hack.
@harrism to your point, Arrow recently added LargeList, LargeUtf8 and LargeBinary types. If we do up size_type to 64-bit, we should be able to consume/produce these types zero-copy in hypothetical from_arrow/to_arrow C++ implementations.
I am trying to read a 170M row parquet file into cudf, but get the error below. My data is mostly numeric, but I do have 2 columns, addresses and names, which contain long strings. I don't have an issue with a smaller parquet file of the same dimensions.
Is there any workaround for getting this data loaded into cudf? Or will I have to reduce the total number of characters to fewer than 2B?
RuntimeError: cuDF failure at: /opt/conda/envs/rapids/conda-bld/libcudf_1601280538305/work/cpp/src/strings/strings_column_factories.cu:56: total size of strings is too large for cudf column
Hi @pretzelpy would you be able to file a new issue with repro instructions? Thanks!
Hi
I’ve just started looking at RAPIDS and the various components today. I was looking at PageRank and applying it to a large directed graph of 1.7B edges. With DiGraph this was runnable and got good performance for calculating PageRank as I was hoping to see. I did want to try it using an undirected variant of the same graph (which the default behaviour for Graph does by creating inverted copies of the edges). However this ran into an issue with what I believe is a limit on the size of a DataFrame (as defined by the maximum size of a column).
I couldn’t find anything explicitly stating the limit in the documentation (I’ve not been through it comprehensively, so happy to be pointed at anything). It seems the current limit might be for legacy or compatibility reasons. I was able to find this issue talking about the maximum column size, but most of the discussion seemed to be about sizes for data types. This other issue I found also talked about concatenation of large dataframes but is closed: https://github.com/rapidsai/cudf/issues/2474 . I wasn’t sure why the number of rows is limited to INT32 rather than UINT32, unless it’s to easily support basic math comparing dataframe sizes etc. where negative numbers may be required; otherwise it would seem that leveraging the UINT32 datatype would give double the current limit without necessarily using more storage (part of the discussion around switching to INT64 in the first issue I cited). There was mention of compatibility with Apache Arrow, and the best I could find that seems to tie into this is https://arrow.apache.org/docs/cpp/arrays.html#size-limitations-and-recommendations .
I also came across this post, https://medium.com/rapids-ai/tackling-large-graphs-with-rapids-cugraph-and-unified-virtual-memory-b5b69a065d4 , which talks about the 32-bit indexing and future plans to move to 64-bit; however, I couldn’t find any other related issues on the topic.
So I just wanted to check whether this is a current limitation and not a bug in the Python wrapper (though I think it is simply mirroring the definition in the C++ header for libcudf), and whether there are any current workarounds (which I don’t believe there are). If that is the current situation, is there a pointer to plans for supporting larger data set sizes?
Thanks
Tony
I'm similarly encountering this error when working with the Outbrain dataset. The dataset has a large file (page_views.csv) with a string-based secondary key (user id). One of the operations I'm working on compares the latest timestamp of a user/document combo to the timestamp of a given sample to generate an extremely salient feature. Currently, due to this issue, the way I have to go about this is:
Currently, this takes ~40 lines of moderately arcane code to accomplish. I anticipate that with a change to the offset length, the entire process will be cut down to ~5 lines of easily understandable code (just loads and joins in cudf or dask_cudf)
We've got a similar issue with another dataset where there's a text column. We're pretty sure that the offset column used in the text representation is overflowing. It's breaking the workflow as soon as we try to load the column into cudf. @gabrielspmoreira can you post the error message?
Hi @pretzelpy would you be able to file a new issue with repro instructions? Thanks!
I ended up using encoding, converting all of the strings to an integer. Then a lookup table that associated the strings with the proxy integer.
I would be happy to help you chase the issue, but my data is confidential customer data. :(
@antonyscerri - What you are describing is a cugraph limitation. It is a known issue, I believe there is a current timeline for addressing this. I recommend creating a new issue in the cugraph GitHub repo (https://github.com/rapidsai/cugraph/issues) so it can be discussed and tracked there.
From the discussions above, I think supporting large columns (strings as well as other types) will require converting cudf::size_type to int64_t. This (almost) automatically makes it possible to support large fixed-width columns (e.g. ints, floats, timestamps, fixed-point) but may require some special logic for groupby/join (as mentioned above) and perhaps partitioning. Overall, this conversion will not be a trivial task and affects not just Python cudf and Spark but upstream repos like clx and cuml. I noticed that the Apache Arrow 2.0 docs mention they support int64 for column sizes and null counts, so we would still be Arrow compatible.
I started experimenting with changing cudf::size_type to int64_t in libcudf to first see what compile issues it created. So far, I've had to modify over 70 source files (including gtests and gbenchmarks) where we used size_type and int32_t (and int) interchangeably. Many of these are trivial casts. Non-trivial changes are mostly in the strings code, but many exist in other places. Along with the hash/groupby issue mentioned previously, I suspect there is a non-trivial impact on cuIO to create and store larger columns as well.
Finally, I agree with the concern about this change affecting column types that have offsets child columns. Changing the offsets columns to INT64 would create larger memory requirements for the same column data we support today. However, we do not have to fix the offsets column type to either INT32 or INT64. The dictionary column type currently supports indices columns of any integer type. It relies on the libcudf indexalator iterator to normalize access to the integer data in this child column. This is an approach we could take with the offsets column as well -- allowing it to be either INT32 or INT64 type as appropriate, and without needing to create any new types. This would also be a non-trivial change but would be mostly isolated to the specialization logic for strings, lists, and structs. Changes could be minimal in places where iterators are used instead of raw pointers as well.
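A minimal host-side sketch of the normalization idea (this is not libcudf's actual indexalator, just an illustration with a made-up offset_reader class): callers index the offsets through one interface regardless of whether the underlying buffer holds int32 or int64 values, so algorithm code is written once.

```cpp
#include <cstdint>

// Normalizing accessor over an offsets buffer of either width. Reads
// always come back widened to int64, so downstream code is agnostic
// to the stored type.
class offset_reader {
 public:
  offset_reader(void const* data, bool is_64bit)
    : data_(data), is_64bit_(is_64bit) {}

  std::int64_t operator[](std::int64_t i) const {
    // Reinterpret the buffer with the width it was actually stored at.
    return is_64bit_
             ? static_cast<std::int64_t const*>(data_)[i]
             : static_cast<std::int64_t>(
                 static_cast<std::int32_t const*>(data_)[i]);
  }

 private:
  void const* data_;
  bool is_64bit_;
};
```

The real indexalator additionally behaves as a random-access iterator so it can be passed to thrust algorithms, but the type-erasure principle is the same.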
This is an approach we could take with the offsets column as well -- allowing it to be either INT32 or INT64 type as appropriate, and without needing to create any new types. This would also be a non-trivial change but would be mostly isolated to the specialization logic for strings, lists, structs. Changes could be minimal in places where iterators are used instead of raw pointers as well.
This seems like it would be very non-trivial as you'd have to somewhere make the decision when making a new dictionary column about whether you're going to use 32 or 64 bit indices. I don't know how you make that decision without just assuming you always use the largest size, or first introspecting the data to determine what size you need.
The same applies for nested types. At some point you need to make a decision about what size indices you need to use. In general, I don't know how you do that without just always using the biggest size, or running the computation twice and instantiating code paths for all possible required sizes.
Fixed-point types have a similar issue. Knowing it's "fixed point" isn't enough; you need to know the scale value as well, so you can't generalize fixed-point types as arithmetic, and every place that uses fixed-point types needs to be specialized.
I wonder if it would make sense to have an OFFSET type_id that also comes with additional info in the data_type indicating its size. It would be a big undertaking to get everything that uses offsets to work this way, but once you do, you've insulated yourself well.
For strings, I've found that creating the offsets array by first building the array of sizes and then performing a scan over the sizes to get the offset values is faster than doing a single transform-scan over the size-calculation functor (mainly because of the aggressive inlining in thrust). This means that if we limit individual strings in a column to int32 sizes, we could first build the sizes column with int32 values and then add a reduce using an int64 accumulator to calculate the total size. So the process introduces an extra reduce step. If the total size is small enough, we can scan into the existing int32 array to build the offsets output column. Otherwise, we scan into a new int64 array and discard the original int32 array.
Limiting an individual string's size to max::int32 seems reasonable since we could not store many such strings anyway. But I don't think we could make the same restriction for other types like nested lists.
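The sizes-then-scan process described above can be sketched on the host (libcudf would do this with thrust reduce/scan on the device; build_offsets is a hypothetical helper): reduce the int32 sizes with an int64 accumulator, then scan into either an int32 or an int64 offsets array depending on the total.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <numeric>
#include <variant>
#include <vector>

// The offsets column ends up either narrow or wide.
using offsets_t =
  std::variant<std::vector<std::int32_t>, std::vector<std::int64_t>>;

offsets_t build_offsets(std::vector<std::int32_t> const& sizes) {
  // Extra reduce step: accumulate in int64 so the total cannot overflow.
  std::int64_t total =
    std::accumulate(sizes.begin(), sizes.end(), std::int64_t{0});

  if (total <= std::numeric_limits<std::int32_t>::max()) {
    // Total fits: scan directly into an int32 offsets array.
    std::vector<std::int32_t> offsets(sizes.size() + 1, 0);
    std::partial_sum(sizes.begin(), sizes.end(), offsets.begin() + 1);
    return offsets;
  }
  // Total too large: scan into a fresh int64 offsets array instead
  // (manual loop so the accumulation itself is done in int64).
  std::vector<std::int64_t> offsets(sizes.size() + 1, 0);
  for (std::size_t i = 0; i < sizes.size(); ++i) {
    offsets[i + 1] = offsets[i] + sizes[i];
  }
  return offsets;
}
```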
But I don't think we could make the same restriction for other types like nested lists.
A list column's offsets are tracking the row offset of the immediate child column, and I think it's quite reasonable to have a limit of 2^31 child rows per parent list row, similar to the 2^31 limit on bytes for a single row of a string column. That's a huge improvement from where we are now.
This would enable a list-of-strings column where the list column offsets are 32-bit but the string column offsets are 64-bit. I don't know enough about the libcudf nested type offsets-of-offsets handling to see if that's going to be problematic in practice.
I don't know enough about the libcudf nested type offsets-of-offsets handling to see if that's going to be problematic in practice.
I wouldn't expect it to be too bad. Most things that process lists end up just fobbing off the final leaf work to the appropriate type handler (string, etc). So as long as strings inherently handled 32/64 bit offsets, it should be ok. There's oddball edge cases like contiguous_split which glom string and list offsets together as the same type, but that's relatively easy to fix.
I hit the wall with this string limit (2**31 characters) today when trying to load a large dataset with 185M rows and 2 string features. The dataset comes from the Kaggle TalkingData competition. In real life, we can expect much larger datasets, so I believe it's very important to expand that limit in cudf.
Hi @titericz, I would highly recommend using a framework built on top of cudf in this situation instead of cudf directly, i.e. using dask-cudf, which will automatically handle partitioning your data to alleviate this.
To update this issue, we've tentatively decided that we will NOT be changing cudf::size_type in the near term because of the excessive memory cost this would create for a lot of cases (e.g. every offsets column would be twice as large).
Is it possible to have a new big_column column type that uses big_offset_type and big_size_type (int64_t) for storing very large data sets? The typical column could then be converted into a big_column whenever needed. Of course there would be many technical difficulties, but at least there would be something (a workaround) that could serve as a solution for handling large datasets...
So we double our library size and compile time?
If that's the only option then yes, we do. This problem isn't going away and it's becoming a more common problem in the RecSys space where datasets are massive. This is a huge blocker for adoption of NVTabular by the bigger companies unless we find a solution.
This is a huge blocker for adoption of NVTabular by the bigger companies unless we find a solution.
Overall dataset size doesn't really matter so long as you partition it into small enough pieces.
Overall dataset size doesn't really matter so long as you partition it into small enough pieces.
Yeah, this is a good (and general) solution. Given this idea, I think it would be a huge benefit to users to add a utility function that outputs a list of column_view from (const) raw buffers, similar to the column_view constructor that constructs a column_view from const pointers to data and bitmask:
column_view_base(data_type type,
size_type size,
void const* data,
bitmask_type const* null_mask = nullptr,
size_type null_count = UNKNOWN_NULL_COUNT,
size_type offset = 0);
The utility API (let's call it std::vector<column_view> column_views_from_large_buffer() or whatever) would output a list of column_views by segmenting the given buffers and attaching a column_view to each segment. Then for any operation, we can use
auto const column_views = column_views_from_large_buffer();
for(auto const& col : column_views) {
do_whatever(col);
}
Of course, the users can implement this utility function very easily, but it's very helpful if we provide it.
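A minimal sketch of such a helper (the names int_column_view and column_views_from_large_buffer are made up here, and a real version would work on cudf::column_view with null masks and arbitrary types): segment one large buffer into views whose element counts each fit the 32-bit limit, without copying any data.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Stand-in for a non-owning cudf::column_view over INT32 data.
struct int_column_view {
  std::int32_t const* data;
  std::int64_t size;
};

// Split [data, data + num_elements) into non-owning views of at most
// max_segment elements each; the last view holds the remainder.
std::vector<int_column_view> column_views_from_large_buffer(
    std::int32_t const* data,
    std::int64_t num_elements,
    std::int64_t max_segment) {
  std::vector<int_column_view> views;
  for (std::int64_t begin = 0; begin < num_elements; begin += max_segment) {
    std::int64_t len = std::min(max_segment, num_elements - begin);
    views.push_back({data + begin, len});
  }
  return views;
}
```

In practice max_segment would default to something at or below max(int32_t), so each resulting view is a valid column_view.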
@EvenOldridge We have discussed this pretty thoroughly, and we have pretty much decided that making cudf::size_type 64-bit is not the solution. The reason is that it would double memory usage for columns of offsets (which also use cudf::size_type), which are common in nested data types (strings, lists, structs). Since anyone processing massive datasets with cuDF is likely using a distributed computing framework (e.g. Dask or Spark), we feel the solution is:
a) to partition large datasets appropriately, and
b) for cuDF to provide the tools necessary for those frameworks to partition when necessary (e.g. compute the output size of a join before materializing the result).
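Point (b) can be illustrated with a host-side sketch of a count-then-materialize inner join (plain C++, not libcudf's actual hash join; inner_join_output_size is a hypothetical name): a cheap counting pass reports the output row count before anything is allocated, so a framework can repartition if the result would exceed size limits.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Count pass of an inner join on single int keys: returns how many
// output rows a subsequent materialization pass would produce,
// without allocating or building that output.
std::int64_t inner_join_output_size(std::vector<int> const& left,
                                    std::vector<int> const& right) {
  // Build side: key -> number of occurrences in the right table.
  std::unordered_map<int, std::int64_t> counts;
  for (int v : right) ++counts[v];

  // Probe side: each left row contributes one output row per match.
  std::int64_t total = 0;
  for (int v : left) {
    auto it = counts.find(v);
    if (it != counts.end()) total += it->second;
  }
  return total;
}
```

A framework would compare this count against its partition-size threshold and split the inputs before calling the materializing join.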
We should probably open a new issue. I made the error of making the title of this issue a proposed solution rather than a description of the problem.
If this can be solved through partitioning that's awesome. We're working on examples that show the difference in perf and the errors that occur on data where the partitions are too large and that will hopefully prevent a lot of errors on the customer side.
But we commonly run into issues with text data and also occasionally with output sets that are larger than the size_type limits that I'd like to have a clear solution for. Doesn't have to be this. @rjzamora was suggesting that it should be possible to overcome that on our end. Still might be worth thinking about at the cudf level too though for those cases.
Is your feature request related to a problem? Please describe.
cudf::size_type is currently an int32_t, which limits column size to two billion elements (MAX_INT). Moreover, it limits child column size to the same. This causes problems, for example, for string columns, where there may be fewer than 2B strings, but the character data to represent them could easily exceed 2B characters.
A 32-bit size was originally chosen to ensure compatibility with Apache Arrow, which dictates that Arrow arrays have a 32-bit size, and that larger arrays are made by chunking into individual Arrays.
Describe the solution you'd like
Change size_type to be an int64_t. Handle compatibility with Arrow by creating Arrow chunked arrays in the libcudf to_arrow interface (not yet created), and combine Arrow chunked arrays in the libcudf from_arrow interface. This can be dealt with when we create these APIs.
Describe alternatives you've considered
Chunked columns. This would be very challenging -- supporting chunked columns in every algorithm would result in complex distributed algorithms and implementations, where libcudf currently aims to be communication agnostic / ignorant. In other words, a higher level library handles distributed algorithms.
Additional context
A potential downside: @felipeblazing called us brave for considering supporting chunked columns. If we implement this feature request, perhaps he will not consider us quite so brave. :(