rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[DISCUSSION] libcudf column abstraction redesign #1443

Closed jrhemstad closed 4 years ago

jrhemstad commented 5 years ago

Creating and interfacing with the gdf_column C struct is rife with problems. We need better abstraction(s) to make libcudf a safer and more pleasant library to use and develop.

In lieu of adding support for any new column types, the initial goal of a cudf::column design is to ease the current pain points in creating and interfacing with the column data structure in libcudf.

Goals:

Non-Goals

Note that a “Non-Goal” is not something that we want to expressly forbid in our redesign, but rather are not the focus of the current effort. Whenever possible, we can make design decisions that will enable these “Non-Goals” sometime in the future, so long as those decisions do not compromise the timely accomplishment of the above “Goals”

Process

1. Gather pain points

2. Derive requirements

3. Design Mock Up

4. Implementation

5. Initial Refactor

6. Full Refactor

jrhemstad commented 5 years ago

My pain points:

  1. Users are burdened with explicitly allocating/freeing device memory for columns. Ownership of the memory is ill-defined and inconsistent.
  2. Basic interfacing with a column requires touching raw pointers. I.e., required to do things like static_cast<T>(col->data), if(nullptr == col->valid), gdf_is_valid(col->valid, i), etc.
  3. Allocating and interacting with the null bitmask requires explicit action by users such as calling gdf_valid_allocation_size and gdf_num_bitmask_elements.
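The raw-pointer ergonomics described above can be illustrated with a small host-side mock. The struct and helper below are simplified, hypothetical stand-ins (not the real gdf_column), just to show the kind of boilerplate every caller repeats today:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified mock of the legacy C struct.
struct gdf_column_mock {
  void*    data;   // type-erased device data pointer
  uint8_t* valid;  // packed null bitmask; may be nullptr (all rows valid)
  int      size;   // number of rows
};

// Every caller today hand-rolls checks like this:
inline bool is_valid(const gdf_column_mock* col, int i) {
  if (col->valid == nullptr) return true;     // no mask means no nulls
  return (col->valid[i / 8] >> (i % 8)) & 1;  // test bit i of the mask
}
```

A cudf::column abstraction would hide this bit-twiddling behind a member function instead of leaving it to each call site.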
kkraus14 commented 5 years ago

Pain Points

  1. Numba is currently used all over for memory allocation / deallocation; we would much rather have that owned by libcudf while maintaining interoperability with Numba device arrays.
  2. Python column classes are completely disjointed from libcudf. Should ideally reference a cudf::column object to be used via Cython for better cudf<->libcudf interop.
  3. Type specific implementations currently exist in Python for types like Strings and Categoricals, but ideally should be in cudf::column for better maintainability and cross-language functionality.
eyalroz commented 5 years ago

Pain points

  1. Baked-in meta-data/statistics: A null count might be a useful thing, but - so are the min and max values, or the number of distinct values. These should be in a separate data structure, or some mix-in that a subclass of the column class could adopt.
  2. No single-bit-width columns. Factor-8 waste of space? That's outrageous; egregious; preposterous.
  3. Huge amount of idiosyncratic code for the validity indicators. We could just use the fact that it's another (non-nullable) column and invoke existing code (which should support 1-bit-wide columns).
  4. No compositionality. For example, a column of strings (or other variable-length data) could be represented by a column of lengths and a column of offsets into a column of single characters (not the only possible representation); a nullable column, discussed in the previous item, is another case (and again, that's not the only reasonable representation). Addressing this fully would be a gargantuan task, but ignoring it completely is not a good idea either.
  5. A supposed "column" today can be in an invalid state, so you have to validate columns whenever you receive them.
  6. Not using modern C++ vocabulary types like optional and variant to express aspects of columns' semantics. It's the idiomatic thing to do.
  7. Difficult to switch from working with type-erased to type-imbued columns and vice versa.
  8. Mutability - it isn't well defined, nor enforced, who and when, if at all, should be able to change a column (the host-side members and the on-device data) once it's been constructed.
  9. Column names. This may be a nice feature for debugging, but in reality columns would be identified by some sort of taxonomy, e.g. schema+table+intra-table name, not by a single name, and it's not very useful to keep just the third part. We certainly don't want people to start encoding complex structures into the name, and in our code we don't know what names to give output columns anyway. So let's drop the names.
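The compositionality point in item 4 can be sketched host-side. This is one possible layout among several, with illustrative names and std::vector standing in for device buffers (not a libcudf design):

```cpp
#include <cassert>
#include <string>
#include <vector>

// One possible "composite" representation: a string column built from two
// child columns, where row i's string is chars[offsets[i] .. offsets[i+1]).
struct string_column_sketch {
  std::vector<int>  offsets;  // size + 1 entries
  std::vector<char> chars;    // all characters, concatenated

  std::string row(int i) const {
    return std::string(chars.begin() + offsets[i], chars.begin() + offsets[i + 1]);
  }
};
```

The appeal is that both child columns are ordinary fixed-width columns, so existing code paths could operate on them.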
harrism commented 5 years ago

Pain Points

Mostly reiterating what I think are the most important things to fix first. Basically, the problem is that gdf_column lacks encapsulation.

  1. Explicit, manual, and low-level allocation and deallocation of column memory. Numba ~~is~~ was used for this on the Python side, which required RMM to provide a monkey-patched Numba workalike API for Python-side numba.cuda.device_array allocation.
  2. Returning new columns from cudf algorithms is risky because most routines still take pre-allocated column memory as an argument. Users may think that because it's returned they don't have to free the memory, which would lead to leaks.
  3. Managing valid masks is error prone and explicit.
  4. Because it's not a proper class, the API for columns is via free functions, which mostly have silly names and inconsistent interfaces.
revans2 commented 5 years ago
  1. It is not clear who owns a block of memory and who does not.
    • It is nice to have the ability to ignore the provided functions and take ownership of the memory to do things like coalescing allocations and host-to-device transfers.
  2. Documentation. I am okay with having the guts of the column on display. The issue is that I have to reach into the guts to move data back and forth between the host and device, and it is not fully explained anywhere what those guts really are, or what contract I have to follow. This is mostly around strings and string categories, which you have to special-case. But it becomes more critical as we add things outside the scope of this redesign.
  3. Sorry I really only have 2 for now.
harrism commented 5 years ago

I have an idea regarding step 5. Revise to:

5. Initial Refactor

  1. Two candidate libcudf features shall be chosen for refactoring to use the new cudf::column abstraction
  2. Two developers (TBD) will take responsibility for refactoring the features (one each) to use the newly designed abstraction(s) and submitting a cuDF PR for review. At least one of the developers shall be different from the developer who designed and implemented the column abstraction.
  3. Any required design changes exposed in refactoring shall be discussed in the PR...

I think this will help flush out more issues in the abstraction. We could keep it at one candidate feature, done by a different engineer than the column implementer, but two would be nice.

harrism commented 5 years ago

I also would like to tighten up the schedule a bit. I would like to get the basic abstraction designed and implemented by 0.8 release. If we can't do that, then I think the scope is too broad and should be narrowed. Since most of our pain points agree, I don't think the scope is too large at this point.

jrhemstad commented 5 years ago

Updated with @harrism's feedback about schedule and the initial refactor process.

jlowe commented 5 years ago

Pain Points

  1. Device memory management is explicit and error-prone, and I would like to see automated output allocation. However there are some use-cases with pre-allocated output targets, and it would be nice to not preclude those.
  2. Related to the previous, the management of CPU-side objects associated with gdf_column can be messy. Who is responsible for freeing the column name? Should it be possible to create a string category column without a corresponding NVCategory, and if/when should that NVCategory be freed along with the column?
  3. Validity masks are exposed more than I think they need to be.
  4. The lack of a C++ wrapper that bundles the relevant methods and simplifies proper handling of resource allocations within code that can throw.
eyalroz commented 5 years ago

@jlowe and everyone else: About column names - do we actually use column names at all in libcudf? I'm not asking about Python code, just the C/C++.

kovaltan commented 5 years ago

My pain points:

  1. Bitmap iterator and/or class (validity bitmask handling nulls). The lack of a bitmap iterator makes implementations messy when an API has to handle nulls. It would be easy to handle nulls if we used a const bitmap iterator together with a transform iterator. A validity bitmask class would also be useful for managing nulls, including allocation/deallocation helpers.

  2. Input column checks at each function. gdf functions usually have sanity checks for input/output columns; it would be better if cudf::column had a sanity-check API instead of every function repeating things like GDF_REQUIRE(nullptr != col_in, GDF_DATASET_EMPTY); GDF_REQUIRE(nullptr == col_in->valid || 0 == col_in->null_count, GDF_VALIDITY_UNSUPPORTED); or CUDF_EXPECTS(col != nullptr, "Input column is null");

  3. Returning new columns from cudf algorithms is risky. Same as @harrism: the allocation of the columns should be outside of the algorithms.

  4. cudf::test::column_wrapper does not accept an integer initializer list like {0,1,2,3}. This is off topic, but it would be nice to have.
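The bitmask-iterator idea in item 1 can be sketched in plain host-side C++; a device version would presumably wrap something like this in a thrust::transform_iterator. All names here are illustrative, not proposed API:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Read bit i of a packed 32-bit validity mask.
inline bool bit_is_set(const uint32_t* mask, int i) {
  return (mask[i / 32] >> (i % 32)) & 1u;
}

// Null-aware sum: null rows contribute the identity element (0) rather than
// whatever garbage sits in the data buffer, which is the "transform" idea.
int null_aware_sum(const std::vector<int>& data, const uint32_t* mask) {
  int sum = 0;
  for (int i = 0; i < static_cast<int>(data.size()); ++i)
    sum += bit_is_set(mask, i) ? data[i] : 0;
  return sum;
}
```

With a proper const bitmask iterator, the loop body collapses into a single transform over (value, validity) pairs.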

eyalroz commented 5 years ago

@kovaltan :

kovaltan commented 5 years ago

@eyalroz ,

  • You suggest that the allocation of columns should occur outside of "the algorithms" (by which I take it you mean the implementation of cudf API functions). But is that a pain point about the column abstraction we use? And - do you think this should be resolved through the design of a column class?

By "the algorithms" I mean functions like cudf::scan(), not the cudf::column class. For now, we pass a pre-allocated output column to cudf::scan(). I think it should stay pre-allocated, not allocated inside cudf::scan(). The cudf::column class can allocate device memory, and it may have a device_memory_owner flag; cudf::column should free the device memory if device_memory_owner = true.

  • About the input checks: Note that, for now, we have PR #1390, and specifically validate(my_gdf_column) which performs all of these checks.

It's really nice!

jlowe commented 5 years ago

@eyalroz

About column names - do we actually use column names at all in libcudf?

The column names are filled in by the csv and parquet parsers. There needs to be some way to convey the names discovered during parsing back to the caller. However I could see some debate whether it really needs to be stored directly in the column structure.

revans2 commented 5 years ago

@eyalroz and @kovaltan I do agree that a column should have some knowledge about ownership of the memory it references. I also realize that

Support delayed materialization or lazy evaluation

is not a part of the design goals here. But I think it will become an important feature likely combined with JITing so we should at least think ahead to how this might work and not paint ourselves into a corner. One of the key pieces of doing the lazy evaluation is that it can eliminate intermediate results and as such will not need to allocate the corresponding memory.

So if we want to be able to eliminate memory allocation we cannot pre-allocate it for every operation. I would propose, instead, that the operation itself do the device memory allocation. Not allocate the column class, but allocate the device memory the column ultimately points to. We could structure the column class so that it knows how to "allocate" and "free" memory which can be overridden by the user creating it. That should give us the flexibility to do just about anything we want with memory management of the columns, but also lets the algorithms themselves decide when to allocate memory, and even how much memory to allocate.

jrhemstad commented 5 years ago

Returning new columns from cudf algorithms is risky. Same as @harrism: the allocation of the columns should be outside of the algorithms.

@kovaltan I think you may have misunderstood Mark's comment here:

Returning new columns from cudf algorithms is risky because most routines still take pre-allocated column memory as an argument. Users may think that because it's returned they don't have to free the memory, which would lead to leaks.

I believe he was referring to how presently returning a gdf_column from a function is risky because we don't have any gdf_column destructor, so the burden falls on the caller to free the column's underlying memory.

For example:

```
// This is leaky because it requires the caller to remember to free the column's device memory
gdf_column output = gdf_algorithm(input...);
```

However, we have identified that it is desirable to be able to return outputs from functions rather than use output parameters (for various reasons: clearer style, delayed JIT evaluation, etc.). Therefore, we want to make that safer via ownership semantics.

As such, we likely will be moving in the direction that algorithms allocate the device memory for outputs.

To be clear:

```
// This is safe because lifetime management of the device memory is automated
cudf::column output = cudf::algorithm( input... );
```
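A minimal host-side sketch of what such ownership semantics buy. column_sketch and make_sequence are hypothetical names, and std::vector stands in for owned device memory (a real column would own an RMM device buffer):

```cpp
#include <cassert>
#include <memory>
#include <vector>

// The buffer is owned via a smart pointer, so returning the column by value
// transfers ownership and nothing can leak.
struct column_sketch {
  std::shared_ptr<std::vector<int>> data;
};

column_sketch make_sequence(int n) {
  auto buf = std::make_shared<std::vector<int>>();
  for (int i = 0; i < n; ++i) buf->push_back(i);
  return column_sketch{buf};  // caller never calls a manual free
}
```

When the last column_sketch referencing the buffer goes out of scope, the memory is released automatically, which is exactly the property gdf_column lacks.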
eyalroz commented 5 years ago

@revans2 :

I ... realize that .... delayed materialization or lazy evaluation ... is not a part of the design goals here. But I think it will become an important feature ... so we should at least think ahead to how this might work and not paint ourselves into a corner.

You are preaching to the choir; I definitely agree (in fact I would love to work on faster JITing using PTX, regardless of cudf). This is part of why I've proposed removing some features of columns from the basic/fundamental abstraction: flexibility for later use.

One of the key pieces of doing the lazy evaluation is that it can eliminate intermediate results and as such will not need to allocate the corresponding memory. So if we want to be able to eliminate memory allocation we cannot pre-allocate it for every operation.

But why are we certain that the allocating code will be part of what gets JITed together? Clearly, to construct operators by JITting we would use composable building-blocks which are not the functions in cudf/functions.h. So I would not be worried about this point.

I would propose, instead, that the operation itself do the device memory allocation. Not allocate the column class, but allocate the device memory the column ultimately points to. We could structure the column class so that it knows how to "allocate" and "free" memory which can be overridden by the user creating it.

I'm thinking about a design which is reminiscent of std::unique_ptr - where a choice of a different template parameter allows for a different de/allocation mechanism. If I followed you correctly, I think that should allow for the behavior you described.
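A rough host-side illustration of that unique_ptr-like idea: the de/allocation mechanism is a compile-time template parameter, in the spirit of std::unique_ptr's Deleter. malloc_policy and buffer are hypothetical names for illustration only:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// A swappable allocation policy, chosen at compile time. A device version
// might call RMM instead of malloc.
struct malloc_policy {
  static void* allocate(std::size_t n) { return std::malloc(n); }
  static void  deallocate(void* p) { std::free(p); }
};

template <typename AllocPolicy>
class buffer {
  void*       p_;
  std::size_t n_;
 public:
  explicit buffer(std::size_t n) : p_(AllocPolicy::allocate(n)), n_(n) {}
  ~buffer() { AllocPolicy::deallocate(p_); }
  buffer(const buffer&) = delete;             // owning, non-copyable
  buffer& operator=(const buffer&) = delete;
  std::size_t size() const { return n_; }
  void* data() const { return p_; }
};
```

Swapping the policy type swaps the memory source without touching the buffer class itself; the trade-off (discussed below in this thread) is that the policy is fixed at compile time.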

eyalroz commented 5 years ago

@jlowe :

The column names are filled in by the csv and parquet parsers. There needs to be some way to convey the names discovered during parsing back to the caller.

Certainly, but the parsers are not part of libcudf; so why should this extra information be passed to libcudf if it doesn't get used there at all?

harrism commented 5 years ago

Certainly, but the parsers are not part of libcudf; so why should this extra information be passed to libcudf if it doesn't get used there at all?

Since when? The CSV, Parquet, and soon ORC and JSON-lines parsers are all part of libcudf.

eyalroz commented 5 years ago

@harrism : Yes, right, I misspoke. That is part of the libcudf codebase. But the library itself, or rather the API functions which process columns, don't take CSVs and parse them, nor do they make calls to CSV-parsing code. And the names aren't used (it seems).

So, what if the names were to simply be parsed into a separate array in the csv_read_args structure?

kovaltan commented 5 years ago

@jrhemstad

Ah, I had misunderstood Mark's comment. I have only one concern about allocating device memory inside of cudf::algorithm(): it makes it hard for users to manage device memory allocation themselves. Though I'm not sure users want that, and there could be some way to pass a user-defined allocator to cudf::algorithm().

```
// This is safe because lifetime management of the device memory is automated
cudf::column* output = cudf::algorithm( input... );
```

Pros:

Cons:

jrhemstad commented 5 years ago

I have only one concern about allocating device memory inside of cudf::algorithm(). If allocating is in cudf::algorithm(), it's hard for users to manage device memory allocation by themselves, though I'm not sure if the users want, and there would be some way to pass user defined allocator to cudf::algorithm().

You've alluded to the solution. It's orthogonal to the column design, but ultimately we would like it such that you can pass a custom allocator into any cudf algorithm, similar to how you can with Thrust.

OlivierNV commented 5 years ago

Regarding column names: I think the name should eventually not be part of the column itself but part of a parent array of columns that would include the schema, containing column names and their relationships (this will also allow preserving schema information when reading/writing columnar formats).

nsakharnykh commented 5 years ago

Pain points

  1. Touching raw pointers and underlying data structures of the column
     a. Difficult to track down out-of-bounds and null access errors
     b. Difficult to evaluate other internal column formats (e.g. a list of chunks instead of one flat array)
     c. Difficult to update the validity mask from a CUDA kernel
  2. Memory management
     a. Who owns the memory allocated for the column, and when should it be freed
     b. What happens when someone tries to copy data (shallow vs. deep copy semantics)
     c. What are the mechanisms to realloc the column (resize/expand)
eyalroz commented 5 years ago

@nsakharnykh :

1.a. Difficult to track down out-of-bounds and null access errors

You mean, out-of-bounds accesses on the device side? e.g. my_column_data[1234] and such?

1.b. Difficult to evaluate other internal column formats (e.g. a list of chunks instead of one flat array)

Can you elaborate on this some more, with regards to the C++ part of libcudf? Where in the code does this happen?

1.c. Difficult to update the validity mask from a CUDA kernel

See PR #1391; it doesn't resolve anything fully but it makes it easier:

```
template <typename BitContainer, typename Size>
constexpr CUDA_HOST_DEVICE_CALLABLE
void turn_bit_on(BitContainer* bits, Size bit_index);
```

And specifically:

```
turn_bit_on<gdf_valid_type, gdf_size_type>(my_valid_ptr, 1234);
```

I'm sorry, I haven't addressed all of @harrism's comments on that PR yet; I will do it today and ping Mark again.

2.b. What happens when someone is trying to copy data (shallow vs deep copy semantics)

Can you be more explicit? Which scenarios are worrying you exactly?

2.c. What are the mechanisms to realloc the column (resize/expand)

Are you suggesting we should have such mechanisms, or is it just the lack of clarity about this issue?

felipeblazing commented 5 years ago

Pain Points

  1. No distinction between mutable and immutable columns. Columns that are mutable should be deep-copied when being moved around; immutable columns can be shared across threads and would only need to be shallow-copied.

  2. Inability to plug in various allocators. We have heard lots of mention of having cudf::column manage allocation and deallocation, but this would only be feasible for some users if we could supply our own allocators instead of whatever we pick as the default (RMM, I am guessing). Some people make temporary work spaces and reuse them throughout their execution to avoid allocations as much as possible.

  3. Lack of a unified standard for data representation. We currently have three different patterns for columns: the numeric primitives, GDF_STRING, and GDF_STRING_CATEGORY. The first is an array of numeric values and a valid bitmask; the second stores an NVStrings object in col.data instead of an array; and the third stores an NVCategory in dtype_info and a COPY of the indices from that NVCategory as an array of integers in col.data.

  4. Not allowing users to define their OWN column classes that derive from cudf::column. This would allow users to "enhance" the column class they are using. It would of course require that we pass around smart pointers to cudf::column instead of plain cudf::column values.

  5. Allowing for the chunking of columns and giving users the opportunity to make each chunk available on the GPU at the time it is needed. Items 4 and 5 here are somewhat related, the former probably being necessary for users to define WHEN and HOW things get moved into GPU memory before an algorithm requires them.

felipeblazing commented 5 years ago

This kind of code cudf::column* output = cudf::algorithm( input... ); sort of freaks me out. For example, there are many times where we need to perform temporary materializations, and we do something like the following:

```
max_width = 0
iterate over columns of same length
   if col.bytes_per_row > max_width
       max_width = col.bytes_per_row

allocate(max_width * num_rows)

iterate over columns
   do some row-wise transformation on column and store the interim result in temp space
   perform some kind of reduction operation and store results
```

This would not be possible if the code was orchestrated as above.
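For concreteness, the scratch-sizing step of the pattern above might look like this in host-side C++. col_sketch and scratch_bytes_needed are hypothetical stand-ins, not libcudf types:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a column's shape metadata.
struct col_sketch {
  int bytes_per_row;
  int num_rows;
};

// Size one reusable scratch buffer for the widest column, as in the
// pseudocode: max_width * num_rows.
std::size_t scratch_bytes_needed(const std::vector<col_sketch>& cols) {
  if (cols.empty()) return 0;
  int max_width = 0;
  for (const auto& c : cols) max_width = std::max(max_width, c.bytes_per_row);
  return static_cast<std::size_t>(max_width) * cols.front().num_rows;
}
```

The point of contention is who calls the allocator for that buffer: the caller up front (as here) or each algorithm internally.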

felipeblazing commented 5 years ago

Regarding column names: I think the name should eventually not be part of the column itself but part of a parent array of columns that would include the schema, containing column names and their relationships (this will also allow preserving schema information when reading/writing columnar formats).

Are you saying this so that we could have chunked columns? If so, are we really just talking about a cudf::column where each column would have an array (or vector) of cudf::column_chunks?

One thing that is interesting about this is that it could allow users to define their own implementation of cudf::column_chunks, as long as it implemented all of the methods that a cudf::column_chunk would need.

So I envision an example situation where a user has 3 columns, each with its own kind of "chunk".

For the first of these, the user decides to use an allocator that puts the data into UVM, because the user knows it will only traverse this data in order and it is very large.

The second contains data on which we will be doing lots of random I/O, so moving pages in and out would be inefficient; for it we use some allocator that allocates plain device memory.

The third is a chunk that is stored in constant memory or shared memory.

I might be starting to get a little beyond the scope of what we were hoping to do here but I would like to keep the options for doing these things later on open.

jrhemstad commented 5 years ago

This kind of code cudf::column* output = cudf::algorithm( input... ); sort of freaks me out. For example, there are many times where we need to perform temporary materializations, and we do something like the following:

```
max_width = 0
iterate over columns of same length
   if col.bytes_per_row > max_width
       max_width = col.bytes_per_row

allocate(max_width * num_rows)

iterate over columns
   do some row-wise transformation on column and store the interim result in temp space
   perform some kind of reduction operation and store results
```

This would not be possible if the code was orchestrated as above.

I think this is a solved problem if we allow you to pass in an allocator to cudf::algorithm to use for temporary workspace and output allocation (perhaps separate allocators for each? But that's starting to sound cumbersome).

```
cudf::column output = cudf::algorithm( input..., allocator );
```

(Note the output is not returned via pointer, but by value.)
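A toy host-side sketch of that per-call allocator hook; counting_allocator and sequence are hypothetical names, and the idea is analogous to supplying an allocator to a Thrust execution policy:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A caller-supplied allocator with observable state, so we can see that the
// algorithm routed its allocation request through it.
struct counting_allocator {
  std::size_t bytes_requested = 0;
  void* allocate(std::size_t n) { bytes_requested += n; return ::operator new(n); }
  void  deallocate(void* p) { ::operator delete(p); }
};

// The "algorithm" owns its output allocation but obtains memory through the
// caller's allocator (the scratch request here only demonstrates the hook).
template <typename Alloc>
std::vector<int> sequence(int n, Alloc& alloc) {
  void* scratch = alloc.allocate(n * sizeof(int));
  alloc.deallocate(scratch);
  std::vector<int> out(n);
  for (int i = 0; i < n; ++i) out[i] = i;
  return out;
}
```

The caller keeps control over the memory source while the algorithm keeps control over when and how much to allocate.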

eyalroz commented 5 years ago

@felipeblazing

Are you saying this so that we could have chunked columns? If so are we really just talking about a cudf::column and each column would have an array (or vector) of cudf::column_chunks ?

I'm not sure what Olivier's answer would be, but most (all?) of libcudf is oblivious to the existence of chunks. That is currently - to my knowledge - a non-consideration for the design of the column abstraction. If I am mistaken, someone please correct me on this point.

jrhemstad commented 5 years ago

libcudf is oblivious to the existence of chunks. That is currently - to my knowledge - a non-consideration for the design of the column abstraction. If I am mistaken, someone please correct me on this point.

Agreed, chunked columns are outside the scope of this initial conversation.

OlivierNV commented 5 years ago

Yes, I wasn't suggesting chunking columns, only to store the schema information at a higher level (and column name can really be considered as part of the schema information).

felipeblazing commented 5 years ago

cudf::column output = cudf::algorithm( input..., allocator ); What if there are multiple outputs and we want each output to use its own allocator, and have temp space use an allocator that we reserve for temporary work space?

felipeblazing commented 5 years ago

What are the concerns one might have if we said we can always initialize a gdf_column with a custom allocator as well, so that if it needs to be resized we can do that ourselves? We could even have something like a cudf::allocation object that we pass around with smart pointers.

wmalpica commented 5 years ago

A little late to the party, but my two cents. Pain Points:

  1. Always dealing with raw pointers
  2. Manually managing allocations
  3. Manually managing the lifetime of column
  4. Requiring so many external utilities to deal with type-erasure or the bitmasks
  5. Not having to allocate memory externally for all (or most) functions would be nice
  6. Making it easier to deal with GDF_STRING and GDF_STRING_CATEGORY

Requirements

  1. We should be able to define a custom allocator, and if not it defaults to RMM

Comments on comments above

  1. I agree that the name field should not be part of gdf_column.
  2. I think that being able to define custom allocators is important and has far-reaching implications.
     2a. For example, any algorithm that needs temp space should be able to receive a custom allocator.
     2b. All function outputs should be passed into the function, so that you can define the allocator on an output-by-output basis. I am not particularly fond of Jake's proposal for this to be the standard: cudf::column output = cudf::algorithm( input..., allocator );

I think when functions have multiple outputs, it becomes clunky, and I like the flexibility of being able to define allocators for different outputs.

jrhemstad commented 5 years ago

What if there are multiple outputs and we want each output to use its own allocator

That seems awfully contrived. Yes, you lose some control when the algorithm allocates the outputs, but you gain a lot more than you lose (safety, delayed evaluation, etc.).

have temp space use an allocator that we use for temporary work space?

Like I said, we could have separate allocators for outputs vs. temporary workspace.

What are the concerns one might have if we said we can always initialize a gdf_column with a custom allocator as well, so that if it needs to be resized we can do that ourselves? We could even have something like a cudf::allocation object that we pass around with smart pointers.

I don't quite follow. I'm thinking cudf::column design will accept an allocator argument such that you can allocate it however you want. Does that answer your question?

eyalroz commented 5 years ago

I would again like to remind us that the result of this discussion is a column abstraction, not a single-leap redesign of the cudf API. The abstraction should support the way you (or I, or whoever is making a comment) want allocations to work, and not necessarily, or not at all, force this functionality or implement it through methods of the column classes.

Having said that...

@williamBlazing :

2b. All function outputs should be passed into the function, so that you can define the allocator for it on an output by output basis

Can you give a specific API function and scenario in which this would be relevant?

@felipeblazing :

What are the concerns that one might have if we said we can always initialize a gdf_column with a custom allocator as well so that if it needs to be resized we can do that ourselves. we could even have something like a cudf::allocation object that we pass around with smart pointers.

Columns are not / should not be resizable. In fact, after "construction", so to speak, both their host-side and device-side state is/should be immutable. Has there been a decision to the contrary?

Also, you suggest support for run-time-determined allocators. This would preclude the use of a unique_ptr-like design for (members of) a cudf column. Did you really intend to suggest this? If you believe compile-time-determined allocators are insufficient, please explain and give an example.

revans2 commented 5 years ago

I agree philosophically that columns should be immutable once built, but at what point is a column done being built?

We have several use cases in real-world processing where we want to concatenate multiple columns together. Coalescing batches into a properly sized one after a filter or a distributed shuffle really needs this. Even a union of more than one table would need this to be efficient. Doing them in pairs with the APIs currently provided is not ideal because of memory and performance issues. Even with lazy materialization and JITing, the amount of memory used would be much larger than if we could just append one column to another.

Without immutability, we lose the possibility of lazy materialization, JITing, or even async processing. But at the same time, there are ways to get the best of both worlds: if we know what the use case is, we can design around it. @felipeblazing, what exactly is your use case where you want to grow an existing column? (By the way, there is a bug in RMM realloc that would have to be fixed before any hope of doing this properly.)

jrhemstad commented 5 years ago

This will preclude the use of a unique_ptr-like design for (members of) a cudf column. Did you really intend to suggest this? If you believe compile-time-determined allocators are insufficient, please explain and given an example.

I assume you're suggesting passing an allocator as a template argument. The biggest disadvantage is that it precludes stateful allocators, specifically, allocators that use a CUDA stream.
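To illustrate why stream-bound state pushes toward runtime allocator objects, here is a minimal sketch in which the allocator carries a stream handle as a runtime member. All names are illustrative; a plain int stands in for cudaStream_t, and malloc stands in for a stream-ordered device allocation:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// The stream is a value known only at run time, so it cannot be encoded in a
// purely compile-time (static member function) allocation policy.
struct stream_allocator {
  int stream;  // runtime state carried by the allocator object
  void* allocate(std::size_t n) { return std::malloc(n); }  // imagine an alloc ordered on `stream`
  void  deallocate(void* p) { std::free(p); }
};
```

A template-parameter policy with only static functions has nowhere to put this per-object state, which is the disadvantage being raised here.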

eyalroz commented 5 years ago

I assume you're suggesting passing an allocator as a template argument.

No, that's not what I meant. While std::make_unique() only takes a size, that doesn't mean we would do exactly the same thing. After all, even CUDA unique pointers take an extra argument for allocation. I just meant that the choice of allocator type would be by template, not through virtual allocator base classes and so on.

eyalroz commented 5 years ago

I agree philosophically that columns should be immutable once built, but at what point is a column done being built?

While it's true we haven't defined that anywhere, I can think of two intuitive answers:

  1. You only construct a cudf::column using already-generated data. No such thing as passing an existing column in order to write to it. (If we drop the out-parameter convention for libcudf API functions, that may be workable).
  2. The function that constructs a cudf::column instance is allowed to mutate the device-side data (and perhaps the host-side data; or not), but once it returns that cudf::column, no more mutation.

We have several use cases in real-world processing where we want to concatenate multiple columns together. Coalescing batches into a properly sized one after, a filter or a distributed shuffle really need this.

I'm not sure what you mean by a distributed shuffle in this context (especially since libcudf is not itself distributed), but let's take a filter. You've created some column, now you want to apply a filter to it. The result - a new column. You want to filter in-place? No can do, because our libcudf code cannot assume the column is not being used someplace else.

Now, if you wanted to introduce move semantics for columns - that's a reasonable suggestion; but can you really guarantee there are no other users of the cudf::column? Also, filtering in-place may well be more expensive than filtering out-of-place. And if you're worried about having enough space for the extra column - don't worry; cudf::columns are kind of small.
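For reference, a C++ move only transfers ownership from the moved-from object; it cannot protect against other aliases of the same buffer, which is the sole-ownership caveat above. A minimal illustration (with `std::vector` standing in for a column's buffer; `filter_in_place` is a hypothetical name, not a libcudf function):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Moving transfers the buffer without copying; the moved-from vector
// is left in a valid but unspecified (typically empty) state. Nothing
// here prevents a raw pointer handed out earlier from still referring
// to the buffer -- move semantics assume sole ownership.
std::vector<int> filter_in_place(std::vector<int>&& input) {
  std::vector<int> out{std::move(input)};  // take ownership, no copy
  out.erase(std::remove_if(out.begin(), out.end(),
                           [](int x) { return x < 0; }),
            out.end());
  return out;
}
```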

Without immutability, we lose the possibility of lazy materialization, JITing, or even async processing.

I'm not sure why you would say that. On the contrary, getting rid of variables in favor of single-assignment immutable values is a great help when you want to do those things. In fact, a typical compiler's internal representation uses static single assignment form.

revans2 commented 5 years ago

I think immutable data is important. But forcing it in all situations can be problematic for some optimizations.

The use cases I am looking at are mostly around distributed systems like Dask or Spark, or for cases where not all of the data fits into the memory of the device. For a filter, if not all of the data fits in memory prior to the filter, we may be able to do it post-filter, so we could load and filter the data in batches, building up a larger output dataset that can be processed more efficiently.

For the shuffle, this happens where several nodes each processed part of the data, did a hash partition, and shuffled all of the data with the same partition to a single node. Now I want to combine potentially hundreds of these partitions together to continue processing.

eyalroz commented 5 years ago

I think immutable data is important. But forcing it in all situations can be problematic for some optimizations.

The use cases I am looking at are mostly around distributed systems like Dask or Spark,

I never said anything about what Dask or Spark might do, I was only referring to libcudf - which is unaware of the wider context of execution AFAIK.

or for cases where not all of the data fits into the memory of the device.

libcudf does not itself handle such situations. It seems (and correct me if I'm wrong) that the design choice is to break these cases up into units of work which can fit on the device, and keep libcudf unaware of that decomposition.

For a filter, if not all of the data fits in memory prior to the filter, we may be able to do it post-filter, so we could load and filter the data in batches, building up a larger output dataset that can be processed more efficiently.

I don't see how this requires mutable columns. What is the significant efficiency gain? Also, recall that in the general case, filter results don't fit in GPU memory either.

For the shuffle, this happens where several nodes each processed part of the data, did a hash partition, and shuffled all of the data with the same partition to a single node. Now I want to combine potentially hundreds of these partitions together to continue processing.

  1. This is why I restricted my suggestion of committing to immutability to the scope of libcudf itself. This kind of activity does not seem to belong within libcudf: I wouldn't want such a distributed system to first write hundreds of separate mini-columns into GPU memory, and create a corresponding cudf::column for each, then finally combine them. This sounds like the kind of work a low-level communications library would do.
  2. If it were decided this kind of functionality should be part of libcudf, I'd make this kind of concatenation some sort of a special case.
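If concatenation were made such a special case, one possible shape for it (a hedged sketch with a hypothetical signature, using `std::vector<int>` as a stand-in for device columns, not the eventual libcudf API) would take the inputs as read-only and materialize a single new output, so the immutability-after-construction rule is preserved:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: concatenate read-only input "columns" into one
// freshly allocated output. The inputs are never mutated, so they can
// remain immutable and shared.
std::vector<int> concatenate(
    std::vector<std::vector<int> const*> const& parts) {
  std::size_t total = 0;
  for (auto const* p : parts) total += p->size();

  std::vector<int> out;
  out.reserve(total);  // single allocation for the combined result
  for (auto const* p : parts) out.insert(out.end(), p->begin(), p->end());
  return out;
}
```
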
revans2 commented 5 years ago

cuDF does not have to know what Dask or Spark is doing specifically, but it should provide common functionality to those systems. cuDF already does some of this with partition, which really has no value outside of a system like these.

With that said, there are optimizations that making the columns immutable at all times prevents. If all data has to be immutable once it hits cuDF, that is fine, but as the complexity of columns grows (e.g., strings and validity vectors), it is no longer a simple operation, and each implementation will have to figure out how to support this on its own.

eyalroz commented 5 years ago

it should provide common functionality to those systems. cuDF already does some of this with partition, which really has no value outside of a system like these.

The thing is, libcudf doesn't provide all possible common functionality, and AFAIK it's not supposed to. Granted, though, the exact scope of libcudf doesn't seem to be put in writing well enough (certainly not in the repository's README.md...)

Also, again, can you be specific about concrete optimization opportunities which will be missed due to columns being immutable after construction?

thomcom commented 5 years ago

Pain points

  1. gdf_columns are specifically that - columns. This makes indexing by rows, a common operation, expensive and meticulous at the python level of cudf. Transpose is now supported, but is an extremely expensive operation. Dozens of Pandas API calls support an axis argument, trivially alternating between the index or columns of a Dataframe or Series. It would be nice if the new take on gdf_column wasn't, literally, columns, but supported the ability to "swap axis" in an inexpensive single operation.
  2. Names - Pandas MultiIndex allows for multiple columns with the same name - something that can't be achieved easily in our current setup. We need an abstraction that is more complex than 1:1 between columns-as-allocated and the set of column-identifiers. A column identifier that is simultaneously a name and an index might cover all necessary use cases.
  3. Width x Height concerns - This is maybe a long shot, but CUDF is severely limited by dataframe orientation. Wide matrices quickly lose their edge over CPU based computing. Relating to #1 - columns should have arbitrary axes, and the most efficient implementation should be chosen either by the developer or by the algorithm.
  4. There's a lot of complaints about memory management, for good reason. I work 95+% of the time in python so 😎 but memory pools, reference counting and garbage collection, anyone? Objective-C makes a great example of how to do this well. Our gains are found during the GPU accelerated convolution of many frame values - I don't think that performance loss from GC would be noticeable.
eyalroz commented 5 years ago

@thomcom : Following your listing of points:

  1. Can you explain what you mean by "indexing by rows"? Do you mean accessing tuples - individual rows of a table - rather than dealing with disparate columns? Also, as for transposing - that would mean replacing, say, a single column with 1 million elements with 1 million columns of a single element each. Certainly that's not supported... at least not in libcudf :-(
  2. I think you're talking about the Python side here, because in libcudf we don't access columns by name. I also think you can even read from a CSV with multiple columns having the same name (not sure though).
  3. This relates to what you said about transposition. Essentially, libcudf handles columns, not two- or multi-dimensional data; and this is central to the design, beyond the column abstraction even. A library for multi-dimensional data is a different kind of beast than a library for manipulating columns; so you're suggesting a change in the library's entire scope. I am not for or against such a change, but here we're just considering a column abstraction for libcudf with its current scope. However, you might consider the encoding of a matrix in a three-column table, with columns for "matrix_row", "matrix_column" and "value".
  4. Modern C++ avoids the need for garbage collection, using smart pointers and other RAII classes, which are freed exactly when they stop being used. This is (mostly) encapsulated in the Resource Management section of the C++ Core Guidelines. You could argue that this is insufficient/inappropriate for our case and a garbage collector is required - but you would need to make a convincing argument, since a garbage collector is a large amount of very serious code, with all sorts of overhead (for developers and at runtime).
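As an illustration of the RAII point (a host-side sketch with hypothetical names; a device buffer would pair cudaMalloc with cudaFree in the same way): a `std::unique_ptr` with a custom deleter frees the allocation deterministically when its owner goes out of scope, with no garbage collector involved.

```cpp
#include <cstdlib>
#include <memory>

// Deleter that releases the allocation; for device memory this would
// call cudaFree (or return the block to a pool) instead of std::free.
struct buffer_deleter {
  void operator()(void* p) const { std::free(p); }
};

using buffer_ptr = std::unique_ptr<void, buffer_deleter>;

// Allocation paired with ownership: the buffer is freed exactly when
// its sole owner is destroyed -- no GC pauses, no leaks.
buffer_ptr make_buffer(std::size_t bytes) {
  return buffer_ptr{std::malloc(bytes)};
}
```
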
harrism commented 5 years ago

Actually @eyalroz we DO have transpose in libcudf but it's slow for large input columns, for obvious reasons. In general, @thomcom I think we would love to be able to provide efficient multi-axis indexing, but if you think about it, this is a very hard problem. It's easy in, say tensor libraries because everything in the tensor has the same datatype. But in a typed language, how do you efficiently transpose a table of columns with different types so you can access the row memory efficiently? You end up with columns that have to support multiple data types. Not pretty however you take it on...

eyalroz commented 5 years ago

Actually @eyalroz we DO have transpose in libcudf but it's slow for large input columns, for obvious reasons.

Hmm, cognitive dissonance on my part there. Couldn't bring myself to remember something which didn't make sense... #1603 .

thomcom commented 5 years ago

Thanks for your response @eyalroz.

  1. From the perspective of your average ML developer, accessing a tuple - a row of a table, doesn't have a real distinction from accessing a column. A.T.T = A after all. From a math perspective the orientation only matters to determine order of operations. At the user level it would be nice to have the ability to treat rows and columns interchangeably, including, at least, fast transpose. I partially understand your perspective coming from a database design standpoint.

  2. Ok - at the python level column naming has been causing me some real headaches - sounds like it is my problem (literally).

  3. This is specifically why I wanted to bring it up and at least get it thought about here. Following up on @harrism as well - it is possible to encode rows and columns together in contiguous data and index into it in O(1). I agree it could get complicated, and obviously single-type columns are much more efficient than columns with mixed type. I want you guys thinking about it and knowing that Pandas people are asking for it, because transpose and row/column major conversions are used very frequently and they're slow like a barge in python, unless the library is written to just swap labels in python-native data (like Pandas). Then it's just slow like python.

I'm sorry if I'm way out in left field here. I have a fantasy that with careful design we could, for example, transpose 10GB of columns in a few operations, simply by shuffling labels around. This would be powerful.

  1. The most common pain point has been memory management. I'm not terribly up-to-date on C++11 or 14, but I don't get the impression that "Do RAII better" is the answer for our many developers. I suggested reference counting and memory pools because it sounded like the best of both worlds in this context. Developers can simply pull new objects cudf::column cool_column = coolalg(args...) and not worry about managing them, or they can allocate a memory pool and do whatever manual allocations they want, trusting that when the memory pool has been destructed, they have cleaned up after themselves.
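For what it's worth, the reference-counting behavior described here already exists in C++ without a garbage collector, via `std::shared_ptr`. A hedged sketch of the "just pull new objects and don't worry" usage (`column_handle` and `coolalg` are hypothetical names, not a proposed API):

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical column handle: the underlying buffer is reference
// counted and freed automatically when the last handle referring to
// it is destroyed -- deterministic, with no GC pass.
using column_handle = std::shared_ptr<std::vector<float>>;

column_handle coolalg(std::size_t n) {
  return std::make_shared<std::vector<float>>(n, 1.0f);
}
```

Memory pools are a separate, complementary concern; in the libcudf ecosystem that role is played by a pluggable device memory allocator rather than by the column abstraction itself.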