jrhemstad closed this issue 4 years ago.
My pain points:
- `static_cast<T*>(col->data)`, `if(nullptr == col->valid)`, `gdf_is_valid(col->valid, i)`, etc.
- `gdf_valid_allocation_size` and `gdf_num_bitmask_elements`
- A `cudf::column` object to be used via Cython for better cudf<->libcudf interop
- `cudf::column` for better maintainability and cross-language functionality
- `optional` and `variant` to express aspects of columns' semantics. It's the idiomatic thing to do.

Mostly reiterating what I think are the most important things to fix first. Basically, the problem is that `gdf_column` lacks encapsulation.
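The `optional` point can be made concrete. Below is a hypothetical host-side sketch (all names are invented; `std::vector` stands in for device memory) of how `std::optional` could replace the `nullptr == col->valid` convention for "this column may or may not have a null mask":

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical sketch: expressing "a column may or may not have nulls"
// with std::optional instead of a raw, possibly-null `valid` pointer.
struct null_mask {
    std::vector<uint32_t> bits;   // one validity bit per row (host-side stand-in)
    int null_count = 0;
};

struct column_sketch {
    std::vector<int> data;          // fixed-width data (int for illustration)
    std::optional<null_mask> mask;  // empty optional == column cannot contain nulls

    bool nullable() const { return mask.has_value(); }
    int null_count() const { return mask ? mask->null_count : 0; }
};
```

Callers then ask `col.nullable()` instead of comparing pointers against `nullptr`, and the "no mask" state is unrepresentable as a dangling pointer.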
`numba.cuda.device_array` allocation. I have an idea regarding step 5. Revise to:
I think this will help flush out more issues in the abstraction. We could keep it at one candidate feature, done by a different engineer than the column implementer, but two would be nice.
I also would like to tighten up the schedule a bit. I would like to get the basic abstraction designed and implemented by 0.8 release. If we can't do that, then I think the scope is too broad and should be narrowed. Since most of our pain points agree, I don't think the scope is too large at this point.
Updated with @harrism's feedback about schedule and the initial refactor process.
@jlowe and everyone else: About column names - do we actually use column names at all in libcudf? I'm not asking about Python code, just the C/C++.
My pain points:
Bitmap iterator and/or class (validity bitmask handling nulls): the lack of a bitmap iterator makes the implementation messy when the API handles nulls. It would be easy to handle nulls if we used a const bitmap iterator with a transform iterator. A validity bitmask class would also be useful for managing nulls, including allocation/deallocation helpers.
Input column checks in each function: gdf functions usually have sanity checks for input/output columns; it would be better if cudf::column had a sanity-check API: `GDF_REQUIRE(nullptr != col_in, GDF_DATASET_EMPTY); GDF_REQUIRE(nullptr == col_in->valid || 0 == col_in->null_count, GDF_VALIDITY_UNSUPPORTED);` or `CUDF_EXPECTS(col != nullptr, "Input column is null");`
Returning new columns from cudf algorithms is risky: same as @harrism. The allocation of the columns should be outside of the algorithms.
cudf::test::column_wrapper
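To illustrate the bitmap-iterator pain point above, here is a hedged host-side sketch (names invented; a real version would be a device-side iterator usable with `thrust::transform_iterator`) of reading validity like an ordinary range instead of calling `gdf_is_valid()` by hand:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical "validity reader": wraps the packed bitmask in an object
// that, given a row index, yields true/false, so algorithms can consume
// validity like any other input range.
struct bit_reader {
    const uint32_t* bits;  // packed validity words, 32 rows per word
    bool operator()(std::size_t row) const {
        return (bits[row / 32] >> (row % 32)) & 1u;
    }
};

// Example consumer: count valid rows without touching bit arithmetic.
inline std::size_t count_valid(bit_reader valid, std::size_t num_rows) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < num_rows; ++i) n += valid(i);
    return n;
}
```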
@kovaltan :
validate(my_gdf_column)
which performs all of these checks.

@eyalroz ,
- You suggest that the allocation of columns should occur outside of "the algorithms" (by which I take it you mean the implementation of cudf API functions). But is that a pain point about the column abstraction we use? And - do you think this should be resolved through the design of a column class?
I mean "the algorithms" like cudf::scan(); I don't mean the cudf::column class. For now, we pass a pre-allocated output column to cudf::scan(). I think it should be pre-allocated, and should not be allocated in cudf::scan(). The cudf::column class can allocate device memory, and may have a device_memory_owner flag; cudf::column should free the device memory if device_memory_owner = true.
- About the input checks: Note that, for now, we have PR #1390, and specifically
validate(my_gdf_column)
which performs all of these checks.
It's really nice!
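As a rough illustration of what a centralized `validate()` / `CUDF_EXPECTS`-style check could look like, here is a host-only sketch with invented names (this is not the actual PR #1390 code):

```cpp
#include <stdexcept>

// Sketch of a CUDF_EXPECTS-style check; the macro and struct names here
// are illustrative, not libcudf's real ones.
#define EXPECTS_SKETCH(cond, reason)                      \
    do {                                                  \
        if (!(cond)) throw std::invalid_argument(reason); \
    } while (0)

struct fake_column { void* data; void* valid; int null_count; };

// Centralized input validation a column abstraction could expose,
// replacing ad-hoc GDF_REQUIRE checks repeated in every function.
inline void validate(const fake_column* col) {
    EXPECTS_SKETCH(col != nullptr, "Input column is null");
    EXPECTS_SKETCH(col->valid != nullptr || col->null_count == 0,
                   "Null count is nonzero but no validity mask");
}
```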
@eyalroz
About column names - do we actually use column names at all in libcudf?
The column names are filled in by the csv and parquet parsers. There needs to be some way to convey the names discovered during parsing back to the caller. However I could see some debate whether it really needs to be stored directly in the column structure.
@eyalroz and @kovaltan I do agree that a column should have some knowledge about ownership of the memory it references. I also realize that
Support delayed materialization or lazy evaluation
is not a part of the design goals here. But I think it will become an important feature likely combined with JITing so we should at least think ahead to how this might work and not paint ourselves into a corner. One of the key pieces of doing the lazy evaluation is that it can eliminate intermediate results and as such will not need to allocate the corresponding memory.
So if we want to be able to eliminate memory allocation we cannot pre-allocate it for every operation. I would propose, instead, that the operation itself do the device memory allocation. Not allocate the column class, but allocate the device memory the column ultimately points to. We could structure the column class so that it knows how to "allocate" and "free" memory which can be overridden by the user creating it. That should give us the flexibility to do just about anything we want with memory management of the columns, but also lets the algorithms themselves decide when to allocate memory, and even how much memory to allocate.
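One possible host-side sketch of the "column knows how to allocate and free, overridable by its creator" idea (all names are hypothetical; `std::malloc`/`std::free` stand in for device allocation):

```cpp
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <utility>

// A buffer that owns its memory but lets the creator override both the
// allocation and deallocation functions, so an algorithm can decide when
// (or whether) to allocate for eliminated intermediates.
struct owned_buffer {
    using alloc_fn = std::function<void*(std::size_t)>;
    using free_fn  = std::function<void(void*)>;

    owned_buffer(std::size_t bytes,
                 alloc_fn alloc  = [](std::size_t n) { return std::malloc(n); },
                 free_fn dealloc = [](void* p) { std::free(p); })
        : data_(alloc(bytes)), size_(bytes), free_(std::move(dealloc)) {}

    ~owned_buffer() { if (data_) free_(data_); }
    owned_buffer(const owned_buffer&) = delete;
    owned_buffer& operator=(const owned_buffer&) = delete;

    void* data() const { return data_; }
    std::size_t size() const { return size_; }

private:
    void* data_;
    std::size_t size_;
    free_fn free_;
};
```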
Returning new columns from cudf algorithms is risky Same as @harrism The allocation of the columns should be out side of the algorithms.
@kovaltan I think you may have misunderstood Mark's comment here:
Returning new columns from cudf algorithms is risky because most routines still take pre-allocated column memory as an argument. Users may think that because it's returned they don't have to free the memory, which would lead to leaks.
I believe he was referring to how presently returning a gdf_column
from a function is risky because we don't have any gdf_column
destructor, so the burden falls on the caller to free the column's underlying memory.
For example:
// This is leaky because it requires the caller to remember to free the column's device memory
gdf_column output = gdf_algorithm(input...);
However, we have identified that it is desirable to be able to return outputs from functions rather than use output parameters (for various reasons: clearer style, delayed JIT evaluation, etc.). Therefore, we want to make that safer via ownership semantics.
As such, we likely will be moving in the direction that algorithms allocate the device memory for outputs.
To be clear:
// This is safe because lifetime management of the device memory is automated
cudf::column output = cudf::algorithm( input... );
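A minimal host-side illustration of that ownership contrast (names invented; `std::vector` behind a smart pointer stands in for owned device memory):

```cpp
#include <memory>
#include <utility>
#include <vector>

// With a plain struct of raw pointers, the caller must remember to free.
// With a value type that owns its storage, returning from an algorithm is
// leak-free: ownership travels with the returned value.
struct safe_column {
    std::shared_ptr<std::vector<int>> data;  // stands in for owned device memory
};

// Hypothetical algorithm that allocates and returns its own output.
safe_column some_algorithm(const safe_column& input) {
    auto out = std::make_shared<std::vector<int>>(*input.data);
    for (auto& v : *out) v *= 2;         // pretend computation
    return safe_column{std::move(out)};  // no manual free required by the caller
}
```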
@revans2 :
I ... realize that .... delayed materialization or lazy evaluation ... is not a part of the design goals here. But I think it will become an important feature ... so we should at least think ahead to how this might work and not paint ourselves into a corner.
You are preaching to the choir - I definitely agree (in fact I would love to work on faster JITing using PTX, regardless of cudf). This is part of why I've proposed removing some of the features of columns for the basic/fundamental abstraction - flexibility for later use.
One of the key pieces of doing the lazy evaluation is that it can eliminate intermediate results and as such will not need to allocate the corresponding memory. So if we want to be able to eliminate memory allocation we cannot pre-allocate it for every operation.
But why are we certain that the allocating code will be part of what gets JITed together? Clearly, to construct operators by JITting we would use composable building-blocks which are not the functions in cudf/functions.h
. So I would not be worried about this point.
I would propose, instead, that the operation itself do the device memory allocation. Not allocate the column class, but allocate the device memory the column ultimately points to. We could structure the column class so that it knows how to "allocate" and "free" memory which can be overridden by the user creating it.
I'm thinking about a design which is reminiscent of std::unique_ptr
- where a choice of a different template parameter allows for a different de/allocation mechanism. If I followed you correctly, I think that should allow for the behavior you described.
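A sketch of that `std::unique_ptr`-like direction, choosing the deallocation mechanism via a template parameter (names are hypothetical; `std::malloc`/`std::free` stand in for a device allocator such as one backed by RMM):

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>

// The deallocation mechanism is a compile-time template parameter,
// exactly as std::unique_ptr's Deleter is, rather than a virtual
// allocator base class.
struct host_free {
    void operator()(void* p) const { std::free(p); }
};

template <typename T, typename Deleter>
using owned_ptr = std::unique_ptr<T, Deleter>;

// Hypothetical factory: allocate n ints whose lifetime is managed by the
// chosen deleter type.
inline owned_ptr<int, host_free> make_owned_ints(std::size_t n) {
    return owned_ptr<int, host_free>(
        static_cast<int*>(std::malloc(n * sizeof(int))));
}
```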
@jlowe :
The column names are filled in by the csv and parquet parsers. There needs to be some way to convey the names discovered during parsing back to the caller.
Certainly, but the parsers are not part of libcudf; so why should this extra information be passed to libcudf if it doesn't get used there at all?
Certainly, but the parsers are not part of libcudf; so why should this extra information be passed to libcudf if it doesn't get used there at all?
Since when? The CSV, Parquet, and soon ORC and JSON Lines parsers are all part of libcudf.
@harrism : Yes, right, I misspoke. That is part of the libcudf codebase. But the library itself, or rather the API functions which process columns, don't take CSVs and parse them, nor do they make calls to CSV-parsing code. And the names aren't used (it seems).
So, what if the names were to simply be parsed into a separate array in the csv_read_args
structure?
@jrhemstad
Ah, I've misunderstood Mark's comment. I have only one concern about allocating device memory inside of cudf::algorithm(): if allocation happens in cudf::algorithm(), it's hard for users to manage device memory allocation themselves, though I'm not sure users want that, and there could be some way to pass a user-defined allocator to cudf::algorithm().
```
// This is safe because lifetime management of the device memory is automated
cudf::column* output = cudf::algorithm( input... );
```
Pros:
Cons:
I have only one concern about allocating device memory inside of cudf::algorithm(): if allocation happens in cudf::algorithm(), it's hard for users to manage device memory allocation themselves, though I'm not sure users want that, and there could be some way to pass a user-defined allocator to cudf::algorithm().
You've alluded to the solution. It's orthogonal to the column design, but ultimately we would like it such that you can pass a custom allocator into any cudf algorithm, similar to how you can with Thrust.
Regarding column names: I think it should eventually not be part of the column itself but part of a parent array of column that would include the schema, containing column names and their relationship (will also allow preserving schema information when reading/writing columnar formats).
@nsakharnykh :
1.a. Difficult to track down out-of-bounds and null access errors
You mean, out-of-bounds accesses on the device side? e.g. my_column_data[1234]
and such?
1.b. Difficult to evaluate other internal column formats (e.g. a list of chunks instead of one flat array)
Can you elaborate on this some more, with regards to the C++ part of libcudf? Where in the code does this happen?
1.c. Difficult to update the validity mask from a CUDA kernel
See PR #1391; it doesn't resolve anything fully but it makes it easier:
template <typename BitContainer, typename Size>
constexpr CUDA_HOST_DEVICE_CALLABLE
void turn_bit_on(BitContainer* bits, Size bit_index);
And specifically:
turn_bit_on<gdf_valid_type, gdf_size_type>(my_valid_ptr, 1234);
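For readers unfamiliar with the helper, here is a plausible host-only rendering of what such a `turn_bit_on` does (this sketch is not the PR #1391 code; the real device-callable version would need atomics to support concurrent updates from a CUDA kernel):

```cpp
#include <cstddef>

// Host-side sketch: set bit `bit_index` in a packed validity mask whose
// word type is BitContainer. Bit i of the mask marks row i as valid.
template <typename BitContainer, typename Size>
constexpr void turn_bit_on_sketch(BitContainer* bits, Size bit_index) {
    constexpr Size bits_per_word = sizeof(BitContainer) * 8;
    bits[bit_index / bits_per_word] |=
        BitContainer{1} << (bit_index % bits_per_word);
}
```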
I'm sorry, I haven't addressed all of @harrism comments on that PR yet - will do it today, and ping Mark again.
2.b. What happens when someone is trying to copy data (shallow vs deep copy semantics)
Can you be more explicit? Which scenarios are worrying you exactly?
2.c. What are the mechanisms to realloc the column (resize/expand)
Are you suggesting we should have such mechanisms, or is it just the lack of clarity about this issue?
1. No distinction between mutable and immutable columns. Columns that are mutable should be deep copied when being moved around; immutable columns can be shared across threads and would only need to be shallow copied.
2. Inability to plug in various allocators. We have heard lots of mention of having cudf::column manage allocation and deallocation, but this would only be feasible for some users if we could make our own allocators to use instead of whatever we pick as the default (RMM, I am guessing). Some people make temporary work spaces and reuse them throughout their execution to avoid allocations as much as possible.
3. Lack of a unified standard for data representation. We currently have 3 different patterns for columns: that of all the numeric primitives, GDF_STRING, and GDF_STRING_CATEGORY. The first is an array of numeric values and a valid bitmask; the second, instead of storing an array in col.data, stores an NVStrings object; and the third stores an NVCategory in dtype_info and a COPY of the indices from that NVCategory as an array of integers in col.data.
4. Not allowing users to define their OWN column classes, deriving from cudf::column, if they want. This would allow users to "enhance" the column class they are using. It would of course require that we pass around smart pointers to cudf::columns instead of just cudf::column values.
5. Allowing for the chunking of columns and giving users the opportunity to make each chunk available on the GPU at the time it is needed. Items 4 and 5 here are somewhat related, the former probably being necessary for users to define WHEN and HOW things get moved into GPU memory before an algorithm requires their usage.
This kind of code cudf::column* output = cudf::algorithm( input... );
sort of freaks me out. For example we have many times where we need to perform temporary materializations and we do something like the following.
```
max_width = 0
iterate over columns of same length
    if col.bytes_per_row > max_width
        max_width = col.bytes_per_row
allocate(max_width * num_rows)
iterate over columns
    do some row-wise transformation on the column and store the interim result in temp space
perform some kind of reduction operation and store results
```
This would not be possible if the code was orchestrated as above.
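The workspace-sizing step of that pattern can be sketched in plain host C++ (all names invented): size one scratch buffer to the widest column, then reuse it for every column's intermediate result instead of allocating per operation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical column descriptor: only the per-row width matters here.
struct col_desc { std::size_t bytes_per_row; };

// One allocation, sized for the widest column, reused across all columns.
inline std::size_t workspace_bytes(const std::vector<col_desc>& cols,
                                   std::size_t num_rows) {
    std::size_t max_width = 0;
    for (const auto& c : cols)
        max_width = std::max(max_width, c.bytes_per_row);
    return max_width * num_rows;
}
```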
Regarding column names: I think it should eventually not be part of the column itself but part of a parent array of column that would include the schema, containing column names and their relationship (will also allow preserving schema information when reading/writing columnar formats).
Are you saying this so that we could have chunked columns? If so are we really just talking about a cudf::column and each column would have an array (or vector) of cudf::column_chunks ?
One thing that is interesting about this is that it could allow users to define their own implementation of cudf::column_chunks, as long as it implemented all of the methods that a cudf::column_chunk would need.
So I envision an example situation where a user has 3 columns, each with its own kind of "chunk".
For the first of these, the user decides to use an allocator that puts the data into UVM, because the user knows it will only be traversing this data in order and it is very large.
The second contains data on which we will be doing lots of random I/O, so moving the pages in and out would be inefficient; we therefore use some allocator to allocate device memory.
The third is a chunk that is stored in constant memory or shared memory.
I might be starting to get a little beyond the scope of what we were hoping to do here but I would like to keep the options for doing these things later on open.
This kind of code
cudf::column* output = cudf::algorithm( input... );
sort of freaks me out. For example we have many times where we need to perform temporary materializations and we do something like the following. [pseudocode elided]
This would not be possible if the code was orchestrated as above.
I think this is a solved problem if we allow you to pass in an allocator to cudf::algorithm
to use for temporary workspace and output allocation (perhaps separate allocators for each? But that's starting to sound cumbersome).
cudf::column output = cudf::algorithm( input..., allocator );
(Note the output is not returned via pointer, but by value)
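A host-side sketch of the allocator-parameter idea (the signature is hypothetical, not the actual cudf API; `std::malloc` stands in for device allocation): the algorithm decides when and how much to allocate, while the caller decides where the memory comes from.

```cpp
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <vector>

// Caller-supplied allocation policy; a real version might also carry a
// stream or be a separate type for temporary workspace vs. outputs.
using allocator_t = std::function<void*(std::size_t)>;

struct result { void* data; std::size_t bytes; };

// Hypothetical algorithm: doubles each input value into memory obtained
// from the caller's allocator.
inline result doubled(const std::vector<int>& input, allocator_t alloc) {
    std::size_t bytes = input.size() * sizeof(int);
    int* out = static_cast<int*>(alloc(bytes));  // caller-chosen allocation
    for (std::size_t i = 0; i < input.size(); ++i) out[i] = input[i] * 2;
    return result{out, bytes};
}
```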
@felipeblazing
Are you saying this so that we could have chunked columns? If so are we really just talking about a cudf::column and each column would have an array (or vector) of cudf::column_chunks ?
I'm not sure what Olivier's answer would be, but most (all?) of libcudf is oblivious to the existence of chunks. That is currently - to my knowledge - a non-consideration for the design of the column abstraction. If I am mistaken, someone please correct me on this point.
libcudf is oblivious to the existence of chunks. That is currently - to my knowledge - a non-consideration for the design of the column abstraction. If I am mistaken, someone please correct me on this point.
Agreed, chunked columns are outside the scope of this initial conversation.
Yes, I wasn't suggesting chunking columns, only to store the schema information at a higher level (and column name can really be considered as part of the schema information).
cudf::column output = cudf::algorithm( input..., allocator );
What if there are multiple outputs and we want each output to use its own allocator and have temp space use an allocator that we use for temporary work space?
What are the concerns that one might have if we said we can always initialize a gdf_column with a custom allocator as well, so that if it needs to be resized we can do that ourselves? We could even have something like a cudf::allocation object that we pass around with smart pointers.
A little late to the party, but my two cents: Pain Points:
Requirements
Comments on comments above
cudf::column output = cudf::algorithm( input..., allocator );
I think when functions have multiple outputs, it becomes clunky and I like the flexibility of being able to define allocators for different outputs.
What if there are multiple outputs and we want each output to use its own allocator
That seems awfully contrived. Yes, you lose some control when the algorithm allocates the outputs, but you gain a lot more than you lose (safety, delayed evaluation, etc.).
have temp space use an allocator that we use for temporary work space?
Like I said, we could have separate allocators for outputs vs. temporary workspace.
What are the concerns that one might have if we said we can always initialize a gdf_column with a custom allocator as well, so that if it needs to be resized we can do that ourselves? We could even have something like a cudf::allocation object that we pass around with smart pointers.
I don't quite follow. I'm thinking the `cudf::column` design will accept an `allocator` argument such that you can allocate it however you want. Does that answer your question?
I would again like to remind us that the result of this discussion is a column abstraction, not a single-leap redesign of the cudf API. The abstraction should support the way you (or me, or whoever is making a comment) wants to have allocations work, and not necessarily, or not at all, force this functionality or implement it through methods of the column classes.
Having said that...
@williamBlazing :
2b. All function outputs should be passed into the function, so that you can define the allocator for it on an output by output basis
Can you give a specific API function and scenario in which this would be relevant?
@felipeblazing :
What are the concerns that one might have if we said we can always initialize a gdf_column with a custom allocator as well, so that if it needs to be resized we can do that ourselves? We could even have something like a cudf::allocation object that we pass around with smart pointers.
Columns are not / should not be resizable. In fact, after "construction", so to speak, both their host side and their device side are / should be immutable. Has there been a decision to the contrary?
Also, you suggest support for run-time-determined allocators. This will preclude the use of a unique_ptr-like design for (members of) a cudf column. Did you really intend to suggest this? If you believe compile-time-determined allocators are insufficient, please explain and give an example.
I agree philosophically that columns should be immutable once built, but at what point is a column done being built?
We have several use cases in real-world processing where we want to concatenate multiple columns together. Coalescing batches into a properly sized one after a filter or a distributed shuffle really needs this. Even a union of more than one table would need this to be efficient. Doing them in pairs with the APIs currently provided is not ideal because of memory and performance issues. Even with lazy materialization and JITing, the amount of memory used would be much larger than if we could just append one column to another column.
Without immutability, we lose the possibility of lazy materialization, JITing, or even async processing. But at the same time, there are ways to get the best of both worlds if we know what the use case is we can design around it. @felipeblazing what exactly is your use case where you want to grow an existing column? (By the way, there is a bug in RMM realloc so that would have to be fixed first before any hope of doing this properly)
This will preclude the use of a unique_ptr-like design for (members of) a cudf column. Did you really intend to suggest this? If you believe compile-time-determined allocators are insufficient, please explain and give an example.
I assume you're suggesting passing an allocator as a template argument. The biggest disadvantage is that it precludes stateful allocators, specifically, allocators that use a CUDA stream.
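To make the stateful-allocator point concrete, here is a minimal host-side sketch (all names invented; the `stream_id` member stands in for a `cudaStream_t`). Per-instance runtime state like this is exactly what a purely type-level, compile-time allocator cannot carry.

```cpp
#include <cstddef>
#include <cstdlib>

// A stateful allocator: two instances of the SAME type can allocate on
// different streams, so the stream must be runtime state, not a template
// parameter. std::malloc/std::free stand in for stream-ordered device
// allocation (e.g. an RMM-style rmmAlloc(&p, n, stream)).
struct stream_allocator {
    int stream_id;  // runtime state, distinct per instance

    void* allocate(std::size_t n) const {
        // A real version would pass stream_id to the device allocator here.
        return std::malloc(n);
    }
    void deallocate(void* p) const { std::free(p); }
};
```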
I assume you're suggesting passing an allocator as a template argument.
No, that's not what I meant. While std::make_unique()
only takes a size, that doesn't mean we would do exactly the same thing. After all, even CUDA unique pointers take an extra argument for allocation. I just meant that the choice of allocator type would be by template, not through virtual allocator base classes and so on.
I agree philosophically that columns should be immutable once built, but at what point is a column done being built?
While it's true we haven't defined that anywhere, I can think of two intuitive answers:
We have several use cases in real-world processing where we want to concatenate multiple columns together. Coalescing batches into a properly sized one after, a filter or a distributed shuffle really need this.
I'm not sure what you mean by a distributed shuffle in this context (especially since libcudf is not itself distributed), but let's take a filter. You've created some column, now you want to apply a filter to it. The result - a new column. You want to filter in-place? No can do, because our libcudf code cannot assume the column is not being used someplace else.
Now, if you wanted to introduce move semantics for columns - that's a reasonable suggestion; but can you really guarantee no other users of the cudf::column? Also, filtering in-place may well be more expensive than filtering out-of-place. And if you're worried about having enough space for the extra column - don't worry; cudf::columns are kind of small.
Without immutability, we lose the possibility of lazy materialization, JITing, or even async processing.
I'm not sure why you would say that. On the contrary, getting rid of variables in favor of single-assignment immutable values is a great help when you want to do those things. In fact, a typical compiler's internal representation uses static single assignment form.
I think immutable data is important. But forcing it in all situations can be problematic for some optimizations.
The use cases I am looking at are mostly around distributed systems like dask or spark, or cases where not all of the data fits into the memory of the device. For a filter, if not all of the data fits in memory prior to the filter, it may fit post-filter, so we could load and filter the data in batches, building up a larger output dataset that can be processed more efficiently.
For the shuffle this happens where several nodes all processed part of the data did a hash partition and shuffled all of the data with the same partition to a single node. Now I want to combine potentially hundreds of these partitions together to continue processing.
I think immutable data is important. But forcing it in all situations can be problematic for some optimizations.
The use cases I am looking at are mostly around distributed systems like dask or spark,
I never said anything about what Dask or Spark might do, I was only referring to libcudf - which is unaware of the wider context of execution AFAIK.
or for cases where not all of the data fits into the memory of the device.
libcudf does not itself handle such situations. It seems (and correct me if I'm wrong) that the design choice is to break these cases up into units of work which can fit on the device, and keep libcudf unaware of that decomposition.
For a filter, if not all of the data fits in memory prior to the filter, it may fit post-filter, so we could load and filter the data in batches, building up a larger output dataset that can be processed more efficiently.
I don't see how this requires mutable columns. What is the significant efficiency gain? Also, recall that in the general case, filter results don't fit in GPU memory either.
For the shuffle this happens where several nodes all processed part of the data did a hash partition and shuffled all of the data with the same partition to a single node. Now I want to combine potentially hundreds of these partitions together to continue processing.
Cudf does not have to know what dask or spark is doing specifically, but it should provide common functionality to those systems. Cudf already does some of this with partition, which has really no value outside of a system like these.
With that said, there are optimizations that making the columns immutable at all times prevents. If all data has to be immutable once it hits cudf, that is fine, but as the complexity of columns grows (e.g. strings and validity vectors), it is no longer a simple operation, and each implementation will have to figure out how to support this on its own.
it should provide common functionality to those systems. Cudf already does some of this with partition, which has really no value outside of a system like these.
The thing is, libcudf doesn't provide all possible common functionality, and AFAIK it's not supposed to. Granted, though, the exact scope of libcudf doesn't seem to be put in writing well enough (certainly not in the repository's README.md
...)
Also, again, can you be specific about concrete optimization opportunities which will be missed due to columns being immutable after construction?
[…] `axis` argument, trivially alternating between the index or columns of a DataFrame or Series. It would be nice if the new take on `gdf_column` wasn't, literally, columns, but supported the ability to "swap axis" in an inexpensive single operation.

@thomcom : Following your listing of points:
Actually @eyalroz we DO have transpose in libcudf but it's slow for large input columns, for obvious reasons. In general, @thomcom I think we would love to be able to provide efficient multi-axis indexing, but if you think about it, this is a very hard problem. It's easy in, say, tensor libraries because everything in the tensor has the same datatype. But in a typed language, how do you efficiently transpose a table of columns with different types so you can access the row memory efficiently? You end up with columns that have to support multiple data types. Not pretty however you take it on...
Actually @eyalroz we DO have
transpose
in libcudf but it's slow for large input columns, for obvious reasons.
Hmm, cognitive dissonance on my part there. Couldn't bring myself to remember something which didn't make sense... #1603 .
Thanks for your response @eyalroz.
From the perspective of your average ML developer, accessing a tuple - a row of a table, doesn't have a real distinction from accessing a column. A.T.T = A
after all. From a math perspective the orientation only matters to determine order of operations. At the user level it would be nice to have the ability to treat rows and columns interchangeably, including, at least, fast transpose. I partially understand your perspective coming from a database design standpoint.
Ok - at the python level column naming has been causing me some real headaches - sounds like it is my problem (literally).
This is specifically why I wanted to bring it up and at least get it thought about here. Following up on @harrism as well - it is possible to encode rows and columns together in contiguous data and index into it in O(1). I agree it could get complicated, and obviously single-type columns are much more efficient than columns with mixed type. I want you guys thinking about it and knowing that Pandas people are asking for it, because transpose and row/column major conversions are used very frequently and they're slow like a barge in python, unless the library is written to just swap labels in python-native data (like Pandas). Then it's just slow like python.
I'm sorry if I'm way out in left field here. I have a fantasy that with careful design we could, for example, transpose 10GB of columns in a few operations, simply by shuffling labels around. This would be powerful.
Users can then write `cudf::column cool_column = coolalg(args...)` and not worry about managing them, or they can allocate a memory pool and do whatever manual allocations they want, trusting that when the memory pool has been destructed then they have cleaned up after themselves.
Creating and interfacing with the `gdf_column` C struct is rife with problems. We need better abstraction(s) to make libcudf a safer and more pleasant library to use and develop.

In lieu of adding support for any new column types, the initial goal of a `cudf::column` design is to ease the current pain points in creating and interfacing with the column data structure in libcudf.

Goals:
- Replace the `gdf_column` structure

Non-Goals
Note that a "Non-Goal" is not something that we want to expressly forbid in our redesign, but rather something that is not the focus of the current effort. Whenever possible, we can make design decisions that will enable these "Non-Goals" sometime in the future, so long as those decisions do not compromise the timely accomplishment of the above "Goals".

Process
1. Gather pain points
2. Derive requirements
3. Design Mock Up
4. Implementation
5. Initial Refactor: migrate a first candidate feature from `gdf_column` to the new abstraction(s)
6. Full Refactor