cwharris opened 4 years ago
I had the same suggestion in the PR that added the `supports_device_write` logic. It was rejected because readers/writers have the context to better optimize the IO operations compared to having the host/device logic in the source/sink.
Looks like ORC doesn't have this logic at all, which might be a bug?
Yeah, it's a missing feature in all components but Parquet. We'll be adding this logic to every format soon, including readers, so `supports_device_read` will also be used very soon :)
> because readers/writers have the context to better optimize the IO operations compared to having the host/device logic in the source/sink.
:(
Any chance you have the PR number? I'd like to see the reasoning. I imagine the control could still be inverted...
@cwharris here you go: https://github.com/rapidsai/cudf/pull/4231#discussion_r384104534
Is this device or host memory we're concerned with? @nvdbaranec
Are you asking about my comment re: having to alloc/free on the spot for a default implementation? The thinking there was that it would have to allocate host memory, do a memcpy, then do its write, then immediately free that memory, even though it would probably be doing that multiple times. If you look at how the code is structured here: https://github.com/rapidsai/cudf/blob/ac39e372d771b604b64d1c704c2c846ee1edde2d/cpp/src/io/parquet/writer_impl.cu#L835
It allocates one worst-case-sized buffer and then reuses it for multiple write calls. A default implementation would have no context about what's going on and would have to alloc/free every time, which is undesirable from a performance perspective (doubly so because we're using pinned memory here).
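For illustration, here's a minimal sketch (all names hypothetical, not cudf's) contrasting a context-free default `device_write`, which must allocate and free a staging buffer on every call, with the writer's approach of reusing one worst-case-sized buffer:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Tracks how many staging-buffer allocations each strategy performs.
struct alloc_counter {
  int allocs = 0;
};

// Context-free default path: every call allocates a fresh staging buffer
// (which would be pinned host memory), copies into it, writes, and frees it.
void default_device_write(const char* dev_src, std::size_t n, alloc_counter& c) {
  std::vector<char> staging(n);             // stand-in for a pinned host allocation
  ++c.allocs;
  std::memcpy(staging.data(), dev_src, n);  // stand-in for a D2H cudaMemcpy
  // ... write `staging` to the sink; the buffer is freed on return
}

// Writer-style path: one worst-case-sized buffer allocated up front and
// reused across every write call.
struct reusing_writer {
  std::vector<char> staging;
  alloc_counter& counter;

  reusing_writer(std::size_t worst_case, alloc_counter& c)
      : staging(worst_case), counter(c) {
    ++counter.allocs;
  }

  void device_write(const char* dev_src, std::size_t n) {
    std::memcpy(staging.data(), dev_src, n);  // reuse: no alloc/free per call
    // ... write `staging` to the sink
  }
};
```

A hundred chunk writes cost a hundred allocations in the first case but only one in the second, which is the overhead being pointed at here.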
If the invocation sites are bifurcated due to the need for multiple passes using the same buffer, we could use a two-step API instead. The first step uses RAII to obtain a worst-case-size buffer, and the second step writes using that buffer. The second step could be called any number of times. Since the buffer type can differ (or simply be unnecessary), we could wrap it in a context to hide the details. For instance:
```cpp
{
  // may or may not create a buffer, depending on what type of sink is being used.
  auto sink_context = out_sink->create_write_context(max_chunk_bfr_size);
  for (; r < rnext; r++, global_r++) {
    for (auto i = 0; i < num_columns; i++) {
      ...
      // writes using the worst-case-size buffer, if one was required.
      sink_context.write(dev_bfr + ck->ck_stat_size, ck->compressed_size, state.stream);
      ...
    }
  }
}
// buffer goes out of scope and the associated memory is released
```
That `create_write_context` API should accept any relevant information about the data that's going to be written, with regards to the number of passes and the size of each pass.
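A rough sketch of what that two-step shape could look like (hypothetical names throughout; plain host memory stands in for pinned and device memory):

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Step 2 of the proposed API: a context that can be written to repeatedly.
class write_context {
 public:
  virtual ~write_context() = default;
  virtual void write(const char* dev_data, std::size_t size) = 0;
};

// Context for a sink that only accepts host memory: the worst-case bounce
// buffer is allocated once, RAII-style, and reused for every write.
class host_write_context : public write_context {
  std::vector<char> bounce_;  // stand-in for pinned host memory
  std::vector<char>& out_;    // stand-in for the actual sink target
 public:
  host_write_context(std::size_t max_chunk_size, std::vector<char>& out)
      : bounce_(max_chunk_size), out_(out) {}

  void write(const char* dev_data, std::size_t size) override {
    // would be a D2H cudaMemcpyAsync into the pinned bounce buffer...
    std::memcpy(bounce_.data(), dev_data, size);
    // ...followed by a plain host write from that buffer
    out_.insert(out_.end(), bounce_.data(), bounce_.data() + size);
  }
};

// Step 1: the sink hands out a context sized for the worst-case chunk.
class sink {
  std::vector<char> bytes_;
 public:
  std::unique_ptr<write_context> create_write_context(std::size_t max_chunk_size) {
    return std::make_unique<host_write_context>(max_chunk_size, bytes_);
  }
  std::size_t size() const { return bytes_.size(); }
};
```

A device-capable sink could return a context that holds no buffer at all and writes device memory through directly, while the calling loop stays identical.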
Alternatively, we could use a memory allocator behind the scenes, but that could be more complicated given that the situation is pretty clearly defined in this case.
There is a plan to improve performance by processing files in batches (when possible) so that IO can be overlapped with kernels. We don't currently have the infrastructure to make this change. Since this would significantly change how the sources/sinks are used, I would postpone addressing this issue for now. In addition, I don't think this part of the code is what we need to focus on in the refactoring effort. There are many other components that would benefit more from clean up.
IMO this is not a 'tech debt' issue, removed the tag. Depending on future changes to how IO is pipelined, this issue might not get implemented. Keeping open for now.
This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
Keeping this open, will be potentially addressed at a later date.
Reader and writer implementations have multiple paths for reading and writing, depending on whether data is read-from/written-to the host or the device. This logic could be delegated to `datasource` and `data_sink`, potentially sharing the functionality between all `datasource`s and `data_sink`s. This would eliminate the need for `supports_device_write` and `supports_device_read`, which could also reduce the surface area of testing.

note: `supports_device_read` is currently unused.

Examples: https://github.com/rapidsai/cudf/blob/f4735c7f658da4a157dc09391da899b072878305/cpp/src/io/csv/writer_impl.cu#L412-L442 https://github.com/rapidsai/cudf/blob/f4735c7f658da4a157dc09391da899b072878305/cpp/src/io/parquet/writer_impl.cu#L882-L913
Looks like ORC doesn't have this logic at all, which might be a bug? https://github.com/rapidsai/cudf/blob/f4735c7f658da4a157dc09391da899b072878305/cpp/src/io/orc/writer_impl.cu#L1246
Hypothetical Source/Sink APIs
`data_kind` is used to describe the caller-owned buffer.

An API such as this delegates the read/write logic to the source/sink, but gives enough information for the source/sink to determine how data should be copied, and whether or not a sync is necessary to perform the copies. For instance, if the specific source implementation is reading data on device, and the `read(...)` call is made with `data_kind::device`, then the source has enough information to execute a device-to-device copy, with or without syncing the stream (perhaps an optional method should be added to ensure a sync has taken place).
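A minimal sketch of the shape such a source API could take (hypothetical names; a plain `memcpy` stands in for the CUDA copies a real implementation would issue):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Describes the caller-owned destination buffer.
enum class data_kind { host, device };

class source {
  std::vector<char> data_;  // pretend this is data resident on the device
 public:
  explicit source(std::vector<char> d) : data_(std::move(d)) {}

  // Reads up to `size` bytes at `offset` into the caller-owned `dst`,
  // whose location is described by `kind`. Returns the bytes read.
  std::size_t read(std::size_t offset, std::size_t size, char* dst,
                   data_kind kind) {
    std::size_t n = std::min(size, data_.size() - offset);
    // A real implementation would pick the copy direction from `kind`
    // (device-to-device vs device-to-host) and decide whether the stream
    // must be synced before returning; this mock just copies host memory.
    (void)kind;
    std::memcpy(dst, data_.data() + offset, n);
    return n;
  }
};
```

The same signature works for every source type; only the copy strategy inside differs, which is what makes `supports_device_read` unnecessary in this design.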