rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] cuio: allow datasource and data_sink to decide how device/host read/write/copies are handled. #6187

Open cwharris opened 4 years ago

cwharris commented 4 years ago

Reader and writer implementations have multiple paths for reading and writing, depending whether data is read-from/written-to the host or the device. This logic could be delegated to datasource and data_sink, potentially sharing the functionality between all datasources and data_sinks.

This would eliminate the need for supports_device_write and supports_device_read, which could also reduce the surface area of testing.

note: supports_device_read is currently unused.

Examples:
https://github.com/rapidsai/cudf/blob/f4735c7f658da4a157dc09391da899b072878305/cpp/src/io/csv/writer_impl.cu#L412-L442
https://github.com/rapidsai/cudf/blob/f4735c7f658da4a157dc09391da899b072878305/cpp/src/io/parquet/writer_impl.cu#L882-L913

Looks like ORC doesn't have this logic at all, which might be a bug? https://github.com/rapidsai/cudf/blob/f4735c7f658da4a157dc09391da899b072878305/cpp/src/io/orc/writer_impl.cu#L1246
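For reference, the pattern in the linked CSV/Parquet writers looks roughly like this (a simplified paraphrase, not the exact cudf code; out_sink, dev_bfr, and host_bfr stand in for the writer's own variables):

// Simplified paraphrase of the current writer-side pattern.
// The writer itself decides which copy path to take based on the sink's capabilities.
if (out_sink->supports_device_write()) {
  // The sink can consume device memory directly; no host staging needed.
  out_sink->device_write(dev_bfr, size, stream);
} else {
  // The sink only accepts host memory: copy device -> pinned host buffer, then write.
  cudaMemcpyAsync(host_bfr.get(), dev_bfr, size, cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);
  out_sink->host_write(host_bfr.get(), size);
}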

Hypothetical Source/Sink APIs

data_kind is used to describe the caller-owned buffer.

// note: data_kind is not used to determine the sink/source buffer's kind.
//       the sink/source buffer kind is an implementation detail of the given sink/source.
enum class data_kind {
  host,  // used when the caller is writing data from host, or reading data to host
  device // used when the caller is writing data from device, or reading data to device
};

An API such as this delegates the read/write logic to the source/sink, but gives enough information for the source/sink to determine how data should be copied, and whether or not a sync is necessary to perform the copies. For instance, if the specific source implementation reads its data from device memory and the read(...) call is made with data_kind::device, then the source has enough information to execute a device-to-device copy, with or without syncing the stream (perhaps an optional method should be added to ensure the sync has taken place).

class base_source_context {
 public:
  base_source_context(cudaStream_t stream) : stream(stream) {}
  virtual ~base_source_context() = default;

  // reads up to `size` bytes into the caller-owned buffer `data`, which is of `kind`.
  virtual size_t read(uint8_t* data, size_t size, data_kind kind) = 0;

 private:
  cudaStream_t stream;
};

class base_source {
 public:
  virtual std::unique_ptr<base_source_context> begin_read(cudaStream_t stream) = 0;
};

class base_sink_context {
 public:
  base_sink_context(cudaStream_t stream) : stream(stream) {}
  virtual ~base_sink_context() = default;

  // writes `size` bytes from the caller-owned buffer `data`, which is of `kind`.
  virtual void write(uint8_t const* data, size_t size, data_kind kind) = 0;

 private:
  cudaStream_t stream;
};

class base_sink {
 public:
  virtual std::unique_ptr<base_sink_context> begin_write(cudaStream_t stream) = 0;
};
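For instance, a sink whose underlying buffer lives in device memory could pick the copy direction from data_kind alone. A minimal sketch, assuming the base_sink_context/data_kind definitions above plus <cuda_runtime.h> (device_buffer_sink_context and its members are made-up names, just to show the shape):

// Hypothetical sink context that appends into a device-resident buffer.
class device_buffer_sink_context : public base_sink_context {
 public:
  // `dst` is a device-resident destination buffer owned by the sink.
  device_buffer_sink_context(uint8_t* dst, cudaStream_t stream)
    : base_sink_context(stream), dst_(dst), offset_(0), stream_(stream) {}

  void write(uint8_t const* data, size_t size, data_kind kind) override {
    // The caller tells us where *their* buffer lives; this sink already knows its
    // own buffer is on the device, so the copy direction follows directly.
    auto const copy_kind =
      kind == data_kind::device ? cudaMemcpyDeviceToDevice : cudaMemcpyHostToDevice;
    cudaMemcpyAsync(dst_ + offset_, data, size, copy_kind, stream_);
    offset_ += size;
  }

 private:
  uint8_t* dst_;
  size_t offset_;
  cudaStream_t stream_;  // kept locally, since the base sketch keeps its stream private
};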
vuule commented 4 years ago

I had the same suggestion in the PR that added the supports_device_write logic. It was rejected because readers/writers have the context to better optimize the IO operations compared to having the host/device logic in the source/sink.

Looks like ORC doesn't have this logic at all, which might be a bug?

Yeah, it's a missing feature. In all components but Parquet. We'll be adding this logic to every format soon, including readers. So, supports_device_read will also be used very soon :)

cwharris commented 4 years ago

because readers/writers have the context to better optimize the IO operations compared to having the host/device logic in the source/sink.

:(

Any chance you have the PR number? I'd like to see the reasoning. I'm imagining the control could still be inverted...

vuule commented 4 years ago

@cwharris here you go: https://github.com/rapidsai/cudf/pull/4231#discussion_r384104534

cwharris commented 4 years ago

Is this device or host memory we're concerned with? @nvdbaranec

nvdbaranec commented 4 years ago

Are you asking about my comment re: having to alloc/free on the spot for a default implementation? The thinking there was that it would have to allocate host memory, do a memcpy, then do its write, then immediately free that memory, even though it would probably be doing that multiple times. If you look at how the code is structured here: https://github.com/rapidsai/cudf/blob/ac39e372d771b604b64d1c704c2c846ee1edde2d/cpp/src/io/parquet/writer_impl.cu#L835

It allocates one worst-case-sized buffer and then reuses it for multiple write calls. A default implementation would have no context for what's going on and would have to alloc/free every time, which is undesirable from a performance perspective (doubly so because we're using pinned memory here).
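In other words, a context-free default path would degenerate into something like the sketch below on every call (default_device_write is a hypothetical helper, just to illustrate the overhead; the per-call pinned alloc/free is the expensive part):

// Hypothetical per-call default for a sink that only supports host writes.
// Assumes <cuda_runtime.h> and cudf/io/data_sink.hpp.
void default_device_write(cudf::io::data_sink& sink,
                          uint8_t const* dev_data,
                          size_t size,
                          cudaStream_t stream)
{
  uint8_t* host_bfr = nullptr;
  cudaMallocHost(reinterpret_cast<void**>(&host_bfr), size);  // pinned alloc, every call
  cudaMemcpyAsync(host_bfr, dev_data, size, cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // the host write can't start until the copy lands
  sink.host_write(host_bfr, size);
  cudaFreeHost(host_bfr);         // and freed again immediately
}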

cwharris commented 4 years ago

If the invocation sites are bifurcated due to the need for multiple passes using the same buffer, we could use a two-step API instead. The first step uses RAII to obtain a worst-case-size buffer, and the second step writes using that buffer. The second step could be called any number of times. Since the buffer type can differ (or simply be unnecessary), we could wrap it in a context to hide the details. For instance:

{
  // may or may not create a buffer, depending on what type of sink is being used.
  auto sink_context = out_sink->create_write_context(max_chunk_bfr_size);
  for (; r < rnext; r++, global_r++) {
    for (auto i = 0; i < num_columns; i++) {
      ...
      // writes using the worst-case-size buffer, if one was required.
      sink_context.write(dev_bfr + ck->ck_stat_size, ck->compressed_size, state.stream);
      ...
    }
  }   
}
// buffer goes out of scope and associated memory is released

The create_write_context API should accept any relevant information about the data that will be written, such as the number of passes and the size of each pass.
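A minimal shape for that, with hypothetical names, might be:

// Hypothetical hints passed to create_write_context; names are illustrative only.
struct write_context_hints {
  size_t max_chunk_size;  // worst-case size of a single write() call
  size_t num_chunks;      // expected number of write() calls per pass
};

// The sink-side factory could then size (or skip) its staging buffer up front, e.g.:
//   std::unique_ptr<sink_write_context> create_write_context(write_context_hints hints);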

Alternatively, we could use a memory allocator behind the scenes, but that could be more complicated given that the situation is pretty clearly defined in this case.

vuule commented 4 years ago

There is a plan to improve performance by processing files in batches (when possible) so that IO can be overlapped with kernels. We don't currently have the infrastructure to make this change. Since this would significantly change how the sources/sinks are used, I would postpone addressing this issue for now. In addition, I don't think this part of the code is what we need to focus on in the refactoring effort. There are many other components that would benefit more from clean up.

vuule commented 4 years ago

IMO this is not a 'tech debt' issue, removed the tag. Depending on future changes to how IO is pipelined, this issue might not get implemented. Keeping open for now.

github-actions[bot] commented 3 years ago

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

vuule commented 3 years ago

Keeping this open, will be potentially addressed at a later date.