tpn opened this issue 1 day ago (status: Open)
WIP/draft PR here: https://github.com/rapidsai/cudf/pull/17115
Thank you for the detailed proposal! One question to better understand the WekaFS situation - what is the preferred method to perform IO in this case? GDS?
Anything other than mmap :-) Literally carrier pigeon would probably have been preferred. And these weren't weak-sauce network links; they were 200Gb NICs. In our case, we had a custom KvikioDataSource that essentially mirrored your file_source, except it never fell back to mmap -- it used kvikio.pread() for those small metadata calls (sketched below).
Edit: actually, just to clarify... the WekaFS pathological perf issue was traced to host read calls (because the 128KB device threshold wasn't met), and in your default implementation those reads ended up being serviced by mmap. (Saying that, though, I'm looking at the code and wondering how that happened -- I seem to recall GDS working properly for reads >128KB; it was just the <=128KB ones that hit the slow mmap path. But looking at the current cudf code, that would imply the initial cufile_integration check failed (or even that CUFILE_FOUND wasn't defined), resulting in mmap for everything.)
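For concreteness, here's a from-memory sketch of what our internal datasource did for those small host reads. This is illustrative only, not the actual internal code; class and field names are placeholders, and it relies on kvikio::FileHandle::pread() accepting host pointers:

// Sketch (from memory) of a KvikIO-only datasource that never falls back to
// mmap: small metadata reads are serviced by kvikio::FileHandle::pread()
// against a host buffer. Names here are illustrative, not the real code.
#include <cudf/io/datasource.hpp>
#include <kvikio/file_handle.hpp>

#include <cstdint>
#include <memory>
#include <string>
#include <vector>

class kvikio_only_source : public cudf::io::datasource {
 public:
  explicit kvikio_only_source(std::string const& filepath) : _fh{filepath, "r"} {}

  // Host reads (e.g. <128KB parquet footer/metadata reads) go through KvikIO's
  // host path instead of an mmap'd region.
  size_t host_read(size_t offset, size_t size, uint8_t* dst) override
  {
    return _fh.pread(dst, size, offset).get();
  }

  std::unique_ptr<buffer> host_read(size_t offset, size_t size) override
  {
    std::vector<uint8_t> data(size);
    data.resize(host_read(offset, size, data.data()));
    return buffer::create(std::move(data));
  }

  [[nodiscard]] size_t size() const override { return _fh.nbytes(); }

 private:
  kvikio::FileHandle _fh;
};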
So the solution I propose would have a create() method that looks something like this:
std::unique_ptr<datasource> datasource::create(std::string const& filepath,
                                               size_t offset,
                                               size_t max_size_estimate,
                                               size_t min_size_estimate,
                                               datasource_kind kind,
                                               std::optional<datasource_params> const& params)
{
  CUDF_EXPECTS(max_size_estimate == 0 or min_size_estimate <= max_size_estimate,
               "Invalid min/max size estimates for datasource creation");

  switch (kind) {
    case datasource_kind::KVIKIO:
    case datasource_kind::KVIKIO_COMPAT:
    case datasource_kind::KVIKIO_GDS: {
      kvikio_datasource_params new_params;
      if (params) {
        if (auto kvikio_params = std::get_if<kvikio_datasource_params>(&params.value())) {
          // Copy the user-provided parameters into our local variable.
          new_params = *kvikio_params;
        } else {
          throw cudf::logic_error("Invalid parameters for KVIKIO-based datasource.");
        }
      }
      if (kind == datasource_kind::KVIKIO_COMPAT) {
        // Forcibly set the compatibility mode to true, regardless of what may
        // already be present in the params. The `kind` parameter has requested
        // `KVIKIO_COMPAT`, and that takes precedence over the `use_compat_mode`
        // parameter in the `kvikio_datasource_params`.
        new_params.use_compat_mode = true;
      } else if (kind == datasource_kind::KVIKIO_GDS) {
        // GDS is unique in that we are expected to throw a cudf::runtime_error
        // if GDS is not available. The first chance we have to do this is
        // here, by way of fencing against CUFILE_FOUND.
#ifndef CUFILE_FOUND
        throw cudf::runtime_error("GDS is not available because cuFile is not enabled.");
#endif
        // The next check is done against the `is_gds_enabled()` function in
        // `cufile_integration`. If GDS is not enabled, we throw a runtime
        // error here as well.
        if (!cufile_integration::is_gds_enabled()) {
          throw cudf::runtime_error("cuFile reports GDS is not available.");
        }
        // Forcibly set the compatibility mode to false, regardless of what may
        // already be present in the params. The `kind` parameter has requested
        // `KVIKIO_GDS`, and that takes precedence over the `use_compat_mode`
        // parameter in the `kvikio_datasource_params`.
        new_params.use_compat_mode = false;
      } else {
        CUDF_EXPECTS(kind == datasource_kind::KVIKIO,
                     "Invariant check failed: kind != datasource_kind::KVIKIO");
        // We don't need to do any special handling for `KVIKIO` here.
      }
      return std::make_unique<kvikio_source>(filepath.c_str(), new_params);
    }
    case datasource_kind::HOST: return std::make_unique<host_source>(filepath.c_str());
    case datasource_kind::ODIRECT: {
      odirect_datasource_params new_params;
      if (params) {
        if (auto odirect_params = std::get_if<odirect_datasource_params>(&params.value())) {
          // Copy the user-provided parameters into our local variable.
          new_params = *odirect_params;
        } else {
          throw cudf::logic_error("Invalid parameters for O_DIRECT-based datasource.");
        }
      }
      return std::make_unique<odirect_source>(filepath.c_str(), new_params);
    }
    case datasource_kind::HOST_MMAP:
      return std::make_unique<memory_mapped_source>(
        filepath.c_str(), offset, max_size_estimate, min_size_estimate);
    default: CUDF_FAIL("Unsupported datasource kind");
  }
}
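Call sites could then opt into a specific implementation explicitly. Hypothetical usage, assuming kind/params get defaults and that kvikio_datasource_params carries the knobs described further down:

// Hypothetical call sites for the proposed create() overload (names follow the
// sketch above; defaults for kind/params are assumed).

// Default: KvikIO picks GDS vs. compat mode on its own.
auto src = cudf::io::datasource::create("data.parquet");

// Require GDS; throws cudf::runtime_error if cuFile/GDS is unavailable.
auto gds_src = cudf::io::datasource::create(
  "data.parquet", 0, 0, 0, cudf::io::datasource_kind::KVIKIO_GDS);

// KvikIO compat mode with a custom device-read threshold (field name is a
// placeholder from this proposal).
cudf::io::kvikio_datasource_params kv_params{};
kv_params.device_read_threshold = 64 << 10;
auto compat_src = cudf::io::datasource::create("data.parquet",
                                                0,
                                                0,
                                                0,
                                                cudf::io::datasource_kind::KVIKIO_COMPAT,
                                                cudf::io::datasource_params{kv_params});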
There would be a new host_source that does host-only POSIX pread() calls; no async support, no device support.
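A minimal sketch of that host_source, assuming plain blocking POSIX I/O (illustrative only; error handling trimmed):

// Illustrative host_source sketch: blocking POSIX pread(), nothing else.
#include <cudf/io/datasource.hpp>
#include <cudf/utilities/error.hpp>

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#include <vector>

class host_source : public cudf::io::datasource {
 public:
  explicit host_source(std::string const& filepath) : _fd{::open(filepath.c_str(), O_RDONLY)}
  {
    CUDF_EXPECTS(_fd != -1, "Cannot open input file");
    struct stat st {};
    CUDF_EXPECTS(::fstat(_fd, &st) == 0, "Cannot stat input file");
    _size = static_cast<size_t>(st.st_size);
  }
  ~host_source() override { ::close(_fd); }

  size_t host_read(size_t offset, size_t size, uint8_t* dst) override
  {
    // pread() may return short counts, so loop until the request is satisfied
    // (or EOF/error).
    size_t total = 0;
    while (total < size) {
      auto const n = ::pread(_fd, dst + total, size - total, offset + total);
      if (n <= 0) break;
      total += static_cast<size_t>(n);
    }
    return total;
  }

  std::unique_ptr<buffer> host_read(size_t offset, size_t size) override
  {
    std::vector<uint8_t> data(size);
    data.resize(host_read(offset, size, data.data()));
    return buffer::create(std::move(data));
  }

  [[nodiscard]] size_t size() const override { return _size; }

 private:
  int _fd{-1};
  size_t _size{};
};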
An odirect_source would derive from host_source, doing all of its sector-aligned reads against a file descriptor opened with O_DIRECT, but delegating the read of the final bytes of the file (unless the file size is an exact multiple of the sector size) to the underlying host_source's host_read().
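The alignment handling could look roughly like this. This is a sketch, not the PR code: it assumes _odirect_fd was opened with O_RDONLY | O_DIRECT in the constructor, that _sector_size comes from odirect_datasource_params, and it elides the <algorithm>, <cstdlib>, and <cstring> includes:

// Illustrative sketch of the O_DIRECT read path.
size_t odirect_source::host_read(size_t offset, size_t size, uint8_t* dst)
{
  auto const read_end    = std::min(offset + size, this->size());
  // Everything up to the last sector boundary can be read with O_DIRECT.
  auto const aligned_end = (read_end / _sector_size) * _sector_size;

  size_t total = 0;
  if (offset < aligned_end) {
    // O_DIRECT requires the offset, length, and buffer address to all be
    // sector-aligned, so read into an aligned bounce buffer and copy out.
    auto const aligned_offset = (offset / _sector_size) * _sector_size;
    auto const aligned_size   = aligned_end - aligned_offset;
    auto* bounce = static_cast<uint8_t*>(std::aligned_alloc(_sector_size, aligned_size));
    auto const n    = ::pread(_odirect_fd, bounce, aligned_size, aligned_offset);
    auto const skip = offset - aligned_offset;
    if (n > 0 && static_cast<size_t>(n) > skip) {
      total = std::min(static_cast<size_t>(n) - skip, read_end - offset);
      std::memcpy(dst, bounce + skip, total);
    }
    std::free(bounce);
  }
  // Delegate the unaligned tail (the final partial sector of the file) to the
  // plain pread()-based host_source implementation.
  if (offset + total < read_end) {
    total += host_source::host_read(offset + total, read_end - (offset + total), dst + total);
  }
  return total;
}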
memory_mapped_source you've already implemented; this proposal would just let you explicitly ask for mmap via kind = datasource_kind::HOST_MMAP.
The kvikio_source would be new. It would be similar to your current file_source implementation, in that it has a private kvikio::FileHandle that it dispatches device-read and host-read calls to. But it would also be cognizant of the kvikio_datasource_params, which let a user customize the thread pool size, the task size, whether GDS is used at all (use_compat_mode), and the device read threshold. Additionally, it needs to be cognizant of whether the kind is KVIKIO_GDS and keep the guarantee of raising a runtime error if GDS isn't available at runtime.
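For example, the device-read dispatch might look something like this. Again a sketch only: the _params field names are placeholders from this proposal, _fh is the private kvikio::FileHandle, and includes are elided:

// Illustrative sketch of kvikio_source::device_read, not the actual PR code.
size_t kvikio_source::device_read(size_t offset,
                                  size_t size,
                                  uint8_t* dst,
                                  rmm::cuda_stream_view stream)
{
  if (size < _params.device_read_threshold) {
    // Below the threshold: stage through a host buffer instead of issuing a
    // tiny GDS read (this is the case that hurt so badly on WekaFS via mmap).
    std::vector<uint8_t> host_buf(size);
    auto const n = _fh.pread(host_buf.data(), size, offset).get();
    CUDF_CUDA_TRY(
      cudaMemcpyAsync(dst, host_buf.data(), n, cudaMemcpyHostToDevice, stream.value()));
    stream.synchronize();
    return n;
  }
  // At or above the threshold: hand the read straight to KvikIO, which uses GDS
  // when available (or its own bounce buffers when use_compat_mode is set).
  return _fh.pread(dst, size, offset, _params.task_size).get();
}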
At a prior company that made heavy use of cudf, we ran into serious performance problems with KvikIO/GDS when small reads were furnished by an mmap'd datasource because they did not meet the configured device read threshold (e.g. reads of compressed metadata, which were frequently under 128KB).
This was because the underlying parquet files lived on a network file system--WekaFS in our case--and mmap'd I/O against WekaFS had pathological performance, especially when lots of little <128KB reads were paged in 4KB at a time over the network in a very inefficient way.
Additionally, we found ourselves needing a datasource that did O_DIRECT host reads for everything, so that the Linux page cache could be avoided entirely. This was useful when doing back-to-back benchmarks of different features, where the presence or absence of data in the page cache had a huge impact on runtimes: it was impossible to tell whether a PR purporting a 5% perf boost was actually delivering it when back-to-back runs varied by 50% simply due to cold vs. hot page cache data.
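To give a feel for the intended usage once this lands, something like the following (hypothetical; it assumes the proposed create() overload and ODIRECT kind) would let back-to-back parquet benchmarks bypass the page cache entirely:

// Hypothetical benchmarking usage with the proposed ODIRECT kind: the page
// cache is bypassed, so repeated runs see the same (cold) I/O cost.
auto src = cudf::io::datasource::create(
  "data.parquet", 0, 0, 0, cudf::io::datasource_kind::ODIRECT);
auto options =
  cudf::io::parquet_reader_options::builder(cudf::io::source_info(src.get())).build();
auto table = cudf::io::read_parquet(options);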
I'm working on reimplementing in cudf the various improvements we introduced in our internal codebase, so that they can be used by others. The approach that seems most viable, assuming you want to keep all datasource implementation innards behind datasource.cpp, is to allow cudf::io::datasource::create() to take additional datasource_kind and datasource_params parameters; see the create() sample above. I'm actively hacking on a branch locally; I'll throw up a draft PR soon.
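Concretely, the new parameters would look roughly like this (a strawman sketch; names and defaults are up for discussion, not a final API):

// Strawman declarations for the proposed knobs.
#include <cstddef>
#include <variant>

// Selects the concrete datasource implementation.
enum class datasource_kind {
  KVIKIO,         // KvikIO decides between GDS and its compatibility mode (default)
  KVIKIO_COMPAT,  // KvikIO forced into compatibility (host I/O) mode
  KVIKIO_GDS,     // KvikIO with GDS required; create() throws if GDS is unavailable
  HOST,           // plain POSIX pread() host reads
  HOST_MMAP,      // the existing memory-mapped source
  ODIRECT,        // host reads via O_DIRECT, bypassing the page cache
};

// Tuning knobs for the KvikIO-based sources.
struct kvikio_datasource_params {
  bool use_compat_mode{false};                   // force KvikIO compat mode
  std::size_t device_read_threshold{128 << 10};  // reads >= this use the device path
  std::size_t task_size{4 << 20};                // KvikIO per-task I/O size
  std::size_t thread_pool_nthreads{16};          // KvikIO thread pool size
};

// Tuning knobs for the O_DIRECT source.
struct odirect_datasource_params {
  std::size_t sector_size{4096};  // alignment required for O_DIRECT reads
};

// create() accepts either parameter struct.
using datasource_params = std::variant<kvikio_datasource_params, odirect_datasource_params>;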