It's possible that we're hitting the OS limit for the number of open files in a single process.
Please try running the same code with the environment variable LIBCUDF_CUFILE_POLICY="OFF". The number of times we open each file should be reduced in that case, so libcudf shouldn't fail with fewer than ~1000 files.
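For reference, a minimal sketch of setting that from inside the program rather than the shell; the placement at the top of main() is an assumption, and the variable has to be set before any cudf I/O call so the policy is picked up:

```cpp
#include <cstdlib>

int main()
{
  // Hypothetical placement: set the policy before the first cudf I/O call
  // so it takes effect; the top of main() is one safe spot.
  setenv("LIBCUDF_CUFILE_POLICY", "OFF", /*overwrite=*/1);

  // ... rest of the program, including cudf::io::read_parquet calls ...
  return 0;
}
```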
Indeed, we're hitting the OS limit for the number of open files. Increasing systemd's DefaultLimitNOFILE to one million files gets rid of the exception, and the program successfully runs to completion.
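A per-process alternative to raising the systemd-wide default, in case it helps others: the soft descriptor limit can be lifted up to the permitted hard limit with the standard POSIX rlimit API. A sketch (not something libcudf does for you):

```cpp
#include <sys/resource.h>
#include <cstdio>

// Raise this process's soft limit on open file descriptors to the hard
// limit allowed by the system (e.g. the systemd DefaultLimitNOFILE cap).
bool raise_open_file_limit()
{
  rlimit lim{};
  if (getrlimit(RLIMIT_NOFILE, &lim) != 0) { return false; }
  lim.rlim_cur = lim.rlim_max;  // soft limit -> hard limit
  if (setrlimit(RLIMIT_NOFILE, &lim) != 0) { return false; }
  std::printf("open-file limit raised to %llu\n",
              static_cast<unsigned long long>(lim.rlim_cur));
  return true;
}
```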
However, I was curious why the previous limit of 1024 files already caused the program to fail at about 350 files. strace-ing the program, it turns out that cudf::io::read_parquet opens each file three times:
openat(AT_FDCWD, "$HOME/datasets/tpcds-sf1-custom/store_sales/ss_sold_date_sk=2452129/part-00008-67832d71-8413-4d53-88fa-f1c1d791638d.c000.snappy.parquet", O_RDONLY) = 13
fstat(13, {st_mode=S_IFREG|0644, st_size=117314, ...}) = 0
openat(AT_FDCWD, "$HOME/datasets/tpcds-sf1-custom/store_sales/ss_sold_date_sk=2452129/part-00008-67832d71-8413-4d53-88fa-f1c1d791638d.c000.snappy.parquet", O_RDONLY|O_CLOEXEC) = 14
openat(AT_FDCWD, "$HOME/datasets/tpcds-sf1-custom/store_sales/ss_sold_date_sk=2452129/part-00008-67832d71-8413-4d53-88fa-f1c1d791638d.c000.snappy.parquet", O_RDONLY|O_DIRECT|O_CLOEXEC) = 15
mmap(NULL, 117314, PROT_READ, MAP_PRIVATE, 13, 0) = 0x7f5f98cec000
...
Also, all files remain open at the same time, so at three descriptors per file the default limit of 1024 is exhausted after roughly 340 files, which matches the observed failure point.
Yes, limiting the number of files that are simultaneously opened within the reader would solve the issue without tweaking OS parameters.
Hello @LutzCle, I'm very glad you found an option to proceed with this workflow by increasing DefaultLimitNOFILE. As far as why we are opening the file three times, I expect it is to access the file header, footer, and contents separately. Please let us know if you would like to discuss further.
Yes, we open the file once to read metadata (e.g. footer) to system memory. kvikIO opens it again for device reads, but twice - with and without direct mode. Older GDS versions require direct mode, but non-GDS reads can be faster without direct mode, as caching can be leveraged. We can definitely open the files twice instead of three times. I believe newer GDS versions allow us to open the file only once, but this is a longer term item.
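To connect that to the strace output above, the three opens per file roughly correspond to the following POSIX calls (an illustration only, not libcudf/kvikIO source; the function name is made up):

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for O_DIRECT on glibc
#endif
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Rough illustration of the three opens seen in the strace output.
void open_like_the_reader(const char* path)
{
  // 1) Host-side open: metadata (e.g. the footer) is read via mmap.
  int fd_meta = open(path, O_RDONLY);
  struct stat st{};
  fstat(fd_meta, &st);
  void* map = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd_meta, 0);

  // 2) Device-read open without O_DIRECT: non-GDS reads can use the page cache.
  int fd_buffered = open(path, O_RDONLY | O_CLOEXEC);

  // 3) Device-read open with O_DIRECT: required by older GDS versions.
  int fd_direct = open(path, O_RDONLY | O_DIRECT | O_CLOEXEC);

  // All three descriptors stay open for the duration of the read,
  // so N input files consume roughly 3*N descriptors.
  munmap(map, st.st_size);
  close(fd_meta);
  close(fd_buffered);
  close(fd_direct);
}
```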
Thanks for getting back to me.
> We can definitely open the files twice instead of three times. I believe newer GDS versions allow us to open the file only once, but this is a longer term item.
That's fine, but that still opens O(N) files at the same time. Solving the issue would require opening only O(1) files at a time, no?
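Until the reader itself bounds its open-file count, a caller-side workaround could be to read in batches and concatenate the results, along these lines (a sketch assuming the public parquet_reader_options and cudf::concatenate APIs and that all files share a schema; the batch size of 64 and the helper name are arbitrary):

```cpp
#include <algorithm>
#include <memory>
#include <string>
#include <vector>

#include <cudf/concatenate.hpp>
#include <cudf/io/parquet.hpp>
#include <cudf/table/table.hpp>

// Caller-side workaround sketch: read `paths` in batches so only a bounded
// number of files is open at any time, then concatenate the partial tables.
std::unique_ptr<cudf::table> read_parquet_batched(std::vector<std::string> const& paths,
                                                  std::size_t batch_size = 64)
{
  std::vector<std::unique_ptr<cudf::table>> partials;
  for (std::size_t begin = 0; begin < paths.size(); begin += batch_size) {
    auto const end = std::min(begin + batch_size, paths.size());
    std::vector<std::string> batch(paths.begin() + begin, paths.begin() + end);

    auto options =
      cudf::io::parquet_reader_options::builder(cudf::io::source_info(batch)).build();
    // Descriptors for this batch are released when the read returns.
    partials.emplace_back(std::move(cudf::io::read_parquet(options).tbl));
  }

  std::vector<cudf::table_view> views;
  views.reserve(partials.size());
  for (auto const& t : partials) { views.push_back(t->view()); }
  return cudf::concatenate(views);
}
```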
Describe the bug
Passing a std::vector with hundreds of files to cudf::io::read_parquet leads the reader to throw an exception. This error occurs even though all files exist, are non-empty, and have the same schema. The files can be read just fine with cudf::io::read_parquet one-by-one in a for-loop. The exact number of files at which the exception is thrown is not deterministic; sometimes it works with, e.g., 351 files but fails on the next try.
Steps/Code to reproduce bug
Call cudf::io::read_parquet with a few hundred files. In my use case, I tried to read a Hive-partitioned TPC-DS dataset; see the Spark instructions on data generation. Result:
Code to reproduce the bug:
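The original snippet is not reproduced above; a minimal reproduction along these lines, with hypothetical file paths, triggers the failure once enough files are passed in a single call:

```cpp
#include <string>
#include <vector>

#include <cudf/io/parquet.hpp>

int main()
{
  // Hypothetical listing of a few hundred parquet files; in the real
  // workflow the paths come from walking the Hive-partitioned TPC-DS dataset.
  std::vector<std::string> files;
  for (int i = 0; i < 400; ++i) {
    files.push_back("store_sales/part-" + std::to_string(i) + ".snappy.parquet");
  }

  // Passing all files to a single read_parquet call keeps every file
  // (and several descriptors per file) open at once.
  auto options =
    cudf::io::parquet_reader_options::builder(cudf::io::source_info(files)).build();
  auto result = cudf::io::read_parquet(options);  // throws with the default fd limit
  return 0;
}
```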
Expected behavior
cudf::io::read_parquet should read the files and return the data in a cudf::table.

Environment overview (please complete the following information)
Environment details