It's possible that we're hitting the OS limit for the number of open files in a single process.
Please try running the same code with the environment variable LIBCUDF_CUFILE_POLICY="OFF". The number of times we open each file should be reduced in that case, so libcudf shouldn't fail with fewer than ~1000 files.
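For reference, a minimal sketch of setting that from inside the program rather than the shell; the placement at the top of main() is an assumption, and the variable has to be set before any cudf I/O call so the policy is picked up:

```cpp
#include <cstdlib>

int main()
{
  // Hypothetical placement: set the policy before the first cudf I/O call
  // so it takes effect; the top of main() is one safe spot.
  setenv("LIBCUDF_CUFILE_POLICY", "OFF", /*overwrite=*/1);

  // ... rest of the program, including cudf::io::read_parquet calls ...
  return 0;
}
```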
Indeed, we're hitting the OS limit for the number of open files. Increasing systemd's DefaultLimitNOFILE to one million files gets rid of the exception, and the program successfully runs to completion.
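A per-process alternative to raising the systemd-wide default, in case it helps others: the soft descriptor limit can be lifted up to the permitted hard limit with the standard POSIX rlimit API. A sketch (not something libcudf does for you):

```cpp
#include <sys/resource.h>
#include <cstdio>

// Raise this process's soft limit on open file descriptors to the hard
// limit allowed by the system (e.g. the systemd DefaultLimitNOFILE cap).
bool raise_open_file_limit()
{
  rlimit lim{};
  if (getrlimit(RLIMIT_NOFILE, &lim) != 0) { return false; }
  lim.rlim_cur = lim.rlim_max;  // soft limit -> hard limit
  if (setrlimit(RLIMIT_NOFILE, &lim) != 0) { return false; }
  std::printf("open-file limit raised to %llu\n",
              static_cast<unsigned long long>(lim.rlim_cur));
  return true;
}
```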
However, I was curious why the previous limit of 1024 files already caused the program to fail at about 350 files. strace-ing the program, it turns out that cudf::io::read_parquet opens each file three times:
openat(AT_FDCWD, "$HOME/datasets/tpcds-sf1-custom/store_sales/ss_sold_date_sk=2452129/part-00008-67832d71-8413-4d53-88fa-f1c1d791638d.c000.snappy.parquet", O_RDONLY) = 13
fstat(13, {st_mode=S_IFREG|0644, st_size=117314, ...}) = 0
openat(AT_FDCWD, "$HOME/datasets/tpcds-sf1-custom/store_sales/ss_sold_date_sk=2452129/part-00008-67832d71-8413-4d53-88fa-f1c1d791638d.c000.snappy.parquet", O_RDONLY|O_CLOEXEC) = 14
openat(AT_FDCWD, "$HOME/datasets/tpcds-sf1-custom/store_sales/ss_sold_date_sk=2452129/part-00008-67832d71-8413-4d53-88fa-f1c1d791638d.c000.snappy.parquet", O_RDONLY|O_DIRECT|O_CLOEXEC) = 15
mmap(NULL, 117314, PROT_READ, MAP_PRIVATE, 13, 0) = 0x7f5f98cec000
...
Also, all files remain open at the same time, so at three descriptors per file the default limit of 1024 is exhausted after roughly 340 files, which matches the observed failure point.
Yes, limiting the number of files that are simultaneously opened within the reader would solve the issue without tweaking OS parameters.
Hello @LutzCle, I'm very glad you found an option to proceed with this workflow by increasing DefaultLimitNOFILE. As far as why we are opening the file three times, I expect it is to access the file header, footer, and contents separately. Please let us know if you would like to discuss further.
Yes, we open the file once to read metadata (e.g. footer) to system memory. kvikIO opens it again for device reads, but twice - with and without direct mode. Older GDS versions require direct mode, but non-GDS reads can be faster without direct mode, as caching can be leveraged. We can definitely open the files twice instead of three times. I believe newer GDS versions allow us to open the file only once, but this is a longer term item.
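To connect that to the strace output above, the three opens per file roughly correspond to the following POSIX calls (an illustration only, not libcudf/kvikIO source; the function name is made up):

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for O_DIRECT on glibc
#endif
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Rough illustration of the three opens seen in the strace output.
void open_like_the_reader(const char* path)
{
  // 1) Host-side open: metadata (e.g. the footer) is read via mmap.
  int fd_meta = open(path, O_RDONLY);
  struct stat st{};
  fstat(fd_meta, &st);
  void* map = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd_meta, 0);

  // 2) Device-read open without O_DIRECT: non-GDS reads can use the page cache.
  int fd_buffered = open(path, O_RDONLY | O_CLOEXEC);

  // 3) Device-read open with O_DIRECT: required by older GDS versions.
  int fd_direct = open(path, O_RDONLY | O_DIRECT | O_CLOEXEC);

  // All three descriptors stay open for the duration of the read,
  // so N input files consume roughly 3*N descriptors.
  munmap(map, st.st_size);
  close(fd_meta);
  close(fd_buffered);
  close(fd_direct);
}
```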
Thanks for getting back to me.
> We can definitely open the files twice instead of three times. I believe newer GDS versions allow us to open the file only once, but this is a longer term item.
That's fine, but that still opens O(N) files at the same time. Solving the issue would require opening only O(1) files at a time, no?
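Until the reader itself bounds its open-file count, a caller-side workaround could be to read in batches and concatenate the results, along these lines (a sketch assuming the public parquet_reader_options and cudf::concatenate APIs and that all files share a schema; the batch size of 64 and the helper name are arbitrary):

```cpp
#include <algorithm>
#include <memory>
#include <string>
#include <vector>

#include <cudf/concatenate.hpp>
#include <cudf/io/parquet.hpp>
#include <cudf/table/table.hpp>

// Caller-side workaround sketch: read `paths` in batches so only a bounded
// number of files is open at any time, then concatenate the partial tables.
std::unique_ptr<cudf::table> read_parquet_batched(std::vector<std::string> const& paths,
                                                  std::size_t batch_size = 64)
{
  std::vector<std::unique_ptr<cudf::table>> partials;
  for (std::size_t begin = 0; begin < paths.size(); begin += batch_size) {
    auto const end = std::min(begin + batch_size, paths.size());
    std::vector<std::string> batch(paths.begin() + begin, paths.begin() + end);

    auto options =
      cudf::io::parquet_reader_options::builder(cudf::io::source_info(batch)).build();
    // Descriptors for this batch are released when the read returns.
    partials.emplace_back(std::move(cudf::io::read_parquet(options).tbl));
  }

  std::vector<cudf::table_view> views;
  views.reserve(partials.size());
  for (auto const& t : partials) { views.push_back(t->view()); }
  return cudf::concatenate(views);
}
```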
Describe the bug
Passing a std::vector with hundreds of files to cudf::io::read_parquet leads the reader to throw an exception. This error occurs even though all files exist, are non-empty, and have the same schema. The files can be read just fine with cudf::io::read_parquet one-by-one in a for-loop. The exact number of files at which the exception is thrown is not deterministic; sometimes it works with, e.g., 351 files but fails on the next try.
Steps/Code to reproduce bug
Call cudf::io::read_parquet with a few hundred files. In my use case, I tried to read a Hive-partitioned TPC-DS dataset; see the Spark instructions on data generation. Result:
Code to reproduce the bug:
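The original snippet is not reproduced above; a minimal reproduction along these lines, with hypothetical file paths, triggers the failure once enough files are passed in a single call:

```cpp
#include <string>
#include <vector>

#include <cudf/io/parquet.hpp>

int main()
{
  // Hypothetical listing of a few hundred parquet files; in the real
  // workflow the paths come from walking the Hive-partitioned TPC-DS dataset.
  std::vector<std::string> files;
  for (int i = 0; i < 400; ++i) {
    files.push_back("store_sales/part-" + std::to_string(i) + ".snappy.parquet");
  }

  // Passing all files to a single read_parquet call keeps every file
  // (and several descriptors per file) open at once.
  auto options =
    cudf::io::parquet_reader_options::builder(cudf::io::source_info(files)).build();
  auto result = cudf::io::read_parquet(options);  // throws with the default fd limit
  return 0;
}
```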
Expected behavior
cudf::io::read_parquet should read the files and return the data in a cudf::table.

Environment overview (please complete the following information)
Environment details