Closed rgsl888prabhu closed 4 years ago
Describe the bug
Reading a str dataset with 64 columns in JSON format with cudf.read_json hangs indefinitely. From what has been observed so far, the issue appears to be in libcudf.
Attached file that reproduces the issue: jsonstrinfer_v2.zip
Steps/Code to reproduce bug
>>> import cudf
>>> cudf.read_json("~/datasets/jsonstrinfer_v2", compression="infer", lines=True, orient="records")
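Because the call never returns, it can help to run the reproduction in a child process with a timeout so the interactive session stays usable. The sketch below is only an illustration: `run_with_timeout` is a hypothetical helper (not part of cudf), and the 60-second limit is an arbitrary assumption.

```python
import subprocess
import sys


def run_with_timeout(code, timeout_s):
    """Run a Python snippet in a child process.

    Returns True if the child exited within timeout_s seconds (even with
    an error), False if it had to be killed because it was still running.
    """
    try:
        subprocess.run([sys.executable, "-c", code], timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired
        return False


# The call from the report, run in isolation.
# Requires cudf and the attached file at this path.
REPRO = (
    "import cudf\n"
    "cudf.read_json('~/datasets/jsonstrinfer_v2', compression='infer', "
    "lines=True, orient='records')\n"
)

if __name__ == "__main__":
    finished = run_with_timeout(REPRO, timeout_s=60)
    print("finished" if finished else "timed out (possible hang)")
```

Note that a child that fails quickly (for example because cudf is not installed) also counts as "finished" here; only a call that is still running at the deadline is reported as a possible hang.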
Expected behavior
>>> import pandas as pd
>>> pd.read_json("~/datasets/jsonstrinfer_v2", compression="infer", lines=True, orient="records")
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ... 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
0 e4:2 None \rVGk None 'dQ> None vHU\ None ~%5* None MINY None ]f7j None v\r*g None ... CDzm 9Qp| 0aQ dmop PdE- &0Ux v\reh |2q8 52`} <xxS bdu' None a.&s x=9= Qp9a K8Q`
1 ]2(S None 4,b] None +=hk None D2J, None aa*m None 0-/> None ]oQ" None { 2} None ... None None None None None None None None None None None c[ r "njF C&e( <Qg{ _:za
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4467 None 55q^ None []~f None Iopw None P/Tl None N+ 1 None @{L= None IBL; None j;.# ... cKkW None ;JWF None &/Hh None H#t None "q*) None xr-N None =3az oH"5 F0T^ N))[
[4468 rows x 64 columns]
Environment overview (please complete the following information)
Environment details
**git***
commit 3f995b1d11f2190adb1116125eae777fb3c893f0 (HEAD -> 5267_json_avro_benchmark)
Merge: f56dc0e68a 89247312e6
Author: Ramakrishna Prabhu
Date: Fri Sep 18 14:38:39 2020 -0500

Merge branch 'branch-0.16' of https://github.com/rapidsai/cudf into 5267_json_avro_benchmark

**git submodules***

***OS Information***
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.3 LTS"
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
Linux onepiece 5.4.0-42-generic #46~18.04.1-Ubuntu SMP Fri Jul 10 07:21:24 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

***GPU Information***
Fri Sep 18 17:48:07 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     Off  | 00000000:15:00.0 Off |                  Off |
| 33%   51C    P8    38W / 260W |      1MiB / 48601MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     Off  | 00000000:2D:00.0 Off |                  Off |
| 33%   37C    P8    16W / 260W |     89MiB / 48592MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1      1112      G   /usr/lib/xorg/Xorg                            87MiB |
+-----------------------------------------------------------------------------+

***CPU***
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz
Stepping: 4
CPU MHz: 1200.272
CPU max MHz: 3700.0000
CPU min MHz: 1200.0000
BogoMIPS: 6800.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 19712K
NUMA node0 CPU(s): 0-11
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke md_clear flush_l1d

***CMake***
/home/rgsl888/anaconda3/envs/cudf_dev/bin/cmake
cmake version 3.14.0
CMake suite maintained and supported by Kitware (kitware.com/cmake).

***g++***
/usr/bin/g++
g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

***nvcc***
/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

***Python***
/home/rgsl888/anaconda3/envs/cudf_dev/bin/python
Python 3.7.8

***Environment Variables***
PATH : /home/rgsl888/.local/bin:/home/rgsl888/anaconda3/envs/cudf_dev/bin:/home/rgsl888/anaconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin:/usr/local/cuda/bin
LD_LIBRARY_PATH : :/usr/local/cuda/lib64:/usr/local/cuda/lib64
NUMBAPRO_NVVM :
NUMBAPRO_LIBDEVICE :
CONDA_PREFIX : /home/rgsl888/anaconda3/envs/cudf_dev
PYTHON_PATH :

***conda packages***
/home/rgsl888/anaconda3/condabin/conda
# packages in environment at /home/rgsl888/anaconda3/envs/cudf_dev:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
abseil-cpp 20200225.2 he1b5a44_2 conda-forge
alabaster 0.7.12 py37_0
appdirs 1.4.4 py_0
argon2-cffi 20.1.0 py37h7b6447c_1
arrow-cpp 1.0.1 py37h1234567_1_cuda conda-forge
arrow-cpp-proc 1.0.1 cuda conda-forge
attrs 20.1.0 py_0
aws-sdk-cpp 1.7.164 hba45d7a_2 conda-forge
babel 2.8.0 py_0
backcall 0.2.0 py_0
black 19.10b0 py_4 conda-forge
bleach 3.1.5 py_0
bokeh 2.2.1 py37_0
boost-cpp 1.72.0 h9359b55_3 conda-forge
brotli 1.0.7 he6710b0_0
brotlipy 0.7.0 py37h7b6447c_1000
bzip2 1.0.8 h7b6447c_0
c-ares 1.16.1 h516909a_3 conda-forge
ca-certificates 2020.6.20 hecda079_0 conda-forge
certifi 2020.6.20 py37hc8dfbb8_0 conda-forge
cffi 1.14.0 py37h2e261b9_0
cfgv 3.2.0 py_0 conda-forge
chardet 3.0.4 py37_1003
clang 8.0.1 hc9558a2_2 conda-forge
clang-tools 8.0.1 hc9558a2_2 conda-forge
clangxx 8.0.1 2 conda-forge
click 7.1.2 py_0
cloudpickle 1.6.0 py_0
cmake 3.14.0 h52cb24c_0 anaconda
cmake_setuptools 0.1.3 py_0 rapidsai
commonmark 0.9.1 py_0
cryptography 3.1 py37h1ba5d50_0
cudatoolkit 10.2.89 h6bb024c_0 nvidia
cudf 0.8.0+23082.g5b1b19bb96 pypi_0 pypi
cudnn 7.6.5 cuda10.2_0
cupy 7.8.0 py37h940342b_1 conda-forge
curl 7.71.1 hbc83047_1
cython 0.29.21 py37h3340039_0 conda-forge
cytoolz 0.10.1 py37h7b6447c_0
dask 2.26.0+11.gc006c197 pypi_0 pypi
dask-core 2.24.0 py_0
dask-cudf 0.8.0+22298.g8ade7ca5eb.dirty pypi_0 pypi
decorator 4.4.2 py_0
defusedxml 0.6.0 py_0
distlib 0.3.1 pyh9f0ad1d_0 conda-forge
distributed 2.24.0+9.ga848ee0c pypi_0 pypi
dlpack 0.3 he1b5a44_1 conda-forge
docutils 0.16 py37_1
double-conversion 3.1.5 he1b5a44_2 conda-forge
editdistance 0.5.3 py37h3340039_1 conda-forge
entrypoints 0.3 py37_0
expat 2.2.9 he6710b0_2
fastavro 1.0.0.post1 py37h8f50634_0 conda-forge
fastrlock 0.5 py37he6710b0_0
filelock 3.0.12 py_0
flake8 3.8.3 py_1 conda-forge
flatbuffers 1.12.0 he1b5a44_0 conda-forge
freetype 2.10.2 h5ab3b9f_0
fsspec 0.8.2 py_0 conda-forge
future 0.18.2 py37_1
gflags 2.2.2 he6710b0_0
glog 0.4.0 he6710b0_0
gmp 6.1.2 h6c8ec71_1
grpc-cpp 1.30.2 heedbac9_0 conda-forge
heapdict 1.0.1 py_0
hypothesis 5.28.0 py_0 conda-forge
icu 67.1 he1b5a44_0 conda-forge
identify 1.4.29 pyh9f0ad1d_0 conda-forge
idna 2.10 py_0
imagesize 1.2.0 py_0
importlib-metadata 1.7.0 py37_0
importlib_metadata 1.7.0 0
iniconfig 1.0.1 py_0
ipykernel 5.3.4 py37h5ca1d4c_0
ipython 7.18.1 py37hc6149b9_0 conda-forge
ipython_genutils 0.2.0 py37_0
isort 5.0.7 py37hc8dfbb8_0 conda-forge
jedi 0.17.2 py37_0
jinja2 2.11.2 py_0
jpeg 9b h024ee3a_2
jsonschema 3.2.0 py37_1
jupyter_client 6.1.6 py_0
jupyter_core 4.6.3 py37_0
krb5 1.18.2 h173b8e3_0
lcms2 2.11 h396b838_0
ld_impl_linux-64 2.34 hc38a660_9 conda-forge
libblas 3.8.0 17_openblas conda-forge
libcblas 3.8.0 17_openblas conda-forge
libcurl 7.71.1 h20c2e04_1
libedit 3.1.20191231 h14c3975_1
libevent 2.1.10 hcdb4288_2 conda-forge
libffi 3.2.1 hd88cf55_4
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
liblapack 3.8.0 17_openblas conda-forge
libllvm10 10.0.1 hbcb73fb_5
libllvm8 8.0.1 hc9558a2_0 conda-forge
libopenblas 0.3.10 h5a2b251_0
libpng 1.6.37 hbc83047_0
libprotobuf 3.12.4 hd408876_0
librmm 0.16.0a200825 cuda10.2_g47c8346_264 rapidsai-nightly
libsodium 1.0.18 h7b6447c_0
libssh2 1.9.0 h1ba5d50_1
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.1.0 h2733197_1
libutf8proc 2.5.0 h516909a_2 conda-forge
llvmlite 0.34.0 py37h269e1b5_4
locket 0.2.0 py37_1
lz4-c 1.9.2 he6710b0_1
markdown 3.2.2 py37_0
markupsafe 1.1.1 py37h14c3975_1
mccabe 0.6.1 py37_1
mimesis 4.0.0 pyh9f0ad1d_0 conda-forge
mistune 0.8.4 py37h14c3975_1001
more-itertools 8.5.0 py_0
msgpack-python 1.0.0 py37hfd86e86_1
nbconvert 5.6.1 py37_1
nbformat 5.0.7 py_0
nbsphinx 0.7.1 pyh9f0ad1d_0 conda-forge
nccl 2.7.8.1 hc6a2c23_0 conda-forge
ncurses 6.2 he6710b0_1
nodeenv 1.5.0 pyh9f0ad1d_0 conda-forge
notebook 6.1.4 py37hc8dfbb8_0 conda-forge
numba 0.51.2 py37h9fdb41a_0 conda-forge
numpy 1.19.1 py37h7ea13bd_2 conda-forge
numpydoc 1.1.0 pyh9f0ad1d_0 conda-forge
olefile 0.46 py37_0
openssl 1.1.1g h516909a_1 conda-forge
packaging 20.4 pyh9f0ad1d_0 conda-forge
pandas 1.1.2 py37h3340039_0 conda-forge
pandavro 1.5.2 py_0 conda-forge
pandoc 1.19.2 0 conda-forge
pandocfilters 1.4.2 py37_1
parquet-cpp 1.5.1 2 conda-forge
parso 0.7.0 py_0
partd 1.1.0 py_0 conda-forge
pathspec 0.7.0 py_0
pexpect 4.8.0 py37_1
pickleshare 0.7.5 py37_1001
pillow 7.2.0 py37hb39fc2d_0
pip 20.2.3 py_0 conda-forge
pluggy 0.13.1 py37_0
pre-commit 2.7.1 py37hc8dfbb8_0 conda-forge
pre_commit 2.7.1 0 conda-forge
prometheus_client 0.8.0 py_0
prompt-toolkit 3.0.7 py_0
psutil 5.7.2 py37h7b6447c_0
ptyprocess 0.6.0 py37_0
py 1.9.0 py_0
py-cpuinfo 7.0.0 py_0
pyarrow 1.0.1 py37h1234567_1_cuda conda-forge
pycodestyle 2.6.0 py_0
pycparser 2.20 py_2
pyflakes 2.2.0 py_0
pygments 2.6.1 py_0
pynvml 8.0.4 py_1 conda-forge
pyopenssl 19.1.0 py_1
pyparsing 2.4.7 py_0
pyrsistent 0.16.0 py37h7b6447c_0
pysocks 1.7.1 py37_1
pytest 6.0.2 py37hc8dfbb8_0 conda-forge
pytest-benchmark 3.2.3 pyh9f0ad1d_0 conda-forge
python 3.7.8 h6f2ec95_1_cpython conda-forge
python-dateutil 2.8.1 py_0
python_abi 3.7 1_cp37m conda-forge
pytz 2020.1 py_0
pyyaml 5.3.1 py37h7b6447c_1
pyzmq 19.0.1 py37he6710b0_1
rapidjson 1.1.0 he1b5a44_1002 conda-forge
re2 2020.08.01 he1b5a44_1 conda-forge
readline 8.0 h7b6447c_0
recommonmark 0.6.0 py_0 conda-forge
regex 2020.7.14 py37h7b6447c_0
requests 2.24.0 py_0
rhash 1.3.8 h1ba5d50_0
rmm 0.16.0a200918 cuda_10.2_py37_gf4de029_379 rapidsai-nightly
send2trash 1.5.0 py37_0
setuptools 49.6.0 py37_0
six 1.15.0 py_0
snappy 1.1.8 he6710b0_0
snowballstemmer 2.0.0 py_0
sortedcontainers 2.2.2 py_0
spdlog 1.7.0 hc9558a2_2 conda-forge
sphinx 3.2.1 py_0 conda-forge
sphinx-copybutton 0.3.0 pyh9f0ad1d_0 conda-forge
sphinx-markdown-tables 0.0.14 pyh9f0ad1d_1 conda-forge
sphinx_rtd_theme 0.5.0 pyh9f0ad1d_0 conda-forge
sphinxcontrib-applehelp 1.0.2 py_0
sphinxcontrib-devhelp 1.0.2 py_0
sphinxcontrib-htmlhelp 1.0.3 py_0
sphinxcontrib-jsmath 1.0.1 py_0
sphinxcontrib-qthelp 1.0.3 py_0
sphinxcontrib-serializinghtml 1.1.4 py_0
sphinxcontrib-websupport 1.2.4 pyh9f0ad1d_0 conda-forge
sqlite 3.33.0 h62c20be_0
streamz 0.5.5 pypi_0 pypi
tblib 1.7.0 py_0
terminado 0.8.3 py37_0
testpath 0.4.4 py_0
thrift-cpp 0.13.0 h62aa4f2_3 conda-forge
tk 8.6.10 hbc83047_0
toml 0.10.1 py_0
toolz 0.10.0 py_0
tornado 6.0.4 py37h7b6447c_1
traitlets 4.3.3 py37_0
typed-ast 1.4.1 py37h7b6447c_0
typing_extensions 3.7.4.3 py_0
urllib3 1.25.10 py_0
virtualenv 20.0.20 py37hc8dfbb8_1 conda-forge
wcwidth 0.2.5 py_0
webencodings 0.5.1 py37_1
wheel 0.35.1 py_0
xz 5.2.5 h7b6447c_0
yaml 0.2.5 h7b6447c_0
zeromq 4.3.2 he6710b0_3
zict 2.0.0 py_0
zipp 3.1.0 py_0
zlib 1.2.11 h7b6447c_3
zstd 1.4.5 h9ceee32_0
A smaller dataset, sample.zip, reproduces the issue with just 2 rows and 9 columns. The issue seems to be caused by a single " character in the data.
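To probe the suspected cause without the attached archives, a tiny JSON Lines file whose string value contains a lone double quote can be generated. This is a sketch under assumptions: the file name `quote_repro.json` and the helper names are made up for illustration, and it has not been confirmed that this exact input triggers the hang. The standard-library `json` module serves as the reference reader; the same path could then be passed to `cudf.read_json(path, lines=True, orient="records")`.

```python
import json
import os
import tempfile


def write_quote_repro(path):
    """Write two JSON Lines records; one value holds a single '"' character."""
    rows = [
        {"0": 'ab"c', "1": None},  # lone double quote inside the string
        {"0": None, "1": "wxyz"},
    ]
    with open(path, "w") as f:
        for row in rows:
            # json.dumps escapes the quote as \" so each line is valid JSON
            f.write(json.dumps(row) + "\n")
    return rows


def read_json_lines(path):
    """Reference reader: parse each non-empty line as one JSON record."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    path = os.path.join(tempfile.gettempdir(), "quote_repro.json")
    expected = write_quote_repro(path)
    # A correct reader should return exactly the rows that were written.
    print(read_json_lines(path) == expected)
```

If a reader returns the same rows as the reference reader, the quote is being handled; a hang or a mismatch on this file would narrow the problem to quote handling.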