pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Writing Parquet to JSON when row_oriented=True #15410

Open bobir01 opened 7 months ago

bobir01 commented 7 months ago


Reproducible example

from pathlib import Path
from tempfile import NamedTemporaryFile
import time

import polars as pl
from polars import selectors as sc

def get_pl_table(file: NamedTemporaryFile) -> pl.DataFrame:
    s_time = time.time()

    # Cast Binary columns to String and keep every non-Binary column as-is.
    df_lazy = pl.scan_parquet(file.name).select(
        sc.by_dtype(pl.Binary).cast(pl.String),
        sc.all().exclude(pl.Binary)
    ).collect()

    # Panics when row_oriented=True; works fine with the default column-oriented output.
    df_lazy.write_json(Path(__file__).parent / 'data' / "support_ticket_sla1.json", row_oriented=True)

Log output

/Users/bobdev/PycharmProjects/korzinkaGo/.venv/bin/python /Users/bobdev/PycharmProjects/korzinkaGo/aws_tasks/commpnl_upsert.py 
Time elapsed for s3 download: 3.2596969604492188

thread '<unnamed>' panicked at crates/polars-json/src/json/write/serialize.rs:494:18:
not yet implemented: Writing BinaryView to JSON
stack backtrace:
   0:        0x175468524 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h6aecb9d07bb8db1b
   1:        0x17362c24c - core::fmt::write::h57932930d5d73fd6
   2:        0x1754420d4 - std::io::Write::write_fmt::hb638df451817bf25
   3:        0x17546b3f0 - std::sys_common::backtrace::print::h6910c90959d8cad9
   4:        0x17546ad34 - std::panicking::default_hook::{{closure}}::h1bb1130b8dcb1188
   5:        0x17546c600 - std::panicking::rust_panic_with_hook::hfc5079cd9be86c57
   6:        0x17546b720 - std::panicking::begin_panic_handler::{{closure}}::h5532dad9383be66c
   7:        0x17546b684 - std::sys_common::backtrace::__rust_end_short_backtrace::h39762fc8d44c97d9
   8:        0x17546b678 - _rust_begin_unwind
   9:        0x1756092f4 - core::panicking::panic_fmt::hdaff94c2cbb4d934
  10:        0x174329e60 - polars_json::json::write::serialize::new_serializer::h41779dfb65e9fc7c
  11:        0x174328708 - polars_json::json::write::serialize::new_serializer::h41779dfb65e9fc7c
  12:        0x174328f70 - polars_json::json::write::serialize::new_serializer::h41779dfb65e9fc7c
  13:        0x174334630 - polars_json::json::write::serialize::serialize::h65b146cc2bb39732
  14:        0x1731b2da8 - <polars_io::json::JsonWriter<W> as polars_io::SerWriter<W>>::finish::h2955a06d621ed22b
  15:        0x1733f4ee0 - polars::dataframe::_::<impl polars::dataframe::PyDataFrame>::__pymethod_write_json__::hdb909103cdc666f3
  16:        0x172ffa130 - pyo3::impl_::trampoline::trampoline::hccfd42f2554b28e9
  17:        0x173554d70 - polars::dataframe::_::_::__INVENTORY::trampoline::h60e14f143153bc4a
  18:        0x102ce1b70 - _method_vectorcall_VARARGS_KEYWORDS
  19:        0x102e09160 - _call_function
  20:        0x102e0050c - __PyEval_EvalFrameDefault
  21:        0x102df9f44 - __PyEval_Vector
  22:        0x102cd6dcc - _method_vectorcall
  23:        0x102e09160 - _call_function
  24:        0x102dffc80 - __PyEval_EvalFrameDefault
  25:        0x102df9f44 - __PyEval_Vector
  26:        0x102e09160 - _call_function
  27:        0x102dffbfc - __PyEval_EvalFrameDefault
  28:        0x102df9f44 - __PyEval_Vector
  29:        0x102e09160 - _call_function
  30:        0x102dffbfc - __PyEval_EvalFrameDefault
  31:        0x102df9f44 - __PyEval_Vector
  32:        0x102e647cc - _pyrun_file
  33:        0x102e63f10 - __PyRun_SimpleFileObject
  34:        0x102e6355c - __PyRun_AnyFileObject
  35:        0x102e8f76c - _pymain_run_file_obj
  36:        0x102e8ee0c - _pymain_run_file
  37:        0x102e8e3f4 - _pymain_run_python
  38:        0x102e8e288 - _Py_RunMain
  39:        0x102e8f914 - _pymain_main
  40:        0x102e8fbd8 - _Py_BytesMain
Traceback (most recent call last):
  File "/Users/bobdev/PycharmProjects/korzinkaGo/aws_tasks/commpnl_upsert.py", line 118, in <module>
    main()
  File "/Users/bobdev/PycharmProjects/korzinkaGo/aws_tasks/commpnl_upsert.py", line 113, in main
    table = get_pl_table(file)
  File "/Users/bobdev/PycharmProjects/korzinkaGo/aws_tasks/commpnl_upsert.py", line 53, in get_pl_table
    df_lazy.write_json(Path(__file__).parent / 'data'/ "support_ticket_sla1.json", row_oriented=True)
  File "/Users/bobdev/PycharmProjects/korzinkaGo/.venv/lib/python3.10/site-packages/polars/dataframe/frame.py", line 2490, in write_json
    self._df.write_json(file, pretty, row_oriented)
pyo3_runtime.PanicException: not yet implemented: Writing BinaryView to JSON

Process finished with exit code 1

Issue description

When I enable the row_oriented behavior, it raises a NotImplemented exception, but everything works fine when I disable this option. I am trying to convert a Parquet file into JSON; the default JSON output (without row_oriented=True) is around 17 MB.

Expected behavior

It should output row-oriented JSON (one JSON object per row).
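For reference, a minimal sketch (tiny in-memory frame, illustrative column names) of the two output shapes:

import io
import polars as pl

df = pl.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# Default (column-oriented) output: one entry per column with its name, dtype and values.
col_buf = io.StringIO()
df.write_json(col_buf)

# row_oriented=True: a JSON array with one object per row,
# e.g. [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}].
row_buf = io.StringIO()
df.write_json(row_buf, row_oriented=True)
print(row_buf.getvalue())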

Installed versions

```
--------Version info---------
Polars:               0.20.16
Index type:           UInt32
Platform:             macOS-13.6-arm64-arm-64bit
Python:               3.10.3 (v3.10.3:a342a49189, Mar 16 2022, 09:34:18) [Clang 13.0.0 (clang-1300.0.29.30)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:               2024.2.0
gevent:
hvplot:
matplotlib:
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.2.0
pyarrow:              15.0.1
pydantic:             2.6.1
pyiceberg:
pyxlsb:
sqlalchemy:           1.4.51
xlsx2csv:
xlsxwriter:           3.2.0
```
ritchie46 commented 7 months ago

Can you make a reproducible example? I don't have the parquet file you used. Ideally your example doesn't include any files, but creates an example from memory.

reswqa commented 7 months ago

I think the MRE can be:

import io
import polars as pl

df = pl.DataFrame({"a": [b"123", b"abc"]})
buf = io.StringIO()
df.write_json(buf, row_oriented=True)

Is there any reason why we wouldn't want to implement this for Binary here?

https://github.com/pola-rs/polars/blob/758b55a58010b45c4b4e06ee500d1e8b16cba547/crates/polars-json/src/json/write/serialize.rs#L385-L390
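Until that is implemented, one possible workaround (just a sketch, and only valid when the binary payloads are UTF-8 text or a hex/base64 representation is acceptable) is to cast or encode the Binary columns before the row-oriented write:

import io
import polars as pl

df = pl.DataFrame({"a": [b"123", b"abc"]})

buf = io.StringIO()
# Cast every Binary column to String (valid-UTF-8 case); for arbitrary bytes,
# pl.col(pl.Binary).bin.encode("hex") keeps the data representable in JSON.
df.with_columns(pl.col(pl.Binary).cast(pl.String)).write_json(buf, row_oriented=True)
print(buf.getvalue())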

bobir01 commented 7 months ago

Hi @ritchie46, here is the full reproducible code:

from tempfile import NamedTemporaryFile
from pprint import pprint
import polars as pl
from polars import selectors as sc
import requests
from pathlib import Path

base_dir = Path(__file__).parent

def get_pl_table(file: NamedTemporaryFile) -> pl.DataFrame:
    df = pl.scan_parquet(file.name).select(
        sc.by_dtype(pl.Binary).cast(pl.String),
        sc.all().exclude(pl.Binary)
    ).collect()
    # let's print the schema of the dataframe
    pprint(df.schema)
    # ERROR-prone code -> when enabling row_oriented=True, the code will fail
    # for other cases, it will work fine
    df.write_json(base_dir / 'tmp_sprt.json', row_oriented=True)
    file.close()
    return df

def get_parquet_file() -> NamedTemporaryFile:
    base_url = 'https://cloud-api.yandex.net/v1/disk/public/resources/download?public_key=https://disk.yandex.com/d/bAhxN41-pwdGwA'
    response = requests.get(base_url)
    download_url = response.json()['href']
    response = requests.get(download_url)
    file = NamedTemporaryFile()
    file.write(response.content)
    file.flush()  # make sure the bytes hit disk before scan_parquet opens the file by name
    return file

def main():
    file = get_parquet_file()
    get_pl_table(file)

if __name__ == '__main__':
    main()

Please note this URL is safe and hosted on my own cloud storage. I believe the problem is on the Rust side, because it fails only when row_oriented=True is enabled.
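For what it's worth, a file-free variant along the lines of the MRE above (my assumption being that the download itself is irrelevant and that any Binary column reaching write_json with row_oriented=True triggers the panic):

from tempfile import NamedTemporaryFile
import polars as pl

# Round-trip a Binary column through Parquet, then attempt the row-oriented JSON write.
src = pl.DataFrame({"payload": [b"123", b"abc"], "id": [1, 2]})
with NamedTemporaryFile(suffix=".parquet") as tmp:
    src.write_parquet(tmp.name)
    df = pl.scan_parquet(tmp.name).collect()
    # Expected to panic with "not yet implemented: Writing BinaryView to JSON" on 0.20.16.
    df.write_json("payload_rows.json", row_oriented=True)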