Open bobir01 opened 7 months ago
Can you make a reproducible example? I don't have the parquet file you used. Ideally your example doesn't include any files, but creates an example from memory.
I think the MRE can be:
import io
import polars as pl
df = pl.DataFrame({"a": [b"123", b"abc"]})
buf = io.StringIO()
df.write_json(buf, row_oriented=True)
Is there any reason why we wouldn't want to implement this for Binary here?
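To picture what Binary support would involve: JSON has no byte-string type, so binary values need a text encoding before they can be written. The sketch below uses only the Python standard library and assumes base64 as the encoding. `rows_to_json` is a hypothetical helper, not part of polars, and polars may well choose a different representation.

```python
import base64
import json


def rows_to_json(rows):
    """Serialize a list of row dicts, base64-encoding any bytes values.

    Hypothetical helper for illustration; not polars' implementation.
    """
    def encode(value):
        # JSON cannot hold raw bytes, so turn them into ASCII text first.
        if isinstance(value, (bytes, bytearray)):
            return base64.b64encode(value).decode("ascii")
        return value

    return json.dumps([{k: encode(v) for k, v in row.items()} for row in rows])


rows = [{"a": b"123"}, {"a": b"abc"}]
print(rows_to_json(rows))  # prints [{"a": "MTIz"}, {"a": "YWJj"}]
```

Base64 is only one option; hex or a lossy UTF-8 decode would also work, each with a different trade-off between size and readability.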
Hi @ritchie46, here is the full reproducible code:
from tempfile import NamedTemporaryFile
from pprint import pprint
import polars as pl
from polars import selectors as sc
import requests
from pathlib import Path

base_dir = Path(__file__).parent


def get_pl_table(file: NamedTemporaryFile) -> pl.DataFrame:
    df = pl.scan_parquet(file.name).select(
        sc.by_dtype(pl.Binary).cast(pl.String),
        sc.all().exclude(pl.Binary)
    ).collect()
    # let's print the schema of the dataframe
    pprint(df.schema)
    # ERROR-prone code -> when enabling row_oriented=True, the code will fail
    # for other cases, it will work fine
    df.write_json(base_dir / 'tmp_sprt.json', row_oriented=True)
    file.close()
    return df


def get_parquet_file() -> NamedTemporaryFile:
    base_url = 'https://cloud-api.yandex.net/v1/disk/public/resources/download?public_key=https://disk.yandex.com/d/bAhxN41-pwdGwA'
    response = requests.get(base_url)
    download_url = response.json()['href']
    response = requests.get(download_url)
    file = NamedTemporaryFile()
    file.write(response.content)
    return file


def main():
    file = get_parquet_file()
    get_pl_table(file)


if __name__ == '__main__':
    main()
Please note that this URL is safe and points to my own cloud storage. I believe the problem is on the Rust side, because it fails only when row_oriented=True is enabled.
Checks
Reproducible example
Log output
Issue description
When I enable the row_oriented behavior, it raises a NotImplemented exception, but it works fine when I disable this feature. I am trying to convert a parquet file into JSON; the default JSON output without row_oriented=True is around 17 MB.
Expected behavior
It should output row-oriented JSON.
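As a rough picture of the two layouts being discussed, the sketch below builds both from the same columns in plain Python. The shapes are simplified for illustration and are not polars' exact output; `to_row_oriented` is a hypothetical helper, and the column names are made up.

```python
import json


def to_row_oriented(columns):
    """Turn {"col": [values...]} into [{"col": value}, ...].

    Hypothetical helper; assumes all columns have the same length.
    """
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


columns = {"id": [1, 2], "name": ["x", "y"]}

# Column-oriented: one list of values per column.
print(json.dumps(columns))
# Row-oriented: one object per row.
print(json.dumps(to_row_oriented(columns)))
```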
Installed versions