moj-analytical-services / mojap-arrow-pd-parser

Conforms pandas to "correct" datatypes to ensure data in/out using CSV, JSONL and Parquet is read the same (using arrow).
MIT License
9 stars, 1 fork

Integer chunksize value only works with JSONL strings #95

Open maliksMOJ opened 1 year ago

maliksMOJ commented 1 year ago

arrow-pd-parser should support two value types for the chunksize variable: a string denoting a memory allocation size (e.g. 1GB), or an integer specifying how many rows go in each chunk. However, when an integer is given, the reader only splits data successfully from a JSONL (line-delimited) file. I was unable to chunk a comma-delimited JSON file.

mratford commented 1 year ago

This issue won't be easily solved, as pandas and awswrangler only support chunking for line-delimited JSON files.
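
For reference, a minimal sketch of the pandas side of this limitation. The sample data is made up for illustration, and the exact error text may vary between pandas versions, so the sketch only records whether an error was raised:

```python
import io

import pandas as pd

# Hypothetical sample data: the same three records in both layouts.
jsonl_text = '{"a": 1}\n{"a": 2}\n{"a": 3}\n'
array_text = '[{"a": 1}, {"a": 2}, {"a": 3}]'

# Line-delimited JSON: with lines=True, chunksize yields DataFrames lazily.
reader = pd.read_json(io.StringIO(jsonl_text), lines=True, chunksize=2)
chunk_lengths = [len(chunk) for chunk in reader]

# Comma-delimited JSON: pandas refuses chunksize unless lines=True.
try:
    pd.read_json(io.StringIO(array_text), chunksize=2)
    array_chunking_error = None
except ValueError as exc:
    array_chunking_error = str(exc)
```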

We could possibly use smart_open.open and readline? The parsing might get tricky if a JSON record spans several lines.
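
A rough sketch of how that could look, using only the standard library rather than anything from arrow-pd-parser. It reads fixed-size blocks instead of calling readline, so records that span lines need no special handling. The function names iter_json_records and iter_chunks are hypothetical, and the sketch assumes every record is a JSON object (not a bare number, string or nested array). Any text-mode file-like object should work as input, so a handle from smart_open.open could presumably replace the io.StringIO used here:

```python
import io
import json
from itertools import islice


def iter_json_records(fileobj, read_size=65536):
    """Yield JSON objects one by one from a text stream that holds a JSON
    array (or comma-separated objects), without loading the whole file."""
    decoder = json.JSONDecoder()
    buffer = ""
    eof = False
    while True:
        # Drop whitespace, record separators and the array's opening bracket.
        buffer = buffer.lstrip(" \t\r\n,[")
        if buffer.startswith("]"):
            return  # end of the top-level array
        if buffer:
            try:
                record, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                if eof:
                    raise  # incomplete or invalid record at end of input
            else:
                yield record
                buffer = buffer[end:]
                continue
        elif eof:
            return
        # The buffer holds no complete record yet, so read more text.
        data = fileobj.read(read_size)
        if data:
            buffer += data
        else:
            eof = True


def iter_chunks(records, chunksize):
    """Group an iterable of records into lists of at most chunksize items."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk


# Hypothetical input: records separated by commas and spread across lines.
text = '[\n  {"id": 1,\n   "name": "a"},\n  {"id": 2, "name": "b"},\n  {"id": 3, "name": "c"}\n]\n'
stream = io.StringIO(text)
chunks = list(iter_chunks(iter_json_records(stream, read_size=8), chunksize=2))
```

Each chunk is a plain list of dicts, so it could be passed to pd.DataFrame before any type casting. Retrying the decode after every read is simple but repeats work for very large records; a streaming JSON parser would avoid that at the cost of an extra dependency.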