[Open] CrashLaker opened this issue 1 year ago
We are currently not planning to allow multiple schemas as input to Read API methods. One possible workaround is to first read the files grouped by schema, use map
functions to transform the datasets so they share a common schema, then use union
to combine them into a single Dataset.
What happened + What you expected to happen
Hi all,
I'm trying to read a folder whose contents have slight schema variations.
I expected reading the files in
./dataset/data{0..N}.json
to work.
I'm getting an error of the sort:
(DoRead pid=44337) pyarrow.lib.ArrowInvalid: Unable to merge: Field Records has incompatible types: struct<a: int64, b: string> vs struct<a: int64, b: int64> [repeated 5x across cluster]
I also can't seem to be able to force the schema with
explicit_schema
which fails with the error:
(DoRead pid=45091) pyarrow.lib.ArrowInvalid: JSON parse error: Column(/Records/b) changed from string to number in row 0
Is there any workaround for this, or a way to cast after reading?
Thank you.
Regards, C.
Versions / Dependencies
uname -a
Linux ip-172-31-33-24.sa-east-1.compute.internal 6.1.27-43.48.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue May 2 04:53:36 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
python3 --version
Python 3.9.16
pip3 freeze | grep -E "ray|pandas|pyarrow"
pandas==2.0.2
pyarrow==12.0.0
ray==2.4.0
Reproduction script
gen dataset
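The original generation script was not preserved. A plausible stand-in that produces files like ./dataset/data{0..N}.json with the schema drift described above (field b alternating between string and integer) is:

```python
import json
import os

os.makedirs("dataset", exist_ok=True)

# Write N files whose "Records.b" field alternates between a string and
# an integer -- the schema drift described in the report. The exact
# values and N are illustrative.
N = 5
for i in range(N):
    b = str(i) if i % 2 == 0 else i  # string in even files, int in odd files
    record = {"Records": {"a": i, "b": b}}
    with open(f"dataset/data{i}.json", "w") as f:
        f.write(json.dumps(record) + "\n")
```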
read folder
read folder with explicit_schema
Issue Severity
High: It blocks me from completing my task.