Closed by Night-Quiet 1 year ago
This is the actual error:
Failed to read file '/home/lakala/hjc/code/pycode/glm/temp.json' with error <class 'pyarrow.lib.ArrowInvalid'>: cannot mix list and non-list, non-null values
Which means some samples are incorrectly formatted.
PyArrow, the storage backend we use under the hood, requires that all list elements have the same level of nesting (the same number of dimensions) or are None:
import pyarrow as pa
pa.array([[1, 2, 3], 2]) # ArrowInvalid: cannot mix list and non-list, non-null values
pa.array([[1, 2, 3], [2]]) # works
@mariosasko I used the same operation to check the original data before and after slicing; this is reflected in my code. 160000 is not a specific number: I can also get output using 150000. This doesn't seem to align with what you said, because if only some samples were incorrectly formatted, then at least one of the two slices should also raise the error. Thank you for your reply.
Our JSON loader does the following in your case:
import json
import pyarrow as pa
with open(file, encoding="utf-8") as f:
    dataset = json.load(f)
keys = set().union(*[row.keys() for row in dataset])
mapping = {col: [row.get(col) for row in dataset] for col in keys}
pa_table = pa.Table.from_pydict(mapping) # the ArrowInvalid error comes from here
So if this code throws an error with correctly-formatted JSON, then this is an Arrow bug and should be reported in their repo.
You should shuffle the data to make sure that's not the case.
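The suggestion above can be sketched as follows (file paths are placeholders): shuffling before slicing distributes any malformed samples across both halves, so if only some samples are bad, the failing half is no longer determined by the original order:

```python
import json
import random


def shuffle_and_split(in_path, out_a, out_b, seed=0):
    """Shuffle the rows, then write the two halves to separate files.
    After shuffling, a bad sample is equally likely to land in either half."""
    with open(in_path, encoding="utf-8") as f:
        dataset = json.load(f)
    random.Random(seed).shuffle(dataset)
    half = len(dataset) // 2
    for path, part in ((out_a, dataset[:half]), (out_b, dataset[half:])):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(part, f, ensure_ascii=False)
```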
@mariosasko Thank you. I will try again.
Describe the bug
I am using `load_dataset` to load a JSON file, and I found a strange bug: an error is reported when the length of the JSON file exceeds 160000 records (the exact number is uncertain). I have checked the data with the following code and found no issues, so I cannot determine the true cause of this error.
The data is a list containing a dictionary. As follows:
[ {'input': 'something...', 'target': 'something...', 'type': 'something...', 'history': ['something...', ...]}, ... ]
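Given this layout, the "cannot mix list and non-list" error suggests that in some row a field that is usually a list (most likely `history`) holds a non-list value. A quick hedged check along those lines, using the field names from the sample above:

```python
def check_history_field(rows):
    """Return indices of rows whose 'history' is neither a list nor None."""
    return [
        i for i, row in enumerate(rows)
        if not isinstance(row.get("history"), (list, type(None)))
    ]


rows = [
    {"input": "a", "target": "b", "type": "c", "history": ["h1"]},
    {"input": "a", "target": "b", "type": "c", "history": "h1"},  # malformed
]
print(check_history_field(rows))  # [1]
```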
Steps to reproduce the bug
Expected behavior
Environment info