vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

Error reading json file #2263

Open sl2902 opened 1 year ago

sl2902 commented 1 year ago

I was trying to read a sample newline-delimited JSON file (374 MB) using the call below, but it failed:

wiki = vaex.from_json('../data/wikidata-20220926-all-ichunk_0.json', orient='table', copy_index=False)

I have no trouble reading this file with the built-in open() function and yielding one JSON object at a time in Python. The vaex call, however, raises:

TypeError                                 Traceback (most recent call last)
Cell In [11], line 1
----> 1 s = vaex.from_json('../data/wikidata-20220926-all-ichunk_0.json', orient='table', copy_index=False)

File ~/.local/share/virtualenvs/14797-jD_mL20P/lib/python3.8/site-packages/vaex/__init__.py:508, in from_json(path_or_buffer, orient, precise_float, lines, copy_index, **kwargs)
    505     raise ValueError('`chunksize` must be `None`.')
    507 import pandas as pd
--> 508 return from_pandas(pd.read_json(path_or_buffer, orient=orient, precise_float=precise_float, lines=lines, **kwargs),
    509                    copy_index=copy_index)

File ~/.local/share/virtualenvs/14797-jD_mL20P/lib/python3.8/site-packages/pandas/util/_decorators.py:211, in deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
    209     else:
    210         kwargs[new_arg_name] = new_arg_value
--> 211 return func(*args, **kwargs)

File ~/.local/share/virtualenvs/14797-jD_mL20P/lib/python3.8/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File ~/.local/share/virtualenvs/14797-jD_mL20P/lib/python3.8/site-packages/pandas/io/json/_json.py:757, in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, encoding_errors, lines, chunksize, compression, nrows, storage_options)
    754     return json_reader
    756 with json_reader:
--> 757     return json_reader.read()

File ~/.local/share/virtualenvs/14797-jD_mL20P/lib/python3.8/site-packages/pandas/io/json/_json.py:915, in JsonReader.read(self)
    913         obj = self._get_object_parser(self._combine_lines(data_lines))
    914 else:
--> 915     obj = self._get_object_parser(self.data)
    916 self.close()
    917 return obj

File ~/.local/share/virtualenvs/14797-jD_mL20P/lib/python3.8/site-packages/pandas/io/json/_json.py:937, in JsonReader._get_object_parser(self, json)
    935 obj = None
    936 if typ == "frame":
--> 937     obj = FrameParser(json, **kwargs).parse()
    939 if typ == "series" or obj is None:
    940     if not isinstance(dtype, bool):

File ~/.local/share/virtualenvs/14797-jD_mL20P/lib/python3.8/site-packages/pandas/io/json/_json.py:1064, in Parser.parse(self)
   1062     self._parse_numpy()
   1063 else:
-> 1064     self._parse_no_numpy()
   1066 if self.obj is None:
   1067     return None

File ~/.local/share/virtualenvs/14797-jD_mL20P/lib/python3.8/site-packages/pandas/io/json/_json.py:1337, in FrameParser._parse_no_numpy(self)
   1331     self.obj = DataFrame.from_dict(
   1332         loads(json, precise_float=self.precise_float),
   1333         dtype=None,
   1334         orient="index",
   1335     )
   1336 elif orient == "table":
-> 1337     self.obj = parse_table_schema(json, precise_float=self.precise_float)
   1338 else:
   1339     self.obj = DataFrame(
   1340         loads(json, precise_float=self.precise_float), dtype=None
   1341     )

File ~/.local/share/virtualenvs/14797-jD_mL20P/lib/python3.8/site-packages/pandas/io/json/_table_schema.py:351, in parse_table_schema(json, precise_float)
    315 """
    316 Builds a DataFrame from a given schema
    317 
   (...)
    348 pandas.read_json
    349 """
    350 table = loads(json, precise_float=precise_float)
--> 351 col_order = [field["name"] for field in table["schema"]["fields"]]
    352 df = DataFrame(table["data"], columns=col_order)[col_order]
    354 dtypes = {
    355     field["name"]: convert_json_field_to_pandas_type(field)
    356     for field in table["schema"]["fields"]
    357 }

TypeError: list indices must be integers or slices, not str

Any idea why it is failing?

Ideally, I would like to read a compressed json.bz2 file using the open() API and convert it to an Arrow file. Is this possible?

JovanVeljanoski commented 1 year ago

Without access to the actual file it is hard to say. Perhaps you need to choose a different option for the orient keyword?
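
For illustration, the last frame of the traceback indexes the parsed document with table["schema"]["fields"], so orient='table' expects a top-level JSON object with "schema" and "data" keys. The TypeError suggests that a list turned up where an object was expected. The sketch below reproduces both cases with the standard library only; the sample documents are made up and are not taken from the reporter's file.

```python
import json

# Shape that orient='table' expects, per parse_table_schema in the traceback:
# a top-level object with "schema" and "data" keys. (Invented sample data.)
table_doc = json.loads(
    '{"schema": {"fields": [{"name": "id", "type": "string"}]},'
    ' "data": [{"id": "Q1"}]}'
)
col_order = [field["name"] for field in table_doc["schema"]["fields"]]
print(col_order)

# Hypothetical stand-in for the failing input: a top-level JSON array.
array_doc = json.loads('[{"id": "Q1"}, {"id": "Q2"}]')
try:
    array_doc["schema"]  # same lookup that parse_table_schema performs
except TypeError as exc:
    message = str(exc)
    print(message)
```

If the file really is newline-delimited, passing lines=True (from_json exposes that parameter, as the signature in the traceback shows) with a record-style orient is probably the first thing to try.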

Yes, I think it should also be possible to read a compressed JSON file. Vaex uses pandas under the hood to read JSON files, so if pandas can do it (and I am quite sure it can), Vaex can too.
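
As a rough sketch of the open()-style approach mentioned in the question, the standard library can stream a .json.bz2 file line by line and group the records into chunks. The helper names iter_records and iter_chunks are hypothetical, and the handling of "[", "]" and trailing commas is an assumption about array-style dumps, not something confirmed for this file.

```python
import bz2
import json
from itertools import islice


def iter_records(path):
    """Yield one parsed object per non-empty line of a bz2-compressed JSON file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            # Assumption: tolerate array-style dumps by skipping the bracket
            # lines and dropping a trailing comma after each record.
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line)


def iter_chunks(records, size):
    """Group an iterable of records into lists of at most `size` items."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Each chunk could then be turned into a pandas DataFrame, passed to vaex.from_pandas, and exported (vaex DataFrames have an export_arrow method), though that part is untested here. Alternatively, since from_json forwards extra keyword arguments to pandas.read_json, something like vaex.from_json(path, lines=True, compression='bz2') may work directly for a plain newline-delimited file that fits in memory.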