mbzuai-oryx / GeoChat

[CVPR 2024 🔥] GeoChat, the first grounded Large Vision Language Model for Remote Sensing
https://mbzuai-oryx.github.io/GeoChat

HF dataset not working #15

Closed: DonggeunYu closed this issue 3 months ago

DonggeunYu commented 3 months ago
from datasets import load_dataset

dataset = load_dataset("MBZUAI/GeoChat_Instruct", split="train", streaming=True)
print(next(iter(dataset)))
root@donggeun-selfsup-747b74575d-sj9n6:/nas/k8s/dev/mlops/donggeun/tools/hf_dataset# python3 test.py
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/json/json.py", line 121, in _generate_tables
    pa_table = paj.read_json(
  File "pyarrow/_json.pyx", line 308, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Column() changed from object to array in row 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/nas/k8s/dev/mlops/donggeun/tools/hf_dataset/test.py", line 4, in <module>
    print(next(iter(dataset)))
  File "/usr/local/lib/python3.10/dist-packages/datasets/iterable_dataset.py", line 1384, in __iter__
    for key, example in ex_iterable:
  File "/usr/local/lib/python3.10/dist-packages/datasets/iterable_dataset.py", line 282, in __iter__
    for key, pa_table in self.generate_tables_fn(**self.kwargs):
  File "/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/json/json.py", line 153, in _generate_tables
    pa_table = pa.Table.from_pydict(mapping)
  File "pyarrow/table.pxi", line 1813, in pyarrow.lib._Tabular.from_pydict
  File "pyarrow/table.pxi", line 5339, in pyarrow.lib._from_pydict
  File "pyarrow/array.pxi", line 374, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 344, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object

KjAeRsTuIsK commented 3 months ago

Hi @DonggeunYu, can you please try cloning the dataset from Hugging Face and reading the JSON file directly? There was an issue when uploading the JSON file through the HF wrapper, so I had to upload it by drag and drop. Let me know if you still face any problems.
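
For anyone following along, here is a minimal sketch of that workaround: read the JSON file from a local clone with the standard `json` module instead of `load_dataset`. The function names and the file path in the comments are hypothetical, and the sketch assumes the file is a top-level JSON list of records (or a single object). The field summary is only a guess at a useful diagnostic: a field holding both `str` and `int` values may be what produces the `ArrowTypeError` in the traceback above.

```python
import json
from pathlib import Path

# Assumed setup (not run here): clone the dataset repo locally, e.g.
#   git clone https://huggingface.co/datasets/MBZUAI/GeoChat_Instruct
# (git-lfs is typically needed for large files).


def load_instruct_json(path):
    """Read one JSON file from the local clone and return a list of records."""
    with Path(path).open("r", encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: the file is either a list of records or a single object.
    return data if isinstance(data, list) else [data]


def summarize_fields(records):
    """Map each field name to the set of Python type names seen for it."""
    seen = {}
    for record in records:
        if isinstance(record, dict):
            for key, value in record.items():
                seen.setdefault(key, set()).add(type(value).__name__)
    return seen


# Hypothetical usage; replace the placeholder with the actual JSON file name:
# records = load_instruct_json("GeoChat_Instruct/<instruct_file>.json")
# print(len(records), summarize_fields(records))
```

Plain `json.load` does not require a uniform type per field, so it can read files that the Arrow-based JSON loader rejects.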

DonggeunYu commented 3 months ago

@KjAeRsTuIsK How do I open the image data (images_partaa, ...)?

KjAeRsTuIsK commented 3 months ago

Please check this Data.md
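
Data.md is the authoritative reference for the actual steps. As a rough sketch only: the `aa` suffix resembles the default naming of the Unix `split` tool, so the parts are assumed here to be consecutive pieces of one archive that can be rebuilt by concatenating them in name order. The `join_parts` helper, the `images_part` prefix, and the output file name are hypothetical, and the archive format is not known from this thread.

```python
import shutil
from pathlib import Path


def join_parts(parts_dir, prefix, out_path):
    """Concatenate files whose names start with `prefix`, in name order."""
    # Collect the parts before opening the output file, so the output is
    # never picked up as one of its own inputs.
    parts = sorted(
        (p for p in Path(parts_dir).iterdir()
         if p.is_file() and p.name.startswith(prefix)),
        key=lambda p: p.name,
    )
    if not parts:
        raise FileNotFoundError(f"no files starting with {prefix!r} in {parts_dir}")
    with open(out_path, "wb") as out:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)  # stream; parts may be large
    return [p.name for p in parts]


# Hypothetical usage; the output name and format are assumptions:
# join_parts("GeoChat_Instruct", "images_part", "images_joined.bin")
```

Once joined, the file can be opened with whichever archive tool Data.md specifies.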

DonggeunYu commented 3 months ago

> Please check this Data.md

Nice! Thank you~