labbots / firestore-export-json

Convert google firestore/datastore LevelDB exports to JSON data.
MIT License
90 stars, 23 forks

Decoding issue #18

Open jaredforth opened 2 months ago

jaredforth commented 2 months ago

Hello,

I'm running python fs_to_json.py ../2024-08-27T13:49:35_42810/all_namespaces/all_kinds/ out, where ../2024-08-27T13:49:35_42810/all_namespaces/all_kinds/ contains:

├── all_namespaces_all_kinds.export_metadata
├── output-0
├── output-1
├── output-10
├── output-11
├── output-12
├── output-13
├── output-14
├── output-15
├── output-16
├── output-17
├── output-18
├── output-19
├── output-2
├── output-3
├── output-4
├── output-5
├── output-6
├── output-7
├── output-8
└── output-9

1 directory, 21 files

The conversion worked for 9 files but created 12 empty JSON files. Is this a known bug, and are there any tips on how to resolve it? The run ends with the following traceback:

Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.12/3.12.5/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.5/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
           ^^^^^^^^^^^^^^^^
  File "/Users/jaredforth/Downloads/firestore-export-json/converter/command.py", line 145, in process_file
    json.dumps(json_tree, default=serialize_json, ensure_ascii=False, indent=2)
  File "/opt/homebrew/Cellar/python@3.12/3.12.5/Frameworks/Python.framework/Versions/3.12/lib/python3.12/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
          ^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.5/Frameworks/Python.framework/Versions/3.12/lib/python3.12/json/encoder.py", line 202, in encode
    chunks = list(chunks)
             ^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.5/Frameworks/Python.framework/Versions/3.12/lib/python3.12/json/encoder.py", line 432, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/opt/homebrew/Cellar/python@3.12/3.12.5/Frameworks/Python.framework/Versions/3.12/lib/python3.12/json/encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "/opt/homebrew/Cellar/python@3.12/3.12.5/Frameworks/Python.framework/Versions/3.12/lib/python3.12/json/encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "/opt/homebrew/Cellar/python@3.12/3.12.5/Frameworks/Python.framework/Versions/3.12/lib/python3.12/json/encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "/opt/homebrew/Cellar/python@3.12/3.12.5/Frameworks/Python.framework/Versions/3.12/lib/python3.12/json/encoder.py", line 326, in _iterencode_list
    yield from chunks
  File "/opt/homebrew/Cellar/python@3.12/3.12.5/Frameworks/Python.framework/Versions/3.12/lib/python3.12/json/encoder.py", line 439, in _iterencode
    o = _default(o)
        ^^^^^^^^^^^
  File "/Users/jaredforth/Downloads/firestore-export-json/converter/utils.py", line 77, in serialize_json
    return str(obj)
           ^^^^^^^^
  File "/Users/jaredforth/Downloads/firestore-export-json/venv/lib/python3.12/site-packages/google/appengine/api/datastore_types.py", line 1227, in __str__
    return self.decode('utf-8')
           ^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xeb in position 24: invalid continuation byte
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/jaredforth/Downloads/firestore-export-json/fs_to_json.py", line 14, in <module>
    main()
  File "/Users/jaredforth/Downloads/firestore-export-json/fs_to_json.py", line 10, in main
    command.main(args=args)
  File "/Users/jaredforth/Downloads/firestore-export-json/converter/command.py", line 93, in main
    process_files(
  File "/Users/jaredforth/Downloads/firestore-export-json/converter/command.py", line 113, in process_files
    p.map(f, files)
  File "/opt/homebrew/Cellar/python@3.12/3.12.5/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.5/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/pool.py", line 774, in get
    raise self._value
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xeb in position 24: invalid continuation byte
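
The final UnicodeDecodeError can be reproduced in isolation. The sketch below assumes that the failing value is a bytes subclass whose __str__ calls self.decode('utf-8'), as the datastore_types.py frame in the traceback suggests. The class name RawBytes, the helper naive_default, and the sample payload are made up for illustration and are not part of the project.

```python
import json


class RawBytes(bytes):
    """Hypothetical stand-in for a bytes subclass whose __str__ decodes as UTF-8."""

    def __str__(self):
        return self.decode("utf-8")


def naive_default(obj):
    # Same fallback as the original serialize_json: stringify anything unknown.
    return str(obj)


# 0xeb starts a three-byte UTF-8 sequence, but "d" is not a continuation byte.
payload = RawBytes(b"abc\xebdef")

try:
    json.dumps({"field": payload}, default=naive_default)
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

If that assumption holds, the affected fields hold binary data rather than UTF-8 text, which would explain why only some of the output files fail.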

LeoPraktisk commented 2 months ago

I am having the same issue. I am also having issues when parsing documents that have collections within them.

LeoPraktisk commented 2 months ago

I have been fiddling with it a little and sort of found a solution.

I changed the serialize_json function in utils.py to the following code:

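import calendar
import datetime


def serialize_json(obj):
    try:
        if isinstance(obj, datetime.datetime):
            # Shift timezone-aware datetimes to UTC before converting.
            if obj.utcoffset() is not None:
                obj = obj - obj.utcoffset()
            # Milliseconds since the Unix epoch.
            millis = int(calendar.timegm(obj.timetuple()) * 1000 + obj.microsecond / 1000)
            return millis
        return str(obj)
    except UnicodeDecodeError:
        # str() failed on a bytes value that is not valid UTF-8;
        # decode it anyway and drop the bytes that cannot be decoded.
        return obj.decode("utf-8", errors="ignore")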

The old code looked like this:

def serialize_json(obj):
    if isinstance(obj, datetime.datetime):
        if obj.utcoffset() is not None:
            obj = obj - obj.utcoffset()
        millis = int(calendar.timegm(obj.timetuple()) * 1000 + obj.microsecond / 1000)
        return millis
    return str(obj)

This fixes the error, or rather bypasses the problem, but there are still parts that don't work. When it runs over a list of objects, it seems unable to parse the list properly, and the list becomes a jumble of Unicode escape sequences.
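
One caveat with errors="ignore" is that it silently drops every byte that is not valid UTF-8, so binary values cannot be recovered from the JSON afterwards. A minimal sketch of one alternative is shown below: decode bytes as text when possible and fall back to Base64 otherwise. The function name serialize_json_lossless and the "__bytes_base64__" key are hypothetical and not part of the project, and this is not claimed to fix the list-of-objects output.

```python
import base64
import calendar
import datetime
import json


def serialize_json_lossless(obj):
    """Hypothetical variant of serialize_json that keeps undecodable bytes."""
    if isinstance(obj, datetime.datetime):
        # Same timestamp handling as the original function.
        if obj.utcoffset() is not None:
            obj = obj - obj.utcoffset()
        return int(calendar.timegm(obj.timetuple()) * 1000 + obj.microsecond / 1000)
    if isinstance(obj, (bytes, bytearray)):
        try:
            return bytes(obj).decode("utf-8")
        except UnicodeDecodeError:
            # Not text: keep every byte by encoding the value as Base64.
            return {"__bytes_base64__": base64.b64encode(bytes(obj)).decode("ascii")}
    return str(obj)


raw = b"abc\xebdef"

# What the workaround does: the 0xeb byte is silently discarded.
ignored = raw.decode("utf-8", errors="ignore")

# What the sketch does: the value survives a round trip through JSON.
encoded = json.dumps({"blob": raw, "text": b"hello"}, default=serialize_json_lossless)
decoded = json.loads(encoded)
restored = base64.b64decode(decoded["blob"]["__bytes_base64__"])

print(ignored)
print(restored == raw)
```

The trade-off is that consumers of the JSON have to recognise the wrapper object and decode it themselves.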

I don't know if this helps, but it will allow you to run the program without any errors.