mohamed-chs / chatgpt-history-export-to-md

A script to effortlessly extract your entire ChatGPT data export from JSON files to nicely-formatted markdown files.
MIT License

TypeError in ConversationSet.from_zip: Unexpected Dictionary Instead of String #29

Open trentleslie opened 1 year ago

trentleslie commented 1 year ago

Describe the bug

When attempting to load a conversation set using Convoviz, a pydantic_core._pydantic_core.ValidationError is raised. The error comes from a type mismatch in the ConversationSet model: a string is expected, but a dictionary is received in message.content.parts. This prevents the data from being loaded and processed.

To Reproduce

Steps to reproduce the behavior:

1. Use Convoviz to load a conversation set from a .zip file using ConversationSet.from_zip.
2. Encounter a ValidationError during the process.
3. The error details point to multiple instances in array where message.content.parts expects a string but receives a dictionary.

Expected behavior

The expected behavior is for Convoviz to successfully load the conversation set without any type mismatch errors. The ConversationSet model should correctly handle the data structure provided in the .zip file.

OS: Windows 10

Additional context

This issue seems to stem from a mismatch between the expected data format in the Pydantic model and the actual data structure being processed. It might require either adjusting the data format or modifying the Pydantic model to align with the actual data structure.


yb66 commented 1 year ago

That looks similar to the errors I got running the project just now:

$ python -m convoviz
Welcome to ChatGPT Data Visualizer ✨📊!

Follow the instructions in the command line.

Press 'ENTER' to select the default options.

If you encounter any issues 🐛, please report 🚨 them here:

➡️ https://github.com/mohamed-chs/chatgpt-history-export-to-md/issues/new/choose 🔗

? Enter the path to the zip file : /Users/$USER/Downloads/$MYCHATGPTDATA.zip
? Enter the path to the output folder : /Users/$USER/Documents/ChatGPT Data
? Enter the message header (#) for messages from 'system' : ### System
? Enter the message header (#) for messages from 'user' : # Me
? Enter the message header (#) for messages from 'assistant' : # ChatGPT
? Enter the message header (#) for messages from 'tool' : ### Tool output
? Select the LaTeX math delimiters you want to use : default
? Select the YAML metadata headers you want to include : done (9 selections)
? Select the font you want to use for the word clouds : RobotoSlab-Thin
? Select the color theme you want to use for the word clouds : prism
? Enter custom stopwords (separated by commas) : use, file,

So that's all defaults…

And we're off! 🚀🚀🚀

Loading data 📂 ...

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/$USER/Library/Python/3.11/lib/python/site-packages/convoviz/__main__.py", line 5, in <module>
    main()
  File "/Users/$USER/Library/Python/3.11/lib/python/site-packages/convoviz/cli.py", line 38, in main
    entire_collection = ConversationSet.from_zip(user.configs["zip_filepath"])
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/$USER/Library/Python/3.11/lib/python/site-packages/convoviz/models/_conversation_set.py", line 55, in from_zip
    return cls.from_json(convos_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/$USER/Library/Python/3.11/lib/python/site-packages/convoviz/models/_conversation_set.py", line 47, in from_json
    return cls(array=loads(file.read()))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/$USER/Library/Python/3.11/lib/python/site-packages/pydantic/main.py", line 164, in __init__
    __pydantic_self__.__pydantic_validator__.validate_python(data, self_instance=__pydantic_self__)
pydantic_core._pydantic_core.ValidationError: 6 validation errors for ConversationSet
array.21.mapping.f5cb0d2f-8f68-450e-8833-6ace9ef265fa.message.content.parts.0
  Input should be a valid string [type=string_type, input_value={'content_type': 'image_a... generation metadata'}}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.5/v/string_type
array.21.mapping.aaa22a71-acfd-4819-87c8-cd7b2dafdce7.message.content.parts.0
  Input should be a valid string [type=string_type, input_value={'content_type': 'image_a... None, 'metadata': None}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.5/v/string_type
array.21.mapping.f14959ea-0cf5-437b-b0fb-553588b560ab.message.content.parts.0
  Input should be a valid string [type=string_type, input_value={'content_type': 'image_a... generation metadata'}}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.5/v/string_type
array.21.mapping.172f3f20-a25b-4e7a-a1e6-db63656120e4.message.content.parts.0
  Input should be a valid string [type=string_type, input_value={'content_type': 'image_a... generation metadata'}}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.5/v/string_type
array.47.mapping.b01c89d0-e53d-4d42-8f6b-8070c865f7bd.message.content.parts.0
  Input should be a valid string [type=string_type, input_value={'content_type': 'image_a... None, 'metadata': None}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.5/v/string_type
array.78.mapping.aaa25b4c-1884-4a9e-a936-36f0efdb906d.message.content.parts.0
  Input should be a valid string [type=string_type, input_value={'content_type': 'image_a... None, 'metadata': None}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.5/v/string_type

I used the project around 2 months ago with no problem, so I tried ChatGPT data from that time, and today's version of the code parses it fine. That suggests the ChatGPT data format has changed since then. I had a quick look at some of the old data compared to the new but haven't found a meaningful structural difference (yet).
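A rough way to compare an old export with a new one is to count what types actually appear under message.content.parts. The sketch below assumes the layout implied by the error paths above (a list of conversations, each with a mapping of nodes that may carry a message); the function name and the sample data are made up for illustration, and in practice the list would come from `json.load` on the conversations file inside the export.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def count_part_types(conversations: Iterable[Mapping[str, Any]]) -> Counter:
    """Count the Python type names of every entry found in message.content.parts."""
    counts: Counter = Counter()
    for convo in conversations:
        for node in (convo.get("mapping") or {}).values():
            # Assumption: a node's message (or its content/parts) may be missing or null.
            message = node.get("message") or {}
            content = message.get("content") or {}
            for part in content.get("parts") or []:
                counts[type(part).__name__] += 1
    return counts


# Hypothetical stand-in for the parsed export, shaped like the error paths above.
sample = [
    {
        "mapping": {
            "a": {"message": {"content": {"content_type": "text", "parts": ["hello"]}}},
            "b": {"message": None},
            "c": {
                "message": {
                    "content": {
                        "content_type": "multimodal_text",
                        "parts": [{"content_type": "image_asset_pointer"}, "Some of the lines..."],
                    }
                }
            },
        }
    }
]

part_types = count_part_types(sample)
print(part_types)
```

If an older export reports only `str` while a newer one also reports `dict`, that would be consistent with the validation errors above.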

Regards, iain

yb66 commented 1 year ago

I might have found the problem (for me anyway, and maybe there are others). In short, I think the models need updating.

I saw issue #28 and tried removing any DALL-E references (I only had one conversation with those) and re-running convoviz but it still produced errors:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/$USER/Library/Python/3.11/lib/python/site-packages/convoviz/__main__.py", line 5, in <module>
    main()
  File "/Users/$USER/Library/Python/3.11/lib/python/site-packages/convoviz/cli.py", line 38, in main
    entire_collection = ConversationSet.from_zip(user.configs["zip_filepath"])
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/$USER/Library/Python/3.11/lib/python/site-packages/convoviz/models/_conversation_set.py", line 55, in from_zip
    return cls.from_json(convos_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/$USER/Library/Python/3.11/lib/python/site-packages/convoviz/models/_conversation_set.py", line 47, in from_json
    return cls(array=loads(file.read()))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/$USER/Library/Python/3.11/lib/python/site-packages/pydantic/main.py", line 164, in __init__
    __pydantic_self__.__pydantic_validator__.validate_python(data, self_instance=__pydantic_self__)
pydantic_core._pydantic_core.ValidationError: 2 validation errors for ConversationSet
array.46.mapping.b01c89d0-e53d-4d42-8f6b-8070c865f7bd.message.content.parts.0
  Input should be a valid string [type=string_type, input_value={'content_type': 'image_a... None, 'metadata': None}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.5/v/string_type
array.77.mapping.aaa25b4c-1884-4a9e-a936-36f0efdb906d.message.content.parts.0
  Input should be a valid string [type=string_type, input_value={'content_type': 'image_a... None, 'metadata': None}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.5/v/string_type

I validated the JSON and then looked up the entries the errors point to. This is array.77.mapping.aaa25b4c-1884-4a9e-a936-36f0efdb906d.message.content.parts.0, referenced in the errors above:

          "content": {
            "content_type": "multimodal_text",
            "parts": [
              {
                "content_type": "image_asset_pointer",
                "asset_pointer": "file-service://file-y42NrJwm6W1iJtsIDHVJ70TP",
                "size_bytes": 30706,
                "width": 512,
                "height": 292,
                "fovea": null,
                "metadata": null
              },
              "Some of the lines..."
              <SNIP!>

I compared that to other XYZ.content.parts.0 and most of them do not have a dictionary, only a string there. The ones with a dictionary were conversations where I uploaded an image for ChatGPT to analyse.

Perhaps search parts for a string instead of assuming a string sits at index 0? I'm not familiar enough with the code to be sure that's a good idea and whether that would cover the DALL-E problem too.
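As a minimal sketch of that idea (not the project's actual code), the text could be gathered from every string entry in parts, skipping anything that is not a string. The helper name `text_from_parts` is hypothetical, and joining with a newline is an arbitrary choice made here for illustration.

```python
from typing import Any, Optional, Sequence


def text_from_parts(parts: Optional[Sequence[Any]]) -> str:
    """Join the string entries of `parts`, skipping non-string entries such as image pointers."""
    if not parts:
        return ""
    return "\n".join(part for part in parts if isinstance(part, str))


# Shaped like the multimodal_text example above: a dictionary first, then the text.
parts = [
    {"content_type": "image_asset_pointer"},
    "Some of the lines...",
]

text = text_from_parts(parts)
print(text)
```

This only addresses reading the text back out. The model would still need its type declaration relaxed before validation accepts dictionaries in parts, and whether the same approach covers the DALL-E case is not confirmed here.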

Regards, iain

yb66 commented 1 year ago

These look like the two places in the code behind the parts index problem:

Type declaration

parts: list[str] | None = None

Grabbing from index 0

"""Get the text content of the message."""
if self.content.parts is not None:
   return str(self.content.parts[0])

Regards, iain

mmatiaschek commented 8 months ago

I might have the same problem, as commented here: https://github.com/mohamed-chs/chatgpt-history-export-to-md/issues/35#issuecomment-2027052416

Is there a workaround? Thanks!

felipemeres commented 8 months ago

The error is due to the introduction of metadata related to DALL-E image generations. I started working on a new type class for it but haven't had a chance to fully integrate it with all of the functions. In the meantime you can just bypass the errors and export all of the other conversations that don't have the new field.
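One possible way to do that bypass, sketched under assumptions: filter the parsed export down to conversations whose parts are all strings, then hand only those to the tool. The function names and sample data below are hypothetical, and the layout (a list of conversations with a mapping of nodes) is inferred from the error paths earlier in this thread.

```python
from typing import Any, List, Mapping


def has_only_string_parts(convo: Mapping[str, Any]) -> bool:
    """Return True if every entry in every message.content.parts of `convo` is a string."""
    for node in (convo.get("mapping") or {}).values():
        # Assumption: message, content, or parts may be missing or null.
        message = node.get("message") or {}
        content = message.get("content") or {}
        parts = content.get("parts") or []
        if any(not isinstance(part, str) for part in parts):
            return False
    return True


def drop_unsupported(conversations: List[Mapping[str, Any]]) -> List[Mapping[str, Any]]:
    """Keep only the conversations the current string-only model could validate."""
    return [convo for convo in conversations if has_only_string_parts(convo)]


# Hypothetical stand-in for the parsed export.
sample = [
    {
        "title": "text only",
        "mapping": {"a": {"message": {"content": {"content_type": "text", "parts": ["hi"]}}}},
    },
    {
        "title": "with image",
        "mapping": {
            "b": {
                "message": {
                    "content": {
                        "content_type": "multimodal_text",
                        "parts": [{"content_type": "image_asset_pointer"}, "caption"],
                    }
                }
            }
        },
    },
]

kept = drop_unsupported(sample)
print([convo["title"] for convo in kept])
```

The filtered list could then be written back with `json.dump` and re-zipped before running convoviz. The trade-off is that any conversation containing an image is dropped entirely.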