pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: DataFrame.to_json fails when the DataFrame contains UUID values #59132

Open grieve54706 opened 3 days ago

grieve54706 commented 3 days ago

Pandas version checks

Reproducible Example

import uuid
import pandas as pd

pd.DataFrame({"uuid": [uuid.uuid4()]}).to_json()

Issue Description

If the DataFrame contains UUID values, to_json() fails. It raises an error with one of two messages: "Unsupported UTF-8 sequence length when encoding string" or "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 183: invalid start byte".

Expected Behavior

It should serialize uuid.UUID instances to RFC 4122 format, e.g., f81d4fae-7dec-11d0-a765-00a0c91e6bf6.

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.11.0.final.0
python-bits : 64
OS : Darwin
OS-release : 23.4.0
Version : Darwin Kernel Version 23.4.0: Fri Mar 15 00:12:41 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T8103
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 70.1.1
pip : 24.0
Cython : None
pytest : 8.2.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 1.4.6
psycopg2 : 2.9.9
jinja2 : 3.1.4
IPython : None
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 2.0.30
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

asishm commented 2 days ago

You can achieve this by passing default_handler to df.to_json:

pd.DataFrame({"uuid": [uuid.uuid4()]}).to_json(default_handler=str)

The stdlib json library doesn't serialize UUIDs natively either (pandas uses a vendored version of ujson, IIRC):

import json
import uuid
json.dumps({"a": uuid.uuid4()})  # raises TypeError: Object of type UUID is not JSON serializable
json.dumps({"a": uuid.uuid4()}, default=str)  # works
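Note that default=str stringifies every object the encoder cannot handle, not only UUIDs. If that is too broad, a narrower fallback is possible. The following is a minimal sketch using only the stdlib; uuid_default is a hypothetical helper name, not part of json or pandas.

```python
import json
import uuid


def uuid_default(obj):
    """Fallback for json.dumps: convert UUIDs only, reject everything else."""
    if isinstance(obj, uuid.UUID):
        # str(UUID) gives the canonical hyphenated hex form
        return str(obj)
    # Mirror the stdlib behaviour for types we do not want to stringify silently
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


value = uuid.UUID("f81d4fae-7dec-11d0-a765-00a0c91e6bf6")
encoded = json.dumps({"a": value}, default=uuid_default)
print(encoded)
```

The same callable could presumably be passed as default_handler to to_json in place of str, though that is not tested here.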
Siddharth-Latthe-07 commented 21 hours ago

@grieve54706 The issue arises because pandas does not natively support serializing uuid.UUID instances to JSON, so calling to_json() on a DataFrame that contains UUID objects results in encoding errors. To get the expected behaviour, convert the UUID objects to their string representations before serializing the DataFrame to JSON:

import uuid
import pandas as pd

# Create a DataFrame with a UUID column
df = pd.DataFrame({"uuid": [uuid.uuid4()]})

# Convert UUID objects to strings
df['uuid'] = df['uuid'].astype(str)

# Serialize the DataFrame to JSON
json_data = df.to_json()
print(json_data)

Please let me know if the above works. Thanks.

grieve54706 commented 2 hours ago

Thanks, guys. I think your suggestions all work.

I provide a tool that connects to many databases and loads the data into pandas for other people, so I do not know in advance which columns contain UUIDs, and different databases can return other kinds of values that also end up with dtype object. I found that orjson serializes UUID to a string by default. I'm curious: should pandas to_json follow RFC 4122 too?
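
For the case where the UUID columns are not known in advance, one option is to scan the object-dtype columns and stringify only the UUID values before calling to_json. This is a minimal sketch, not pandas behaviour: stringify_uuids is a hypothetical helper, and the sample frame is made up for illustration.

```python
import json
import uuid

import pandas as pd


def stringify_uuids(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with uuid.UUID values in object columns replaced by strings."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        # Only UUID instances are converted; other objects are left untouched
        out[col] = out[col].map(lambda v: str(v) if isinstance(v, uuid.UUID) else v)
    return out


u = uuid.UUID("f81d4fae-7dec-11d0-a765-00a0c91e6bf6")
df = pd.DataFrame({"id": [u], "name": ["a"]})  # illustrative data only
payload = stringify_uuids(df).to_json()
print(payload)
```

This costs an extra pass over each object column, so for large frames passing default_handler to to_json, as suggested above, may be the cheaper choice.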