langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

TypeError: Object of type Document is not JSON serializable #2222

Closed sergerdn closed 5 months ago

sergerdn commented 1 year ago

Any comments would be appreciated.

The issue is that the json module is unable to serialize the Document object, a custom class that inherits from BaseModel. The error message says that the Document object is not JSON serializable, meaning it cannot be converted into a JSON string. This is because the json module does not know how to serialize the BaseModel class or any of its subclasses. To fix the issue, we may need to provide a custom encoder, or use the jsonable_encoder function from the FastAPI library, which is designed to handle pydantic models like BaseModel.

import json

import chromadb
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

DB_DIR = "db"  # placeholder: local persist directory

def query_chromadb():
    client_settings = chromadb.config.Settings(
        chroma_db_impl="duckdb+parquet",
        persist_directory=DB_DIR,
        anonymized_telemetry=False
    )

    embeddings = OpenAIEmbeddings()

    vectorstore = Chroma(
        collection_name="langchain_store",
        embedding_function=embeddings,
        client_settings=client_settings,
        persist_directory=DB_DIR,
    )

    result = vectorstore.similarity_search_with_score(query="FREDERICK", k=4)
    print(result)
    print(json.dumps(result, indent=4, sort_keys=False))  # ERROR: TypeError: Object of type Document is not JSON serializable

def main():
    # init_chromadb()
    query_chromadb()

A minimal reproduction of the same error:

import json
from pydantic import BaseModel, Field

class Document(BaseModel):
    """Interface for interacting with a document."""

    page_content: str
    metadata: dict = Field(default_factory=dict)

doc = Document(page_content="Some page content", metadata={"author": "John Doe"})

print(json.dumps(doc))  # ERROR: TypeError: Object of type Document is not JSON serializable

Possible fixes:

import json

from pydantic import BaseModel, Field

class Document(BaseModel):
    """Interface for interacting with a document."""

    page_content: str
    metadata: dict = Field(default_factory=dict)

    def to_dict(self):
        return self.dict(by_alias=True, exclude_unset=True) # just an example!

    def to_json(self):
        return self.json(by_alias=True, exclude_unset=True) # just an example!

doc = Document(page_content="Some page content", metadata={"author": "John Doe"})

# Convert to dictionary and serialize
doc_dict = doc.to_dict()
doc_json = json.dumps(doc.to_dict())

## {"page_content": "Some page content", "metadata": {"author": "John Doe"}}
print(doc_json)

# Or use the custom to_json() method
doc_json = doc.to_json()
## {"page_content": "Some page content", "metadata": {"author": "John Doe"}}
print(doc_json)

Another approach:

import json
from fastapi.encoders import jsonable_encoder
from pydantic import BaseModel, Field

class Document(BaseModel):
    """Interface for interacting with a document."""

    page_content: str
    metadata: dict = Field(default_factory=dict)

doc = Document(page_content="Some page content", metadata={"author": "John Doe"})

print(json.dumps(jsonable_encoder(doc), indent=4))
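
And a third approach, using a custom JSONEncoder as mentioned at the top of this issue. This is a minimal sketch, assuming pydantic v1's .dict() API (on pydantic v2 it would be model_dump()), and it reuses the Document class from the snippet above:

import json
from pydantic import BaseModel

class PydanticEncoder(json.JSONEncoder):
    # Fall back to pydantic's own serialization for any BaseModel subclass
    def default(self, obj):
        if isinstance(obj, BaseModel):
            return obj.dict()
        return super().default(obj)

doc = Document(page_content="Some page content", metadata={"author": "John Doe"})

# json.dumps now handles Document objects (and lists/tuples containing them,
# e.g. results from similarity_search_with_score)
print(json.dumps(doc, cls=PydanticEncoder, indent=4))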

Do we need an API like doc.to_json() and/or doc.to_dict()? In that case it would hide the details of the model implementation from the end user.

gallaghercareer commented 1 year ago

I'm experiencing this error: "Object of type Document is not JSON serializable." I'm gonna bang my head against the wall until my head stops hurting. Thanks.

import pinecone
from langchain.document_loaders import S3FileLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Pinecone

# openai api key
openai_api_key = "..."  # redacted

# Connect to Pinecone API
pinecone_key = "..."  # redacted

pinecone.init(api_key=pinecone_key, environment='us-east1-gcp')

# check if index database already exists (only create index if it doesn't)
if 'index-test' not in pinecone.list_indexes():
    print('pinecone database not found')

# assign pinecone Index object
index = pinecone.Index('index-test')
hasVectors = None

# check Index object stats (vector count)
indexStats = index.describe_index_stats()
# print(indexStats)

# assign vector count to variable
num_vectors = indexStats.total_vector_count
# print(num_vectors)

if num_vectors > 0:
    # print(f"The 'my-index' index contains {num_vectors} vectors.")
    hasVectors = True
else:
    # print(f"The index does not contain any vectors")
    hasVectors = False

bucket_name = 'text-test-000000010001'
file_key = 'Whole30-Slow-Cooker-Freezer-Meal-Plan-from-New-Leaf-Wellness.pdf'

# use the document loader
loader = S3FileLoader(bucket_name, file_key)

# load the document from the loader
documents = loader.load()

# split the document into text chunks
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
split_texts = text_splitter.split_documents(documents)
print(split_texts)

# call openai wrapper class
embeddings = OpenAIEmbeddings(model_name="text-embedding-ada-002")

# embed the texts from the s3 bucket using openai wrapper class
embedded_docs = embeddings.embed_documents(split_texts)

# push up to pinecone
Pinecone.from_documents()

sergerdn commented 1 year ago

@gallaghercareer

Please provide code that reproduces the error and that anyone can run. The code should be as minimal as possible. For example, if you are loading files from a remote server, please modify your code to load files from a local folder instead.

Also, please wrap the code you post in fenced code blocks with Python syntax highlighting: https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks

gallaghercareer commented 1 year ago

Thanks, I figured out it has something to do with detectron2 (so far), which the unstructured reader I'm using for the S3 document loader depends on. Nothing was mentioned in your docs about this package potentially not working... my largest difficulty has been dependencies upon dependencies of issues :/. Image not ready but Pillow works, etc. I'll let you know how detectron2 on Windows x86 goes... I'm sorry, I don't think your docs are very clean or fleshed out; use Haystack as an example, they have a much more .NET/Microsoft-looking UI/UX. I'll keep this project going, though; I'm a determined person...

sergerdn commented 1 year ago

@gallaghercareer

Thank you for sharing your findings regarding the issue you have been facing.

I understand that dealing with dependencies and their dependencies can be challenging, and I appreciate your efforts in troubleshooting the problem. However, I want to clarify that I am not the owner of this project but rather an ordinary user, just like you.

I also appreciate your feedback on the documentation. I want to let you know that this project is open source, and anyone can contribute to it, including you. If you feel that the documentation can be improved, I encourage you to update it for your use case and open a pull request.

In the meantime, please let me know if you have any further questions or concerns, and I will be happy to assist you.

drew1two commented 1 year ago

And the answer is... texts[0] is already an instance of the Document class. The print output is showing the default string representation of a Python object, which includes the class name and the memory address at which the object is stored. So there's no need to create a new instance; we can just use texts[0] directly: doc = texts[0]
Now your additions work perfectly for me, sergerdn. Thank you very much :) I vote for this to be added to the schema.py file. Thanks again :)
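
A minimal sketch of what that looks like, assuming the custom to_dict() helper proposed earlier in this thread and a texts list produced by a text splitter:

doc = texts[0]  # already a Document; no need to construct a new one
doc_json = json.dumps(doc.to_dict(), indent=4)  # works once Document has the to_dict() helper
print(doc_json)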

drew1two commented 1 year ago

This modification really needs to be adopted, because without it my use case keeps throwing the error TypeError: Object of type Document is not JSON serializable. How can we push this up the list for adoption?

Gbillington1 commented 1 year ago

I also faced this issue when using json.dumps() to return the Document object from a function. A to_json() method on the Document object is all we need to fix this, but the fastapi.encoders.jsonable_encoder() solution is a solid workaround that I used to return the Document as JSON.
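
For reference, a minimal sketch of that jsonable_encoder workaround in a FastAPI route (the route and the sample document are hypothetical, and the Document import path may differ by langchain version):

from fastapi import FastAPI
from fastapi.encoders import jsonable_encoder
from langchain.schema import Document

app = FastAPI()

@app.get("/search")
def search():
    docs = [Document(page_content="Some page content", metadata={"author": "John Doe"})]
    # jsonable_encoder turns the pydantic-based Document objects into plain dicts,
    # which FastAPI can then serialize to JSON
    return jsonable_encoder(docs)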

itsjustmeemman commented 1 year ago

Encountered this error when upgrading from v229 to v235; it works fine at v229.

chain = create_structured_output_chain(OrderID, llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo-16k"), prompt=prompt)

On v229 the output is a dict with the extracted data:

{'order_id': 123456789}

With v235 the print output is order_id=123456789 and its type is <class '__main__.OrderID'>.

So I added this to turn it back into a dict, so that I would not get the error TypeError: Object of type OrderID is not JSON serializable:

output_dict = {"order_id": run_chain.order_id}
print(output_dict)
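
If OrderID is a plain pydantic v1 model, a slightly more general version of the same workaround would be (a sketch; on pydantic v2 it would be model_dump()):

output_dict = run_chain.dict()  # convert the returned pydantic object back into a dict
print(output_dict)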

arash-bizcover commented 1 year ago

This issue would have never happened in a non-strict-typed environment. Damn, I hate developers.

To stupidly work around this when using a function that returns lots of Documents (e.g. Chroma) and get a dict from each document, you can do something like this:

        docs = db.similarity_search(input)
        docs_dict = [{"page_content": doc.page_content, "metadata": doc.metadata} for doc in docs]

rmnegatives commented 12 months ago

This issue would have never happened in a non-strict-typed environment. Damn, I hate developers.

To stupidly workaround this when using a function that returns lots of Documents (e.x. Chroma) and get a dict from the document, you can do something like this:

        docs = db.similarity_search(input)
        docs_dict = [{"page_content": doc.page_content, "metadata": doc.metadata} for doc in docs]

Wow, I was stuck on this and had even decided to use the Pinecone client directly, but then hit another roadblock. I lucked into finding this thread, and this is the only solution that works! Thanks; the page_content and JSON serialization error had me lost.

FrancescoSaverioZuppichini commented 11 months ago

I don't get why, in one of the only places where using pydantic made sense (i.e. here), it is not used :(

OlajideOgun commented 9 months ago

Can we get a fix for this, please?

edbock commented 9 months ago

This issue would have never happened in a non-strict-typed environment. Damn, I hate developers.

To stupidly workaround this when using a function that returns lots of Documents (e.x. Chroma) and get a dict from the document, you can do something like this:

        docs = db.similarity_search(input)
        docs_dict = [{"page_content": doc.page_content, "metadata": doc.metadata} for doc in docs]

Thank you @arash-bizcover. I sense that I might have eventually figured it out, but you saved me from a lot of frustration. This hiccup is just a byproduct of module development that's moving so fast things haven't had time to settle. I'm thankful for communities like GitHub where we can support each other and keep our sanity.

mribbons commented 8 months ago

This issue would have never happened in a non-strict-typed environment. Damn, I hate developers.

To stupidly workaround this when using a function that returns lots of Documents (e.x. Chroma) and get a dict from the document, you can do something like this:

        docs = db.similarity_search(input)
        docs_dict = [{"page_content": doc.page_content, "metadata": doc.metadata} for doc in docs]

Thanks for this @arash-bizcover! I suggest using the existing to_json function; this way, if more class members are added to Document in the future, the code won't need to be updated (otherwise we only save page_content and metadata):

docs_dict = [doc.to_json() for doc in docs]

Then to revive:

from langchain_core.load import load
reloaded_docs = [load(doc) for doc in docs_dict]

Note that to_json in fact returns a dict, which is then JSON serializable.
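
Putting the two halves together, a minimal round-trip sketch (assuming a recent langchain_core; the exact structure of the dict returned by to_json() is an implementation detail):

import json

from langchain_core.documents import Document
from langchain_core.load import load

docs = [Document(page_content="Some page content", metadata={"author": "John Doe"})]

# serialize: to_json() returns a plain dict, so json.dumps can handle it
docs_dict = [doc.to_json() for doc in docs]
serialized = json.dumps(docs_dict, indent=4)

# revive: load() rebuilds Document objects from the parsed dicts
reloaded_docs = [load(d) for d in json.loads(serialized)]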

dcsan commented 4 months ago

I get this with a List of langchain_core.documents.base.Document also

something like this?

import json
from typing import List

# from chromadb import Documents
from langchain_core.documents.base import Document

def pp_json(json_thing, sort=True, indents=4):
    """Pretty-print a JSON string, a Document, a list of Documents, or any JSON-serializable object."""
    if isinstance(json_thing, str):
        print(json.dumps(json.loads(json_thing), sort_keys=sort, indent=indents))
    elif isinstance(json_thing, list) and isinstance(json_thing[0], Document):
        # List[Document]: convert each Document to a serializable dict first
        doc_dicts = [doc.to_json() for doc in json_thing]
        print(json.dumps(doc_dicts, sort_keys=sort, indent=indents))
    elif isinstance(json_thing, Document):
        doc_dict = json_thing.to_json()
        print(json.dumps(doc_dict, sort_keys=sort, indent=indents))
    else:
        print(json.dumps(json_thing, sort_keys=sort, indent=indents))
    return None

# print langchain documents
def pp_docs(docs: List[Document]):
    for n, doc in enumerate(docs):
        print(f"-- [DOC {n}]\n", doc.page_content)