Closed sergerdn closed 5 months ago
I'm experiencing this error: "Object of type Document is not JSON serializable." I'm gonna bang my head against the wall until my head stops hurting. Thanks.
```python
import pinecone
from langchain.document_loaders import S3FileLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

openai_api_key = "<redacted>"
pinecone_key = "<redacted>"

pinecone.init(api_key=pinecone_key, environment='us-east1-gcp')
if 'index-test' not in pinecone.list_indexes():
    print('pinecone database not found')

# assign pinecone Index object
index = pinecone.Index('index-test')
hasVectors = None

# check Index object stats (vector count)
indexStats = index.describe_index_stats()
# print(indexStats)

# assign vector count to variable
num_vectors = indexStats.total_vector_count
# print(num_vectors)
if num_vectors > 0:
    # print(f"The 'my-index' index contains {num_vectors} vectors.")
    hasVectors = True
else:
    # print(f"The index does not contain any vectors")
    hasVectors = False

bucket_name = 'text-test-000000010001'
file_key = 'Whole30-Slow-Cooker-Freezer-Meal-Plan-from-New-Leaf-Wellness.pdf'
loader = S3FileLoader(bucket_name, file_key)
documents = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
split_texts = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings(model_name="text-embedding-ada-002")
# NOTE: split_documents returns Document objects, not strings;
# embed_documents expects a list of strings, so passing Document
# objects here is what triggers the JSON serialization error
embedded_docs = embeddings.embed_documents(split_texts)

Pinecone.from_documents()
```
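For reference, the serialization failure itself doesn't need S3 or Pinecone at all. A minimal sketch, using a plain dataclass as a stand-in for langchain's `Document` (any custom object triggers the same `TypeError` from `json.dumps`):

```python
import json
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for langchain's Document (page_content + metadata)."""
    page_content: str
    metadata: dict


try:
    # json only knows primitives, lists, and dicts,
    # so any custom object raises TypeError
    json.dumps(Doc("hello", {"source": "test"}))
except TypeError as e:
    print(e)  # → Object of type Doc is not JSON serializable
```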
@gallaghercareer
Please provide code that can reproduce the error for anyone running it. The code should be as minimal as possible. For example, if you are loading files from a remote server, please modify your code to load files from a local folder instead.
Also, please use fenced code blocks with Python syntax highlighting when you post your code:
https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks
Thanks, I figured out it has something to do with detectron2 (so far), since the unstructured reader I'm using for the S3 document loader depends on it. Nothing was mentioned in your docs about this package potentially not working... My largest difficulty has been dependencies upon dependencies of issues :/ (Image not ready but Pillow works, etc., etc.). I'll let you know how detectron2 on Windows x86 goes... I'm sorry, but I don't think your docs are very clean or fleshed out; take Haystack as an example, they have a much more .NET/Microsoft-looking UI/UX. I'll keep this project going, though, because I'm a determined person...
@gallaghercareer
Thank you for sharing your findings regarding the issue you have been facing.
I understand that dealing with dependencies and their dependencies can be challenging, and I appreciate your efforts in troubleshooting the problem. However, I want to clarify that I am not the owner of this project but rather an ordinary user, just like you.
I also appreciate your feedback on documentation. I want to let you know that this project is open source, and anyone can contribute to it, including you. If you feel that our documentation can be improved, I encourage you to update it according to your use case and make a new pull request.
In the meantime, please let me know if you have any further questions or concerns, and I will be happy to assist you.
And the answer is... `texts[0]` is already an instance of the `Document` class. The print output is showing the default string representation of a Python object, which includes the class name and the memory address at which the object is stored. So there's no need to create a new instance; we can just use `texts[0]` directly: `doc = texts[0]`
Now your additions work perfectly for me, sergerdn. Thank you very much :) I vote for this to be added to the schema.py file. Thanks again :)
This modification really needs to be adopted, because without it, my use case keeps throwing the error `Object of type Document is not JSON serializable`. How can we push this up the list for adoption?
I also faced this issue when using `json.dumps()` to return the Document object from a function. A `__to_json__()` method is all we need on the Document object to fix this, but the `fastapi.encoders.jsonable_encoder()` solution is a solid workaround that I used to return the Document as JSON.
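If FastAPI isn't in the stack, the `default` hook of `json.dumps` gives the same effect with the stdlib alone. A hedged sketch, again with a dataclass stand-in for langchain's `Document`:

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Doc:
    """Stand-in for langchain's Document."""
    page_content: str
    metadata: dict


docs = [Doc("chunk one", {"page": 1}), Doc("chunk two", {"page": 2})]

# `default` is called for any object json.dumps can't serialize;
# returning a plain dict makes the whole list serializable
payload = json.dumps(docs, default=asdict)
print(payload)
```

The real `Document` is a pydantic model rather than a dataclass, so with langchain you'd return `{"page_content": o.page_content, "metadata": o.metadata}` from the hook instead of calling `asdict`.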
I encountered this error when upgrading from v229 to v235; it works fine at v229.
```python
chain = create_structured_output_chain(OrderID, llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo-16k"), prompt=prompt)
```

At v229, running this outputs a dict with the extracted data:

```
{'order_id': 123456789}
```

With v235, its print output is now:

```
order_id=123456789 <class '__main__.OrderID'>
```

So I added this to turn it back into a dict, so that I would not get the error `TypeError: Object of type OrderID is not JSON serializable`:

```python
output_dict = {"order_id": run_chain.order_id}
print(output_dict)
```
This issue would have never happened in a non-strictly-typed environment. Damn, I hate developers.
To stupidly work around this when using a function that returns lots of Documents (e.g. Chroma) and get a dict from each document, you can do something like this:

```python
docs = db.similarity_search(input)
docs_dict = [{"page_content": doc.page_content, "metadata": doc.metadata} for doc in docs]
```
Wow, I was stuck on this and had even decided to use the pinecone client directly, but then hit another roadblock. I was lucky to find this thread, and this is the only solution that works! Thanks; the `page_content` and JSON-serialization error had me lost.
I don't get why, in one of the only places where using pydantic made sense (so here), it is not used :(
can we get a fix for this please
Thank you @arash-bizcover. I sense that I might have eventually figured it out, but you saved me from a lot of frustration. This hiccup is just a byproduct of module development that's moving so fast things haven't had time to settle. I'm thankful for communities like GitHub where we can support each other and keep our sanity.
Thanks for this @arash-bizcover!
I suggest using the existing `to_json` method; that way, if more class members are added to Document in the future, the code won't need to be updated (otherwise we only save `page_content` and `metadata`):

```python
docs_dict = [doc.to_json() for doc in docs]
```

Then to revive:

```python
from langchain_core.load import load
reloaded_docs = [load(doc) for doc in docs_dict]
```

Note that `to_json` in fact returns a dict, which is then JSON serializable.
I get this with a list of `langchain_core.documents.base.Document` also.
something like this?
```python
import json
from typing import List

# from chromadb import Documents
from langchain_core.documents.base import Document


def pp_json(json_thing, sort=True, indents=4):
    if isinstance(json_thing, str):
        print(json.dumps(json.loads(json_thing), sort_keys=sort, indent=indents))
    elif isinstance(json_thing, list) and isinstance(json_thing[0], Document):
        # List[Document]
        doc_dicts = [doc.to_json() for doc in json_thing]
        print(json.dumps(doc_dicts, sort_keys=sort, indent=indents))
    elif isinstance(json_thing, Document):
        print(json.dumps(json_thing.to_json(), sort_keys=sort, indent=indents))
    else:
        print(json.dumps(json_thing, sort_keys=sort, indent=indents))


# print langchain documents
def pp_docs(docs: List[Document]):
    for n, doc in enumerate(docs):
        print(f"-- [DOC {n}]\n", doc.page_content)
```

(The original had a bug in the list branch: `type(json_thing[0] is Document)` takes the type of a boolean, which is always truthy, instead of checking the element's type.)
Any comments would be appreciated.
The issue is that the `json` module is unable to serialize the `Document` object, which is a custom class that inherits from `BaseModel`. The error message specifically says that the `Document` object is not JSON serializable, meaning it cannot be converted into a JSON string. This is likely because the `json` module does not know how to serialize the `BaseModel` class or any of its child classes. To fix the issue, we may need to provide a custom encoder or use the `jsonable_encoder` function from the FastAPI library, which is designed to handle pydantic models like `BaseModel`.
Possible fixes:

Another approach: do we need an API like `doc.to_json()` and/or `doc.to_dict()`? Because in this case it would hide the details of the model's realization from the end user.
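The custom-encoder route mentioned above can be sketched like this. `DocumentEncoder` and `FakeDoc` are hypothetical names; the encoder matches on a `page_content` attribute rather than importing langchain, so treat it as an illustration of the pattern, not the library's API:

```python
import json


class DocumentEncoder(json.JSONEncoder):
    """Fallback encoder for Document-like objects.

    Anything exposing `page_content` is flattened to a plain dict;
    everything else falls through to the default behaviour.
    """
    def default(self, o):
        if hasattr(o, "page_content"):
            return {"page_content": o.page_content,
                    "metadata": getattr(o, "metadata", {})}
        return super().default(o)


class FakeDoc:
    """Minimal stand-in for langchain's Document."""
    def __init__(self, page_content, metadata):
        self.page_content = page_content
        self.metadata = metadata


print(json.dumps([FakeDoc("hi", {"k": 1})], cls=DocumentEncoder))
```

Passing `cls=DocumentEncoder` to every `json.dumps` call site is the main cost of this approach, which is why a `to_dict()` method on `Document` itself would be the cleaner fix.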