google-gemini / generative-ai-python

The official Python library for the Google Gemini API
https://pypi.org/project/google-generativeai/
Apache License 2.0

GoogleGenerativeAIError: Error embedding content: 'utf-8' codec can't encode character '\ud835' in position 897: surrogates not allowed #212

Open NitishPal2013 opened 6 months ago

NitishPal2013 commented 6 months ago

Description of the bug:

This error occurs when the input text contains the character '\ud835' (here at position 897). The failure happens on the client side, while the request is being built, before anything is sent to the model. '\ud835' is a surrogate code point, and an unpaired surrogate cannot be encoded as UTF-8.
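The failure can be reproduced without the library at all. The sketch below is a minimal illustration, assuming a made-up string (`text`) that places a lone `'\ud835'` at index 897; it is not the reporter's actual document.

```python
# Hypothetical input: 897 ordinary characters followed by a lone high surrogate.
text = "x" * 897 + "\ud835"

try:
    text.encode("utf-8")  # strict UTF-8 refuses unpaired surrogates
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

print(error_message)

# The same code unit is harmless once it is joined with its low-surrogate partner:
# U+D835 followed by U+DC00 together stand for U+1D400 (MATHEMATICAL BOLD CAPITAL A).
paired = "\ud835\udc00".encode("utf-16", "surrogatepass").decode("utf-16")
print(paired == "\U0001d400")
```

Characters such as U+1D400 come from mathematical text, which is a likely (but unconfirmed) source of the stray surrogate when a PDF or web page was split into chunks.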

Actual vs expected behavior:

The embedding should complete without any error, or the library should skip these special characters, since they matter far less than the rest of the text.

Any other information you'd like to share?

The full error I got is this:


UnicodeEncodeError                        Traceback (most recent call last)
File u:\GENAI_Gemini\venv\Lib\site-packages\langchain_google_genai\embeddings.py:79, in GoogleGenerativeAIEmbeddings._embed(self, texts, task_type, title)
     78 try:
---> 79     result = genai.embed_content(
     80         model=self.model,
     81         content=texts,
     82         task_type=task_type,
     83         title=title,
     84     )
     85 except Exception as e:

File u:\GENAI_Gemini\venv\Lib\site-packages\google\generativeai\embedding.py:154, in embed_content(model, content, task_type, title, client)
    148 requests = (
    149     glm.EmbedContentRequest(
    150         model=model, content=content_types.to_content(c), task_type=task_type, title=title
    151     )
    152     for c in content
    153 )
--> 154 for batch in _batched(requests, EMBEDDING_MAX_BATCH_SIZE):
    155     embedding_request = glm.BatchEmbedContentsRequest(model=model, requests=batch)

File u:\GENAI_Gemini\venv\Lib\site-packages\google\generativeai\embedding.py:150, in <genexpr>(.0)
    147 result = {"embedding": []}
    148 requests = (
    149     glm.EmbedContentRequest(
--> 150         model=model, content=content_types.to_content(c), task_type=task_type, title=title
    151     )
    152     for c in content
    153 )
    154 for batch in _batched(requests, EMBEDDING_MAX_BATCH_SIZE):

File u:\GENAI_Gemini\venv\Lib\site-packages\google\generativeai\types\content_types.py:205, in to_content(content)
    203 else:
    204     # Maybe this is a Part?
--> 205     return glm.Content(parts=[to_part(content)])

File u:\GENAI_Gemini\venv\Lib\site-packages\google\generativeai\types\content_types.py:169, in to_part(part)
    168 elif isinstance(part, str):
--> 169     return glm.Part(text=part)
    170 else:
    171     # Maybe it can be turned into a blob?

File u:\GENAI_Gemini\venv\Lib\site-packages\proto\message.py:615, in Message.__init__(self, mapping, ignore_unknown_fields, **kwargs)
    614 # Create the internal protocol buffer.
--> 615 super().__setattr__("_pb", self._meta.pb(**params))

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud835' in position 897: surrogates not allowed

The above exception was the direct cause of the following exception:

GoogleGenerativeAIError                   Traceback (most recent call last)
Cell In[52], line 1
----> 1 docSearch = Pinecone.from_documents(splits, GoogleGeminiEmbeddings, index_name="deeplearning",)

File u:\GENAI_Gemini\venv\Lib\site-packages\langchain_core\vectorstores.py:508, in VectorStore.from_documents(cls, documents, embedding, **kwargs)
    506 texts = [d.page_content for d in documents]
    507 metadatas = [d.metadata for d in documents]
--> 508 return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)

File u:\GENAI_Gemini\venv\Lib\site-packages\langchain_pinecone\vectorstores.py:434, in Pinecone.from_texts(cls, texts, embedding, metadatas, ids, batch_size, text_key, namespace, index_name, upsert_kwargs, pool_threads, embeddings_chunk_size, **kwargs)
    431 pinecone_index = cls.get_pinecone_index(index_name, pool_threads)
    432 pinecone = cls(pinecone_index, embedding, text_key, namespace, **kwargs)
--> 434 pinecone.add_texts(
    435     texts,
    436     metadatas=metadatas,
    437     ids=ids,
    438     namespace=namespace,
    439     batch_size=batch_size,
    440     embedding_chunk_size=embeddings_chunk_size,
    441     **(upsert_kwargs or {}),
    442 )
    443 return pinecone

File u:\GENAI_Gemini\venv\Lib\site-packages\langchain_pinecone\vectorstores.py:154, in Pinecone.add_texts(self, texts, metadatas, ids, namespace, batch_size, embedding_chunk_size, async_req, **kwargs)
    152 chunk_ids = ids[i : i + embedding_chunk_size]
    153 chunk_metadatas = metadatas[i : i + embedding_chunk_size]
--> 154 embeddings = self._embedding.embed_documents(chunk_texts)
    155 async_res = [
    156     self._index.upsert(
    157         vectors=batch,
   (...)
    164     )
    165 ]
    166 [res.get() for res in async_res]

File u:\GENAI_Gemini\venv\Lib\site-packages\langchain_google_genai\embeddings.py:103, in GoogleGenerativeAIEmbeddings.embed_documents(self, texts, batch_size)
     92 """Embed a list of strings. Vertex AI currently
     93 sets a max batch size of 5 strings.
     94
   (...)
    100     List of embeddings, one for each text.
    101 """
    102 task_type = self.task_type or "retrieval_document"
--> 103 return self._embed(texts, task_type=task_type)

File u:\GENAI_Gemini\venv\Lib\site-packages\langchain_google_genai\embeddings.py:86, in GoogleGenerativeAIEmbeddings._embed(self, texts, task_type, title)
     79     result = genai.embed_content(
     80         model=self.model,
     81         content=texts,
     82         task_type=task_type,
     83         title=title,
     84     )
     85 except Exception as e:
---> 86     raise GoogleGenerativeAIError(f"Error embedding content: {e}") from e
     87 return result["embedding"]

GoogleGenerativeAIError: Error embedding content: 'utf-8' codec can't encode character '\ud835' in position 897: surrogates not allowed
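The error message reports only a position inside one string, not which document in the batch is affected. A small scan over the texts before calling the vector store can narrow that down. The helper below is a sketch, not part of any of the libraries in the traceback; `find_surrogates` and `SURROGATE` are hypothetical names, and the sample list stands in for `[d.page_content for d in splits]`.

```python
import re

# Matches any code point in the UTF-16 surrogate range (U+D800 to U+DFFF).
SURROGATE = re.compile(r"[\ud800-\udfff]")


def find_surrogates(texts):
    """Return (text_index, char_position, escaped_char) for each surrogate found."""
    hits = []
    for index, text in enumerate(texts):
        for match in SURROGATE.finditer(text):
            escaped = f"\\u{ord(match.group()):04x}"
            hits.append((index, match.start(), escaped))
    return hits


# Made-up sample data: the second text has a lone surrogate at position 897.
sample_texts = ["a clean chunk", "x" * 897 + "\ud835"]
hits = find_surrogates(sample_texts)
print(hits)
```

An empty result means none of the texts should trigger this particular `UnicodeEncodeError`.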

eamag commented 2 months ago

Same error here. It can be worked around with prompt.replace('\ud835', ''), but the library should handle this automatically.
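Note that `replace('\ud835', '')` removes only that one code point; any other unpaired surrogate in the text would still raise the same error. A more general workaround is sketched below, assuming it is acceptable to drop unpaired surrogates entirely. `sanitize_text` is a hypothetical helper, not a function of `google-generativeai` or LangChain: it first joins valid high/low surrogate pairs into the character they represent, then discards whatever surrogates remain.

```python
import re

# A high surrogate immediately followed by a low surrogate forms a valid pair.
_SURROGATE_PAIR = re.compile(r"[\ud800-\udbff][\udc00-\udfff]")


def _join_pair(match):
    """Combine a high/low surrogate pair into the single code point it encodes."""
    high, low = match.group()
    code_point = 0x10000 + ((ord(high) - 0xD800) << 10) + (ord(low) - 0xDC00)
    return chr(code_point)


def sanitize_text(text):
    """Return text that is safe to encode as UTF-8.

    Valid surrogate pairs are merged; unpaired surrogates are dropped.
    """
    merged = _SURROGATE_PAIR.sub(_join_pair, text)
    # errors="ignore" silently skips anything UTF-8 cannot encode,
    # which at this point is only the remaining lone surrogates.
    return merged.encode("utf-8", "ignore").decode("utf-8")


cleaned = sanitize_text("bold \ud835\udc00 and stray \ud835 here")
print(cleaned.encode("utf-8"))  # encodes without raising
```

Applied to the report above, this would mean cleaning each document's `page_content` before passing the documents to `Pinecone.from_documents`; whether dropping the characters is acceptable depends on the data.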