neo4j-labs / llm-graph-builder

Neo4j graph construction from unstructured data
Apache License 2.0

Getting error while processing the document. #317

Open chinmaydeshpandecaxsol opened 1 month ago

chinmaydeshpandecaxsol commented 1 month ago

Log:

2024-05-16 11:39:23 2024-05-16 06:09:23,882 - Creating graph database connection to URI: neo4j+s://e6589783.databases.neo4j.io:7687, database: neo4j
2024-05-16 11:39:33 2024-05-16 06:09:33,018 - Successfully created graph database connection
2024-05-16 11:39:33 2024-05-16 06:09:33,019 - Process file name :new_york_city_example_itinerary.pdf
2024-05-16 11:39:33 2024-05-16 06:09:33,019 - File path:/code/src/merged_files/new_york_city_example_itinerary.pdf
2024-05-16 11:39:33 2024-05-16 06:09:33,020 - file new_york_city_example_itinerary.pdf processing
2024-05-16 11:39:33 2024-05-16 06:09:33,368 - Break down file into chunks
2024-05-16 11:39:33 2024-05-16 06:09:33,369 - Split file into smaller chunks
2024-05-16 11:39:34 2024-05-16 06:09:34,100 - No of chunks created from document 15
2024-05-16 11:39:34 2024-05-16 06:09:34,100 - new_york_city_example_itinerary.pdf
2024-05-16 11:39:34 2024-05-16 06:09:34,100 - <src.entities.source_node.sourceNode object at 0x7f4fea554730>
2024-05-16 11:39:34 2024-05-16 06:09:34,100 - Update source node properties
2024-05-16 11:39:34 2024-05-16 06:09:34,204 - Update the status as Processing
2024-05-16 11:39:34 2024-05-16 06:09:34,204 - creating FIRST_CHUNK and NEXT_CHUNK relationships between chunks
2024-05-16 11:39:34 2024-05-16 06:09:34,568 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2024-05-16 11:39:42 2024-05-16 06:09:42,490 - Use pytorch device_name: cpu
2024-05-16 11:39:42 2024-05-16 06:09:42,493 - Embedding: Using SentenceTransformer , Dimension:384
2024-05-16 11:39:42 2024-05-16 06:09:42,493 - embedding model:client=SentenceTransformer(
2024-05-16 11:39:42   (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
2024-05-16 11:39:42   (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
2024-05-16 11:39:42   (2): Normalize()
2024-05-16 11:39:42 ) model_name='all-MiniLM-L6-v2' cache_folder=None model_kwargs={} encode_kwargs={} multi_process=False show_progress=False and dimesion:384
2024-05-16 11:39:42 2024-05-16 06:09:42,493 - update embedding and vector index for chunks
2024-05-16 11:39:46 2024-05-16 06:09:46,774 - {"message": " Failed To Process File:new_york_city_example_itinerary.pdf or LLM Unable To Parse Content", "error_message": "'NoneType' object has no attribute 'upper'", "file_name": "new_york_city_example_itinerary.pdf", "status": "Failed", "url": "neo4j+s://e6589783.databases.neo4j.io:7687", "failed_count": 1, "source_type": "local file"}
2024-05-16 11:39:46 2024-05-16 06:09:46,774 - File Failed in extraction: {'message': ' Failed To Process File:new_york_city_example_itinerary.pdf or LLM Unable To Parse Content', 'error_message': "'NoneType' object has no attribute 'upper'", 'file_name': 'new_york_city_example_itinerary.pdf', 'status': 'Failed', 'url': 'neo4j+s://e6589783.databases.neo4j.io:7687', 'failed_count': 1, 'source_type': 'local file'}
2024-05-16 11:39:46 Traceback (most recent call last):
2024-05-16 11:39:46   File "/code/score.py", line 137, in extract_knowledge_graph_from_file
2024-05-16 11:39:46     result = await asyncio.to_thread(
2024-05-16 11:39:46   File "/usr/local/lib/python3.10/asyncio/threads.py", line 25, in to_thread
2024-05-16 11:39:46     return await loop.run_in_executor(None, func_call)
2024-05-16 11:39:46   File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
2024-05-16 11:39:46     result = self.fn(*self.args, **self.kwargs)
2024-05-16 11:39:46   File "/code/src/main.py", line 159, in extract_graph_from_file_local_file
2024-05-16 11:39:46     return processing_source(graph, model, file_name, pages, allowedNodes, allowedRelationship, merged_file_path)
2024-05-16 11:39:46   File "/code/src/main.py", line 253, in processing_source
2024-05-16 11:39:46     update_embedding_create_vector_index( graph, chunkId_chunkDoc_list, file_name)
2024-05-16 11:39:46   File "/code/src/make_relationships.py", line 52, in update_embedding_create_vector_index
2024-05-16 11:39:46     if isEmbedding.upper() == "TRUE":
2024-05-16 11:39:46 AttributeError: 'NoneType' object has no attribute 'upper'
2024-05-16 11:39:46 INFO: 172.19.0.1:39628 - "POST /extract HTTP/1.1" 200 OK
2024-05-16 11:39:46 2024-05-16 06:09:46,802 - Creating graph database connection to URI: neo4j+s://e6589783.databases.neo4j.io:7687, database: neo4j
2024-05-16 11:39:54 2024-05-16 06:09:54,442 - Successfully created graph database connection
2024-05-16 11:39:54 2024-05-16 06:09:54,554 - Exception in update KNN graph:list index out of range
2024-05-16 11:39:54 Traceback (most recent call last):
2024-05-16 11:39:54   File "/code/score.py", line 195, in update_similarity_graph
2024-05-16 11:39:54     result = await asyncio.to_thread(update_graph, graph)
2024-05-16 11:39:54   File "/usr/local/lib/python3.10/asyncio/threads.py", line 25, in to_thread
2024-05-16 11:39:54     return await loop.run_in_executor(None, func_call)
2024-05-16 11:39:54   File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
2024-05-16 11:39:54     result = self.fn(*self.args, **self.kwargs)
2024-05-16 11:39:54   File "/code/src/main.py", line 342, in update_graph
2024-05-16 11:39:54     return graph_DB_dataAccess.update_KNN_graph()
2024-05-16 11:39:54   File "/code/src/graphDB_dataAccess.py", line 88, in update_KNN_graph
2024-05-16 11:39:54     if index[0]['name'] == 'vector':
2024-05-16 11:39:54 IndexError: list index out of range
2024-05-16 11:39:54 INFO: 172.19.0.1:39628 - "POST /update_similarity_graph HTTP/1.1" 200 OK
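Both tracebacks in this log come from reading a value that is not there: `isEmbedding` is `None` when `.upper()` is called, and `index` is an empty list when `index[0]` is read. The sketch below shows one way such lookups could be guarded. It is not the project's actual fix. The helper names `env_flag` and `first_index_is`, and the variable name `IS_EMBEDDING`, are hypothetical, and it assumes `isEmbedding` is read from an environment variable, which the log does not confirm.

```python
import os


def env_flag(name, default=False, environ=None):
    """Read a TRUE/FALSE style environment variable without crashing when it is unset."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        # Unset variable: return the default instead of calling .upper() on None.
        return default
    return raw.strip().upper() == "TRUE"


def first_index_is(indexes, name):
    """Return True only if the list is non-empty and its first record has this name."""
    # bool(indexes) short-circuits on an empty list, so indexes[0] is never
    # evaluated when no index record was returned.
    return bool(indexes) and indexes[0].get("name") == name


if __name__ == "__main__":
    # "IS_EMBEDDING" is a hypothetical variable name used only for illustration.
    print(env_flag("IS_EMBEDDING"))
    print(first_index_is([], "vector"))
```

With guards like these, a missing setting or a missing vector index would produce a plain `False` that the caller can log or act on, rather than an `AttributeError` or `IndexError`.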

chinmaydeshpandecaxsol commented 1 month ago

Hi @jexp, I need help here.

jexp commented 1 month ago

@chinmaydeshpandecaxsol can you share the example file?

chinmaydeshpandecaxsol commented 1 month ago

new_york_city_example_itinerary.pdf

@jexp I have used this file.

Thanks for your response. This error has been solved in the DEV branch.

chinmaydeshpandecaxsol commented 1 month ago

@jexp there is a new error in the DEV branch: the backend cannot read OPENAI_API_KEY from the .env file, even though the key is set there. @aashipandya please check the following log:

backend | 2024-05-17 04:45:04,642 - Process file name :new_york_city_example_itinerary.pdf
backend | 2024-05-17 04:45:04,642 - File path:/code/src/merged_files/new_york_city_example_itinerary.pdf
backend | 2024-05-17 04:45:04,643 - file new_york_city_example_itinerary.pdf processing
backend | 2024-05-17 04:45:04,961 - Break down file into chunks
backend | 2024-05-17 04:45:04,961 - Split file into smaller chunks
backend | 2024-05-17 04:45:21,572 - No of chunks created from document 15
backend | 2024-05-17 04:45:21,572 - new_york_city_example_itinerary.pdf
backend | 2024-05-17 04:45:21,572 - <src.entities.source_node.sourceNode object at 0x7f5621ae1a20>
backend | 2024-05-17 04:45:21,572 - Update source node properties
backend | 2024-05-17 04:45:25,776 - Update the status as Processing
backend | 2024-05-17 04:45:25,776 - creating FIRST_CHUNK and NEXT_CHUNK relationships between chunks
backend | 2024-05-17 04:45:26,082 - Embedding status is TRUE for file: new_york_city_example_itinerary.pdf
backend | 2024-05-17 04:45:26,082 - Get graph document list from models
backend | 2024-05-17 04:45:26,082 - allowedNodes: [], allowedRelationship: []
backend | 2024-05-17 04:45:26,082 - Combining 10 chunks before sending request to LLM
backend | 2024-05-17 04:45:26,183 - File Failed in extraction: {'message': ' Failed To Process File:new_york_city_example_itinerary.pdf or LLM Unable To Parse Content', 'error_message': '1 validation error for ChatOpenAI\n__root__\n Did not find openai_api_key, please add an environment variable OPENAI_API_KEY which contains it, or pass openai_api_key as a named parameter. (type=value_error)', 'file_name': 'new_york_city_example_itinerary.pdf', 'status': 'Failed', 'db_url': 'neo4j+s://e6589783.databases.neo4j.io:7687', 'failed_count': 1, 'source_type': 'local file'}
backend | Traceback (most recent call last):
backend |   File "/code/score.py", line 138, in extract_knowledge_graph_from_file
backend |     result = await asyncio.to_thread(
backend |   File "/usr/local/lib/python3.10/asyncio/threads.py", line 25, in to_thread
backend |     return await loop.run_in_executor(None, func_call)
backend |   File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
backend |     result = self.fn(*self.args, **self.kwargs)
backend |   File "/code/src/main.py", line 159, in extract_graph_from_file_local_file
backend |     return processing_source(graph, model, file_name, pages, allowedNodes, allowedRelationship, merged_file_path)
backend |   File "/code/src/main.py", line 360, in processing_source
backend |     graph_documents = generate_graphDocuments(model, graph, chunkId_chunkDoc_list, allowedNodes, allowedRelationship)
backend |   File "/code/src/generate_graphDocuments_from_llm.py", line 34, in generate_graphDocuments
backend |     graph_documents = get_graph_from_OpenAI(model_version, graph, chunkId_chunkDoc_list, allowedNodes, allowedRelationship)
backend |   File "/code/src/openAI_llm.py", line 466, in get_graph_from_OpenAI
backend |     llm = ChatOpenAI(model= model_version, temperature=0)
backend |   File "/usr/local/lib/python3.10/site-packages/pydantic/v1/main.py", line 341, in __init__
backend |     raise validation_error
backend | pydantic.v1.error_wrappers.ValidationError: 1 validation error for ChatOpenAI
backend | __root__
backend |   Did not find openai_api_key, please add an environment variable OPENAI_API_KEY which contains it, or pass openai_api_key as a named parameter. (type=value_error)
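The `ValidationError` says that `ChatOpenAI` found no `OPENAI_API_KEY` in the process environment and was given no `openai_api_key` argument. A key in a `.env` file on disk only helps if something loads that file into the backend process, for example a dotenv loader or the container runtime. Whether that is what went wrong here is not confirmed in this thread. As a rough sketch, a startup check along these lines would report the problem before any file is processed. The helper names `missing_env_vars` and `require_env` are hypothetical and not part of the project.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names from `required` that are unset or blank in the environment."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name, "").strip()]


def require_env(required, environ=None):
    """Raise a descriptive error if any required variable is missing."""
    missing = missing_env_vars(required, environ)
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
            + ". Check that the .env file is actually loaded into this process."
        )


if __name__ == "__main__":
    # Prints an empty list when the key is visible to this process.
    print(missing_env_vars(["OPENAI_API_KEY"]))
```

Running the same check inside the backend container shows whether the key reached the process, independent of what the `.env` file contains.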

luzidl commented 1 month ago

Similar issue here:

backend | 2024-06-04 14:02:14,323 - Chunk File Path: /code/chunks/AnaCredit1.pdf_part_1
backend | 2024-06-04 14:02:14,338 - Merged File Path: /code/merged_files
backend | 2024-06-04 14:02:14,338 - Chunk File Path While Merging Parts:/code/chunks/AnaCredit1.pdf_part_1
backend | 2024-06-04 14:02:14,339 - Chunks merged successfully and return file size
backend | 2024-06-04 14:02:14,339 - File merged successfully
backend | 2024-06-04 14:02:14,339 - creating source node if does not exist
backend | 2024-06-04 14:02:14,371 - closing connection for upload api
backend | 2024-06-04 14:02:19,801 - File path:/code/merged_files/AnaCredit1.pdf
backend | 2024-06-04 14:02:19,802 - Process file name :AnaCredit1.pdf
backend | 2024-06-04 14:02:19,803 - file AnaCredit1.pdf processing
backend | 2024-06-04 14:02:22,520 - Break down file into chunks
backend | 2024-06-04 14:02:22,520 - Split file into smaller chunks
backend | 2024-06-04 14:02:22,680 - AnaCredit1.pdf
backend | 2024-06-04 14:02:22,680 - <src.entities.source_node.sourceNode object at 0xffff1d5f3e80>
backend | 2024-06-04 14:02:22,680 - Update source node properties
backend | 2024-06-04 14:02:22,705 - Update the status as Processing
backend | 2024-06-04 14:02:22,762 - file AnaCredit1.pdf deleted successfully
backend | 2024-06-04 14:02:22,762 - File Failed in extraction: {'message': 'Failed To Process File:AnaCredit1.pdf or LLM Unable To Parse Content ', 'error_message': "int() argument must be a string, a bytes-like object or a real number, not 'NoneType'", 'file_name': 'AnaCredit1.pdf', 'status': 'Failed', 'db_url': 'neo4j+s://0e3a48d1.databases.neo4j.io:7687', 'failed_count': 1, 'source_type': 'local file'}
backend | Traceback (most recent call last):
backend |   File "/code/score.py", line 163, in extract_knowledge_graph_from_file
backend |     result = await asyncio.to_thread(
backend |   File "/usr/local/lib/python3.10/asyncio/threads.py", line 25, in to_thread
backend |     return await loop.run_in_executor(None, func_call)
backend |   File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
backend |     result = self.fn(*self.args, **self.kwargs)
backend |   File "/code/src/main.py", line 162, in extract_graph_from_file_local_file
backend |     return processing_source(graph, model, file_name, pages, allowedNodes, allowedRelationship, merged_file_path)
backend |   File "/code/src/main.py", line 252, in processing_source
backend |     update_graph_chunk_processed = int(os.environ.get('UPDATE_GRAPH_CHUNKS_PROCESSED'))
backend | TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
backend | 2024-06-04 14:02:22,762 - closing connection for extract api
backend | 2024-06-04 14:02:23,021 - Exception in update KNN graph:list index out of range
backend | Traceback (most recent call last):
backend |   File "/code/score.py", line 228, in update_similarity_graph
backend |     result = await asyncio.to_thread(update_graph, graph)
backend |   File "/usr/local/lib/python3.10/asyncio/threads.py", line 25, in to_thread
backend |     return await loop.run_in_executor(None, func_call)
backend |   File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
backend |     result = self.fn(*self.args, **self.kwargs)
backend |   File "/code/src/main.py", line 375, in update_graph
backend |     return graph_DB_dataAccess.update_KNN_graph()
backend |   File "/code/src/graphDB_dataAccess.py", line 128, in update_KNN_graph
backend |     if index[0]['name'] == 'vector':
backend | IndexError: list index out of range
backend | 2024-06-04 14:02:23,021 - closing connection for update_similarity_graph api
backend | 2024-06-04 14:02:23,493 - Request disconnected
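The `TypeError` in this log is raised by `int(os.environ.get('UPDATE_GRAPH_CHUNKS_PROCESSED'))`: `os.environ.get` returns `None` when the variable is unset, and `int(None)` fails. A minimal sketch of a more tolerant read is shown below. The helper `env_int` is hypothetical and the fallback of 20 is a made-up value for illustration, not the project's default. Setting the variable in the backend's environment avoids the error without any code change.

```python
import os


def env_int(name, default, environ=None):
    """Read an integer environment variable, falling back to `default` when unset or blank."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None or not raw.strip():
        # Unset or blank: use the fallback instead of passing None to int().
        return default
    try:
        return int(raw)
    except ValueError:
        # Name the offending variable so the log points at the real cause.
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None


if __name__ == "__main__":
    # 20 is a hypothetical fallback chosen for this example only.
    print(env_int("UPDATE_GRAPH_CHUNKS_PROCESSED", 20))
```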