neo4j-labs / llm-graph-builder

Neo4j graph construction from unstructured data
Apache License 2.0

Trying to solve this error: TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType' #414

Closed aneeshsathe closed 3 weeks ago

aneeshsathe commented 3 weeks ago

Hi, I've been trying for a few weeks now but keep running into the error in the trace below.

I'm assuming there is some error in my config and that the files can't be read or passed along. I get this error for both local files and Wikipedia. The Wikipedia option creates a file, but the graph can't be created.

Error Trace

backend   | 2024-06-11 02:34:33,314 - wikipedia query ids = ['dermatitis']
backend   | 2024-06-11 02:34:33,314 - Creating source node for dermatitis, en
backend   | 2024-06-11 02:34:36,003 - creating source node if does not exist
backend   | 2024-06-11 02:34:36,122 - closing connection for url/scan api
backend   | 2024-06-11 02:34:41,955 - Total Pages from Wikipedia = 1
backend   | 2024-06-11 02:34:42,020 - Break down file into chunks
backend   | 2024-06-11 02:34:42,022 - Split file into smaller chunks
backend   | 2024-06-11 02:34:42,517 - dermatitis
backend   | 2024-06-11 02:34:42,517 - <src.entities.source_node.sourceNode object at 0xffff2001ca00>
backend   | 2024-06-11 02:34:42,517 - Update source node properties
backend   | 2024-06-11 02:34:42,580 - Update the status as Processing
backend   | 2024-06-11 02:34:42,731 - File Failed in extraction: {'message': 'Failed To Process File:dermatitis or LLM Unable To Parse Content ', 'error_message': "int() argument must be a string, a bytes-like object or a real number, not 'NoneType'", 'file_name': 'dermatitis', 'status': 'Failed', 'db_url': '[redacted]', 'failed_count': 1, 'source_type': 'Wikipedia'}
backend   | Traceback (most recent call last):
backend   |   File "/code/score.py", line 175, in extract_knowledge_graph_from_file
backend   |     result = await asyncio.to_thread(
backend   |   File "/usr/local/lib/python3.10/asyncio/threads.py", line 25, in to_thread
backend   |     return await loop.run_in_executor(None, func_call)
backend   |   File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
backend   |     result = self.fn(*self.args, **self.kwargs)
backend   |   File "/code/src/main.py", line 197, in extract_graph_from_file_Wikipedia
backend   |     return processing_source(graph, model, file_name, pages, allowedNodes, allowedRelationship)
backend   |   File "/code/src/main.py", line 253, in processing_source
backend   |     update_graph_chunk_processed = int(os.environ.get('UPDATE_GRAPH_CHUNKS_PROCESSED'))
backend   | TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
backend   | 2024-06-11 02:34:42,733 - closing connection for extract api
backend   | 2024-06-11 02:34:43,452 - Exception in update KNN graph:list index out of range
backend   | Traceback (most recent call last):
backend   |   File "/code/score.py", line 228, in update_similarity_graph
backend   |     result = await asyncio.to_thread(update_graph, graph)
backend   |   File "/usr/local/lib/python3.10/asyncio/threads.py", line 25, in to_thread
backend   |     return await loop.run_in_executor(None, func_call)
backend   |   File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
backend   |     result = self.fn(*self.args, **self.kwargs)
backend   |   File "/code/src/main.py", line 376, in update_graph
backend   |     return graph_DB_dataAccess.update_KNN_graph()
backend   |   File "/code/src/graphDB_dataAccess.py", line 129, in update_KNN_graph
backend   |     if index[0]['name'] == 'vector':
backend   | IndexError: list index out of range
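
The second traceback looks like a separate symptom: `index[0]` raises IndexError whenever `index` is an empty list, which means the index lookup returned no rows (presumably because the failed extraction never created a vector index; that is a guess from the trace, not something confirmed in the code). A minimal sketch of a guard, assuming the lookup returns a list of dicts. The name `has_vector_index` and the sample rows are hypothetical and not taken from the project:

```python
def has_vector_index(index_rows: list[dict]) -> bool:
    """Return True only if a first row exists and it is named 'vector'.

    Hypothetical helper for illustration; not part of llm-graph-builder.
    """
    # An empty list short-circuits here instead of raising IndexError.
    return bool(index_rows) and index_rows[0].get("name") == "vector"


# Example with made-up rows: an empty result no longer crashes the check.
print(has_vector_index([]))
print(has_vector_index([{"name": "vector"}]))
```

With a check like this, the KNN update could be skipped cleanly when no index exists instead of failing with a second exception.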

Backend Dockerfile:

FROM python:3.10
WORKDIR /code
ENV PORT 8000
EXPOSE 8000
COPY . /code
RUN apt-get update \
    && apt-get install -y libgl1-mesa-glx cmake \
    && apt-get install -y poppler-utils \
    && apt install -y tesseract-ocr \
    && export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH \
    && pip install --no-cache-dir --upgrade -r /code/requirements.txt

# CMD ["uvicorn", "score:app", "--host", "0.0.0.0", "--port", "8000","--workers", "4"]
CMD ["gunicorn", "score:app","--workers","4","--worker-class","uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:8000", "--timeout", "300"]

Frontend Dockerfile:

# Step 1: Build the React application
FROM node:20 AS build
# ENV BACKEND_API_URL "https://dev-backend-dcavk67s4a-uc.a.run.app"
# ENV BACKEND_PROCESSING_URL "https://dev-processing-backend-dcavk67s4a-uc.a.run.app"
#ENV BLOOM_URL "https://bloom-latest.s3.eu-west-2.amazonaws.com/assets/index.html?connectURL={CONNECT_URL}&search=Show+me+a+graph"
ENV BLOOM_URL "https://workspace-preview.neo4j.io/workspace/explore?connectURL={CONNECT_URL}&search=Show+me+a+graph&featureGenAISuggestions=true&featureGenAISuggestionsInternal=true"
ENV REACT_APP_SOURCES ""
ENV LLM_MODELS ""
ENV TIME_PER_CHUNK 4
ENV ENV "DEV"
ENV GOOGLE_CLIENT_ID "[redacted]"
ENV CHUNK_SIZE 5242880
WORKDIR /app
COPY package.json yarn.lock ./
RUN yarn add @neo4j-nvl/base @neo4j-nvl/react
RUN yarn install
COPY . ./
RUN yarn run build

# Step 2: Serve the application using Nginx
FROM nginx:alpine
COPY --from=build /app/dist /usr/share/nginx/html
COPY nginx/nginx.conf /etc/nginx/conf.d/default.conf

EXPOSE 8080
CMD ["nginx", "-g", "daemon off;"]

Backend .env

OPENAI_API_KEY = "[redacted]"
DIFFBOT_API_KEY = "[redacted]"
NEO4J_URI = "[redacted]"
NEO4J_USERNAME = "[redacted]"
NEO4J_PASSWORD = "[redacted]"
NEO4J_DATABASE = "neo4j"
AWS_ACCESS_KEY_ID =  ""
AWS_SECRET_ACCESS_KEY = ""
EMBEDDING_MODEL = "text-embedding-ada-002" # I have tried "openai" here but same error
IS_EMBEDDING = "TRUE"
KNN_MIN_SCORE = ""
LANGCHAIN_API_KEY = "[redacted]"
LANGCHAIN_PROJECT = "default"
LANGCHAIN_TRACING_V2 = ""
LANGCHAIN_ENDPOINT = ""
NUMBER_OF_CHUNKS_TO_COMBINE = ""
# NUMBER_OF_CHUNKS_ALLOWED = ""
# Enable Gemini (default is True)
GEMINI_ENABLED = True|False
# Enable Google Cloud logs (default is True)
GCP_LOG_METRICS_ENABLED = True|False

Frontend .env

BACKEND_API_URL="http://localhost:8000"
BLOOM_URL="neo4j://localhost:7687"
REACT_APP_SOURCES=""
LLM_MODELS=""
ENV=""
TIME_PER_CHUNK=

Rootfolder .env

OPENAI_API_KEY = "sk-[redacted]"
DIFFBOT_API_KEY = "[redacted]"
NEO4J_URI = "[redacted]"
NEO4J_USERNAME = "[redacted]"
NEO4J_PASSWORD = "[redacted]"
BACKEND_API_URL = "http://localhost:8000"

aneeshsathe commented 3 weeks ago

@jexp I saw some previous issues with a similar problem. Was a resolution found?

aneeshsathe commented 3 weeks ago

Solved it.

The following variables must have values in the .env file because no default is set. The numeric values also must not be quoted as strings:

EMBEDDING_MODEL = "openai"
KNN_MIN_SCORE = 0.94
NUMBER_OF_CHUNKS_TO_COMBINE = 20
UPDATE_GRAPH_CHUNKS_PROCESSED = 20

The following is NOT OKAY (numbers quoted as strings):

EMBEDDING_MODEL = "openai"
KNN_MIN_SCORE = "0.94"
NUMBER_OF_CHUNKS_TO_COMBINE = "20"
UPDATE_GRAPH_CHUNKS_PROCESSED = "20"

I believe the default config in docker-compose.yml sets this value as a string, which was causing my initial error, though I might be wrong about this:

KNN_MIN_SCORE = ${KNN_MIN_SCORE-"0.94"}
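
For anyone hitting the same trace: the failing call is `int(os.environ.get('UPDATE_GRAPH_CHUNKS_PROCESSED'))`, which receives `None` when the variable is not set at all. Below is a minimal sketch of a more forgiving lookup. The helper `env_int` is hypothetical (it is not part of the llm-graph-builder codebase), and the fallback of 20 is used only because that is the value in the working .env above. The quote stripping assumes that some env-file loaders pass surrounding quotes through literally, which may or may not apply to a given setup:

```python
import os


def env_int(name: str, default: int) -> int:
    """Read an integer from the environment, falling back to `default`.

    Hypothetical helper for illustration; not part of llm-graph-builder.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default  # variable is not set at all
    # Strip whitespace and any quotes that were kept literally in the value.
    cleaned = raw.strip().strip('"').strip("'")
    if cleaned == "":
        return default  # variable is set but empty, e.g. NAME=""
    return int(cleaned)  # still raises ValueError for non-numeric text


# Example: the lookup works whether the variable is missing or quoted.
os.environ.pop("UPDATE_GRAPH_CHUNKS_PROCESSED", None)
print(env_int("UPDATE_GRAPH_CHUNKS_PROCESSED", 20))
os.environ["UPDATE_GRAPH_CHUNKS_PROCESSED"] = '"20"'
print(env_int("UPDATE_GRAPH_CHUNKS_PROCESSED", 20))
```

A lookup like this would turn a missing or empty setting into a default value rather than a TypeError, while still failing loudly on values that are not numbers.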