run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Kernel crashes with MetadataExtractor() when calling get_nodes_from_documents() #7314

Closed kevon217 closed 1 year ago

kevon217 commented 1 year ago

Bug Description

My kernel keeps crashing when I use the KeywordExtractor() while extracting nodes from a list of 178 documents. The document nodes don't have a lot of text, so I don't think it's a length issue. I do add my own metadata to the documents before using an LLM to extract keywords. However, it works fine if I run it on a list of 5 documents using either OpenAI or Llama-2.

I've tried it in VS Code and PyCharm, so I don't think it is an IDE issue.

llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo-16k")
metadata_extractor = MetadataExtractor(
    extractors=[KeywordExtractor(keywords=10, llm=llm)],
)
node_parser = SimpleNodeParser.from_defaults(
    metadata_extractor=metadata_extractor,
    include_metadata=True,
    callback_manager=callback_manager,
)
nodes = node_parser.get_nodes_from_documents(documents, show_progress=True)

I also tried it with llm = LLMPredictor(OpenAI(temperature=0.1, model="gpt-3.5-turbo-16k")) but that didn't work either.

Let me know if there's any additional information that is needed to troubleshoot this issue.
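
In case it helps narrow this down, a batching loop along these lines (batch size is arbitrary; documents and node_parser are the objects from the snippet above) can isolate whether a particular slice of the 178 documents triggers the crash:

all_nodes = []
batch_size = 10  # arbitrary; small enough to see which slice fails
for start in range(0, len(documents), batch_size):
    batch = documents[start : start + batch_size]
    print(f"processing documents {start}-{start + len(batch) - 1}")
    all_nodes.extend(node_parser.get_nodes_from_documents(batch, show_progress=True))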

Version

0.8.4

Steps to Reproduce

# imports used below (llama_index 0.8.x paths)
from llama_index import Document
from llama_index.llms import OpenAI
from llama_index.node_parser import SimpleNodeParser
from llama_index.node_parser.extractors import MetadataExtractor, KeywordExtractor

doc_id_col = 'id'
documents=[]
for idx, row in df_studies.iterrows():
    doc = row[text_col[0]]
    meta = {val: row[val] for val in df_studies.columns if val not in text_col}
    document = Document(
        text=doc,
        metadata=meta,
        excluded_embed_metadata_keys=metadata_cols_exclude, # list(meta.keys()),
        excluded_llm_metadata_keys=metadata_cols_exclude, # list(meta.keys()),
        metadata_seperator="::",
        metadata_template="{key}=>{value}",
        text_template="Metadata: {metadata_str}\n-----\nContent: {content}",
    )
    document.id_ = row[doc_id_col]
    documents.append(document)

llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo-16k")
metadata_extractor = MetadataExtractor(
    extractors=[KeywordExtractor(keywords=10, llm=llm)],
)
node_parser = SimpleNodeParser.from_defaults(
    metadata_extractor=metadata_extractor,
    include_metadata=True,
    callback_manager=callback_manager,
)
nodes = node_parser.get_nodes_from_documents(documents, show_progress=True)
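
For a fully self-contained repro, a toy stand-in for df_studies and the column variables (defined before the loop above) could look like the following; the column names here are placeholders, not the real schema:

import pandas as pd

# hypothetical stand-in for the real study table (names are placeholders)
df_studies = pd.DataFrame(
    {
        "id": ["study-001", "study-002"],
        "abstract": ["First study abstract text ...", "Second study abstract text ..."],
        "studyType": ["Epidemiology", "Clinical Trial"],
        "fundingSource": ["NIH-NINDS", "DoD"],
    }
)
text_col = ["abstract"]          # column holding the document text
doc_id_col = "id"
metadata_cols_exclude = ["id"]   # metadata keys excluded from the embed/LLM views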

Relevant Logs/Tracebacks

No response

logan-markewich commented 1 year ago

Is there any other information printed to the console? Or it just straight up crashes?

logan-markewich commented 1 year ago

Just to confirm, you tried running a python file with the code directly?

dosubot[bot] commented 1 year ago

🤖

Hello,

Thank you for providing detailed information about the issue you're experiencing. Based on the context provided, there doesn't seem to be a known issue in the LlamaIndex codebase that causes kernel crashes when using the KeywordExtractor() with a large number of documents. Also, there isn't a specified limit to the number of documents that the KeywordExtractor() can process at once.

However, it's worth noting that the BaseKeywordTableIndex class, which uses the KeywordExtractor(), has a parameter max_keywords_per_chunk that limits the number of keywords extracted per chunk of text. This could indirectly limit the number of documents processed at once if a document is considered a chunk. The actual limit would still be determined by the available system resources (such as memory and CPU), rather than a hard-coded limit in the code.

Given this, it's possible that the kernel crash you're experiencing might be due to resource limitations on your system, especially if you're processing a large number of documents.
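
One quick way to check that hypothesis is to log the process's resident memory around the extraction call, for example with psutil (a third-party package, not part of LlamaIndex; this is only a sketch reusing documents and node_parser from the snippets above):

import psutil  # third-party, assumed installed

proc = psutil.Process()
print(f"RSS before extraction: {proc.memory_info().rss / 1e6:.0f} MB")
nodes = node_parser.get_nodes_from_documents(documents, show_progress=True)
print(f"RSS after extraction:  {proc.memory_info().rss / 1e6:.0f} MB")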

To help us further investigate this issue, could you please provide more information about your system's specifications (e.g., RAM, CPU, etc.)? Also, if there are any error messages or logs when the kernel crashes, those would be very helpful.

Thank you for your patience and cooperation.


kevon217 commented 1 year ago

@logan-markewich, it just crashes. In PyCharm I'm running the code directly in the console, not Jupyter.

Below is the Jupyter kernel crash log from VS Code. PyCharm just gave me exit code -1073741571 (0xC00000FD, i.e. STATUS_STACK_OVERFLOW).

Visual Studio Code - Insiders (1.82.0-insider, undefined, desktop)
Jupyter Extension Version: 2023.7.1002162226. Python Extension Version: 2023.14.0. Platform: win32 (x64).
Workspace folder ~\Desktop\VS_Code_Projects\BRICS\brics-tools, Home = c:\Users\armengolkm

15:02:54.539 [info] User belongs to experiment group 'FastKernelPicker'
15:02:54.539 [info] User belongs to experiment group 'NewRemoteUriStorage'
15:02:54.539 [info] User belongs to experiment group 'PasswordManager'
15:02:54.539 [info] User belongs to experiment group 'NewJupyterSession'
15:02:56.233 [info] Start refreshing Kernel Picker (1692385376233)
15:02:56.249 [info] Using Pylance
15:02:59.527 [info] Starting Kernel startUsingPythonInterpreter, .jvsc74a57bd05196cd751021d6cca4d123eb88a71ebd2a0de9adefb82ede204f0985d1c46689.c:\Users\\Desktop\VS_Code_Projects\BRICS\brics-tools.venv\Scripts\python.exe.c:\Users\\Desktop\VS_Code_Projects\BRICS\brics-tools.venv\Scripts\python.exe.-m#ipykernel_launcher (Python Path: ~\Desktop\VS_Code_Projects\BRICS\brics-tools.venv\Scripts\python.exe, Poetry, .venv, 3.10.2) for '~\Desktop\VS_Code_Projects\BRICS\brics-tools\examples\data_repository\study_semantic_search.ipynb' (disableUI=true)
15:03:12.688 [info] Start refreshing Interpreter Kernel Picker
15:03:16.093 [info] Process Execution: ~\Desktop\VS_Code_Projects\BRICS\brics-tools.venv\Scripts\python.exe -m pip list
15:03:16.654 [info] Process Execution: ~\Desktop\VS_Code_Projects\BRICS\brics-tools.venv\Scripts\python.exe -c "import ipykernel; print(ipykernel.__version__); print("5dc3a68c-e34e-4080-9c3e-2a532b2ccb4d"); print(ipykernel.__file__)"
15:03:21.058 [warn] Failed to get activated env vars for ~\AppData\Local\Programs\Python\Python311\python.exe in 4399ms
15:03:21.586 [info] Process Execution: ~\AppData\Local\Programs\Python\Python311\python.exe -c "import site;print("USER_BASE_VALUE");print(site.USER_SITE);print("USER_BASE_VALUE");"
15:03:24.085 [info] Process Execution: ~\AppData\Local\Programs\Python\Python311\python.exe c:\Users\.vscode-insiders\extensions\ms-toolsai.jupyter-2023.7.1002162226-win32-x64\pythonFiles\vscode_datascience_helpers\kernel_interrupt_daemon.py --ppid 2852
    cwd: ~.vscode-insiders\extensions\ms-toolsai.jupyter-2023.7.1002162226-win32-x64\pythonFiles\vscode_datascience_helpers
15:03:25.540 [info] ipykernel version & path 6.25.1, ~\Desktop\VS_Code_Projects\BRICS\brics-tools.venv\lib\site-packages\ipykernel\__init__.py for ~\Desktop\VS_Code_Projects\BRICS\brics-tools.venv\Scripts\python.exe
15:03:27.089 [info] Process Execution: ~\Desktop\VS_Code_Projects\BRICS\brics-tools.venv\Scripts\python.exe -m ipykernel_launcher --ip=127.0.0.1 --stdin=9003 --control=9001 --hb=9000 --Session.signature_scheme="hmac-sha256" --Session.key=b"a02be18d-3f69-44c5-a15c-6439cdef1cae" --shell=9002 --transport="tcp" --iopub=9004 --f=c:\Users\\AppData\Roaming\jupyter\runtime\kernel-v2-28527T0hV6Bhmbbu.json
    cwd: ~\Desktop\VS_Code_Projects\BRICS\brics-tools\examples\data_repository
15:03:42.042 [warn] StdErr from Kernel Process c:\Users\\Desktop\VS_Code_Projects\BRICS\brics-tools.venv\lib\site-packages\traitlets\traitlets.py:2548: FutureWarning: Supporting extra quotes around strings is deprecated in traitlets 5.0. You can use 'hmac-sha256' instead of '"hmac-sha256"' if you require traitlets >=5. warn(
15:03:42.043 [warn] StdErr from Kernel Process c:\Users\\Desktop\VS_Code_Projects\BRICS\brics-tools.venv\lib\site-packages\traitlets\traitlets.py:2499: FutureWarning: Supporting extra quotes around Bytes is deprecated in traitlets 5.0. Use 'a02be18d-3f69-44c5-a15c-6439cdef1cae' instead of 'b"a02be18d-3f69-44c5-a15c-6439cdef1cae"'. warn(
15:03:59.051 [info] End refreshing Kernel Picker (1692385376233)

logan-markewich commented 1 year ago

Can you open a regular terminal (no pycharm/vscode/jupyter) and just run the code like python file_name.py ?

kevon217 commented 1 year ago

> Can you open a regular terminal (no pycharm/vscode/jupyter) and just run the code like python file_name.py ?

@logan-markewich, no luck. I'll be debugging some more tomorrow morning and will let you know if I get anywhere.

kevon217 commented 1 year ago

@logan-markewich, so I finally got the KeywordExtractor() to proceed by excluding metadata as below:


# additional imports for this variant (llama_index 0.8.x paths)
import tiktoken
from llama_index import ServiceContext
from llama_index.text_splitter import TokenTextSplitter

llm = OpenAI(temperature=0.1, model=llm_model_name)

service_context = ServiceContext.from_defaults(llm=llm, context_window=4096)

text_splitter = TokenTextSplitter(
    separator=" ",
    chunk_size=2048,
    chunk_overlap=20,
    backup_separators=["\n"],
    tokenizer=tiktoken.encoding_for_model(llm_model_name).encode,
)

metadata_extractor = MetadataExtractor(
    extractors=[KeywordExtractor(keywords=10, llm=llm)],
)

node_parser = SimpleNodeParser(
    text_splitter=text_splitter,
    include_metadata=False,
    metadata_extractor=metadata_extractor,
)
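
With that parser, node extraction then runs with the same call as before:

nodes = node_parser.get_nodes_from_documents(documents, show_progress=True)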

I'm not sure if it had something to do with the metadata string's text, but I didn't need it for extracting keywords with an LLM at this step. For reference, this is what doc.get_content(metadata_mode=MetadataMode.LLM) returns for one of the documents:

Metadata: title=>Transforming Research and Clinical Knowledge in Traumatic Brain Injury (TRACK-TBI) - Adult::studyType=>Epidemiology::fundingSource=>NIH-NINDS::keywordSet=>[]::submissionType=>['Imaging', 'Genomics', 'Clinical Assessment']::goals=>Specific Aim 1: To create a widely accessible, comprehensive TBI Information Commons that integrates clinical, imaging, proteomic, genomic, and outcome biomarkers from subjects across the age and injury spectra, and provides analytic tools and resources to support TBI research.  

Specific Aim 2: To validate imaging, proteomic, and genetic biomarkers that will improve classification of TBI, permit appropriate selection and stratification of patients for clinical trials, and contribute to the development of a new taxonomy for TBI. We hypothesize that validated imaging, proteomic, and genetic biomarkers will
permit improved patient classification, beyond traditional categories of mild, moderate and severe TBI.   

Specific Aim 3: To evaluate a flexible outcome assessment battery in adult patients comprising a broad range of TBI-CDEs that enables assessment of multiple outcome domains across all phases of recovery and at all levels of TBI severity.

Specific Aim 4: To determine which tests, treatments, and services are effective and appropriate for which TBI patients, and use this evidence to recommend practices that offer the best value.  ::numberOfSubjects=>3000.0::fundingAmount=>nan
-----
Content: Effective treatment of traumatic brain injury (TBI) remains one of the greatest unmet needs in public health. Each year in the US, at least 1.7 million people suffer TBI; an estimated 3.2 to 5.3 million people live with the long-term physical, cognitive, and psychological health disabilities of TBI, with annual direct and indirect costs estimated at over $60 billion. The unique public-private partnership of investigators, philanthropy, and industry leaders brought together in the multicenter Transforming Research and Clinical Knowledge in Traumatic Brain Injury (TRACK-TBI) proposal share a mission to accelerate clinical research in TBI. The goal is to create a large, high quality TBI database that integrates clinical, imaging, proteomic, genomic, and outcome biomarkers, and provides analytic tools and resources to establish more precise methods for TBI diagnosis and prognosis, improve outcome assessment, and compare the effectiveness and costs of tests, treatments, and services. The investigators hypothesize that this approach will permit better characterization and stratification of patients, allowing meaningful comparisons of treatments and outcomes, and thereby improving the next generation of clinical trials.
Specific Aim 1. To create a widely accessible, comprehensive TBI Information Commons that integrates clinical, imaging, proteomic, genomic, and outcome biomarkers from subjects across the age and injury spectra, and provides analytic tools and resources to support TBI research.  
Specific Aim 2. To validate imaging, proteomic, and genetic biomarkers that will improve classification of TBI, permit appropriate choice and stratification of patients for clinical trials, and contribute to the development of a new taxonomy for TBI. 
Specific Aim 3. To evaluate a flexible outcome assessment battery comprised of a broad range of TBI common data elements that enables assessment of multiple outcome domains across all phases of recovery and at all levels of TBI severity. 
Specific Aim 4. To determine which tests, treatments, and services are effective and appropriate for which TBI patients, and use this evidence to recommend practices that offer the best value. 
The project will directly impact public health by creating an open-access Information Commons populated with robust Common Data Elements that will make international research collaboration a reality. Detailed clinical data on 3,000 subjects (11 sites) across the injury spectrum, along with CT/MRI imaging, blood biospecimens, and detailed outcomes, will be collected and analyzed, permitting the identification/validation of biomarkers, and identification of structural abnormalities that may be predictive of outcomes, making strides toward a new taxonomy for TBI. The infrastructure of integrated databases and imaging and biosample repositories will create a high quality, legacy database for current and future generations of international researchers. 

logan-markewich commented 1 year ago

This is fixed, as metadata extractors now have an option to pick which metadata to include:

KeywordExtractor(..., metadata_mode=MetadataMode.NONE)
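
For anyone landing here, a minimal sketch of that fix applied to the original setup (same llm and documents objects as the snippets above; the MetadataMode import path assumes the 0.8.x layout) might look like:

from llama_index.schema import MetadataMode

metadata_extractor = MetadataExtractor(
    extractors=[
        # extract keywords from the node text only, ignoring the long metadata string
        KeywordExtractor(keywords=10, llm=llm, metadata_mode=MetadataMode.NONE),
    ],
)
node_parser = SimpleNodeParser.from_defaults(
    metadata_extractor=metadata_extractor,
    include_metadata=True,
)
nodes = node_parser.get_nodes_from_documents(documents, show_progress=True)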