run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Problems with SentenceWindowNodeParser #11844

Open Steinkreis opened 5 months ago

Steinkreis commented 5 months ago

Question

When I try to run an ingestion pipeline with the SentenceWindowNodeParser, I get the following error:

Traceback (most recent call last):
  File "/home/ubuntu/projects/wimi/src/llama_index/ingest_sentence_window.py", line 38, in <module>
    pipeline.run(documents=documents, num_workers=4)
  File "/home/ubuntu/miniconda3/envs/wimi/lib/python3.10/site-packages/llama_index/core/ingestion/pipeline.py", line 738, in run
    nodes_parallel = p.starmap(
  File "/home/ubuntu/miniconda3/envs/wimi/lib/python3.10/multiprocessing/pool.py", line 375, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/home/ubuntu/miniconda3/envs/wimi/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
  File "/home/ubuntu/miniconda3/envs/wimi/lib/python3.10/multiprocessing/pool.py", line 540, in _handle_tasks
    put(task)
  File "/home/ubuntu/miniconda3/envs/wimi/lib/python3.10/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/ubuntu/miniconda3/envs/wimi/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'split_by_sentence_tokenizer.<locals>.split'

Code (I implemented a custom reader that returns list[Document]):

from qdrant_client import QdrantClient
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core.node_parser import SentenceSplitter, SentenceWindowNodeParser
from llama_index.core.ingestion import IngestionPipeline
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from wimi_commons.llama_index.documents import JSONReader, DummyTextCleaner
from pathlib import Path

DATA_DIR = str(Path(__file__).parents[2] / "data")

if __name__ == "__main__":
    client = QdrantClient(host="localhost", grpc_port=6334, prefer_grpc=True)
    vector_store = QdrantVectorStore(client=client, collection_name="intfloat-multilingual-e5-large-collection-llama-index-sentence-window")
    embed_model = HuggingFaceEmbedding(model_name="intfloat/multilingual-e5-large")
    documents = JSONReader(
        file_path=f"{DATA_DIR}/solr_with_content_llama_index.json",
        content_key="text"
    ).load_data()
    node_parser = SentenceWindowNodeParser.from_defaults(
        window_size=3,
        window_metadata_key="window",
        original_text_metadata_key="original_text",
    )
    pipeline = IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=2000, chunk_overlap=0),
            node_parser,
            embed_model,
        ],
        vector_store=vector_store,
    )
    pipeline.run(documents=documents, num_workers=4)
    pipeline.persist(f"{DATA_DIR}/pipeline_storage")

I don't know what I'm doing wrong here. Any suggestions?

dosubot[bot] commented 5 months ago

Hey @Steinkreis, great to see you back! Hope everything's been going well on your end.

The error you're encountering, AttributeError: "Can't pickle local object 'split_by_sentence_tokenizer.<locals>.split'", typically occurs when Python's pickle module attempts to serialize an object that is defined inside a function, or is a lambda. Such local objects cannot be pickled because they are not accessible at the top level of their module, so pickle has no importable name by which to reference them.
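This limitation is easy to reproduce with plain pickle, no LlamaIndex involved; a minimal sketch (function names here are illustrative, not LlamaIndex APIs):

```python
import pickle

def make_splitter():
    # Nested function: pickle serializes functions by qualified name,
    # and "make_splitter.<locals>.split" is not importable, so this fails
    # with the same AttributeError seen in the traceback above.
    def split(text):
        return text.split(". ")
    return split

def module_level_split(text):
    # Defined at module top level, so pickle can reference it by name.
    return text.split(". ")

try:
    pickle.dumps(make_splitter())
except AttributeError as exc:
    print(f"nested function fails: {exc}")

pickle.dumps(module_level_split)  # a top-level function pickles fine
```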

In the context of the SentenceWindowNodeParser class from the LlamaIndex repository, the issue stems from how the sentence_splitter is set up. The sentence_splitter is a callable that defaults to split_by_sentence_tokenizer, a factory function that returns a nested split function; the traceback shows exactly that local function (split_by_sentence_tokenizer.<locals>.split) failing to pickle. Local functions cannot be pickled, which is necessary for the multiprocessing used when an ingestion pipeline runs with multiple workers.

To resolve this issue, ensure that any function passed as sentence_splitter or returned by split_by_sentence_tokenizer does not contain local functions or lambda expressions. Instead, define any required function at the module level. If split_by_sentence_tokenizer is indeed returning a function that includes local definitions, you would need to refactor it to avoid this pattern.

Without the exact implementation of split_by_sentence_tokenizer, I cannot provide a specific solution, but the general approach would be to ensure that all functions involved in the sentence_splitter are defined at the top level of a module and not as local functions within another function.
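If refactoring the library isn't an option, one workaround sketch is to pass your own module-level splitter, assuming (as described above) that SentenceWindowNodeParser accepts a sentence_splitter callable; the regex splitter below is a naive stand-in, not the library's tokenizer:

```python
import re

def regex_sentence_splitter(text: str) -> list[str]:
    # Naive stand-in for the default sentence tokenizer: split after
    # sentence-ending punctuation. As a top-level function, it pickles.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

# Hypothetical usage, mirroring the snippet in the question:
# node_parser = SentenceWindowNodeParser.from_defaults(
#     sentence_splitter=regex_sentence_splitter,
#     window_size=3,
#     window_metadata_key="window",
#     original_text_metadata_key="original_text",
# )
```

Another avenue worth trying: omit num_workers from pipeline.run(...), which should run the transformations sequentially and skip the multiprocessing (and therefore the pickling) entirely, at the cost of speed.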

I hope this helps! If you have any further questions or need more clarification, feel free to ask.

ksjadeja commented 5 months ago

@Steinkreis @logan-markewich I am facing a similar issue but with SemanticSplitterNodeParser. Is this going to get fixed? I have a good stackoverflow link that mentions how to deal with local functions. Maybe that can be implemented. https://stackoverflow.com/questions/72766345/attributeerror-cant-pickle-local-object-in-multiprocessing/72776044#72776044

logan-markewich commented 5 months ago

@ksjadeja I definitely welcome a PR

ksjadeja commented 5 months ago

@logan-markewich, I am willing to work on it. However, I am not sure that the local function is the only reason SemanticSplitterNodeParser is failing, though from the error it appears to be, and I am willing to push that change. Also, should I open a new issue for that PR, or update here on the existing issue?