run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question]: How to handle Chinese text when using "SemanticSplitterNodeParser"? #15151

Closed. 912100012 closed this issue 1 month ago

912100012 commented 2 months ago

Question

It shows an error when I load a document filled with Chinese content, but it works well in English. How can I customize the regex so that Chinese documents work?

(screenshot: a pydantic validation error reporting that the embeddings field received a None value)

dosubot[bot] commented 2 months ago

To customize the regex used by the SemanticSplitterNodeParser so that it works with Chinese documents, you need to provide a custom sentence_splitter function. Here is an example of a sentence_splitter that uses a regex suited to Chinese punctuation:

import re
from typing import List

from llama_index.embeddings.openai import OpenAIEmbedding
# combine_sentences, calculate_cosine_distances, get_indices_above_threshold, and make_chunks
# are used below; they are assumed to live in the same pack module as SemanticChunker.
from llama_index.packs.node_parser_semantic_chunking.base import (
    SemanticChunker,
    combine_sentences,
    calculate_cosine_distances,
    get_indices_above_threshold,
    make_chunks,
)

def chinese_sentence_splitter(text: str) -> List[str]:
    # Define a regex pattern for splitting Chinese sentences
    pattern = re.compile(r'[^!?。]*[!?。]?')
    sentences = pattern.findall(text)
    # Filter out empty strings
    sentences = [sentence for sentence in sentences if sentence]
    return sentences

# Example usage with SemanticChunker: override the splitting step to use the Chinese splitter
class CustomSemanticChunker(SemanticChunker):
    def _split_text(self, text: str) -> List[str]:
        # Split the raw text into Chinese sentences
        sentences = chinese_sentence_splitter(text)
        # Group neighbouring sentences according to buffer_size
        combined_sentences = combine_sentences(
            [{"sentence": s} for s in sentences], self.buffer_size
        )
        # Embed all combined sentences in a single batch
        embeddings = self.embed_model.get_text_embedding_batch(
            [x["combined_sentence"] for x in combined_sentences]
        )
        for i, embedding in enumerate(embeddings):
            combined_sentences[i]["embedding"] = embedding
        # Find semantic breakpoints from the cosine distances between neighbours
        distances = calculate_cosine_distances(combined_sentences)
        indices_above_thresh = get_indices_above_threshold(
            distances, self.breakpoint_percentile_threshold
        )
        return make_chunks(combined_sentences, indices_above_thresh)

# Initialize the custom chunker
custom_chunker = CustomSemanticChunker(
    embed_model=OpenAIEmbedding(),  # Replace with your actual embedding model
    buffer_size=1,
    breakpoint_percentile_threshold=95.0
)

In this example, chinese_sentence_splitter is a custom function that uses a regex pattern to split Chinese text into sentences. The pattern r'[^!?。]*[!?。]?' captures each run of characters up to and including the next fullwidth Chinese punctuation mark: !, ?, or 。.
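
For a quick sanity check of the splitter on its own (using the function defined above; the sample sentence is only illustrative):

# Quick check: the splitter should break on the fullwidth punctuation marks
print(chinese_sentence_splitter("今天天气很好。你吃过饭了吗?我们走吧!"))
# Expected output: ['今天天气很好。', '你吃过饭了吗?', '我们走吧!']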

Make sure to replace OpenAIEmbedding() with the actual embedding model you are using. This should address the pydantic validation error indicating that the embeddings field received a None value, as the custom sentence_splitter will correctly split the Chinese text into sentences, allowing the SemanticChunker to process the text and generate embeddings [1][2][3][4].
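
If you prefer to stay with SemanticSplitterNodeParser itself (as in the title of this issue), it also accepts a sentence_splitter callable, so you can plug the same function in directly. This is only a sketch; the constructor arguments shown are assumptions to check against your installed version:

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

# Use the Chinese-aware splitter instead of the default English sentence tokenizer
parser = SemanticSplitterNodeParser(
    embed_model=OpenAIEmbedding(),  # replace with your embedding model
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    sentence_splitter=chinese_sentence_splitter,
)

documents = SimpleDirectoryReader("./data").load_data()  # "./data" is a placeholder path
nodes = parser.get_nodes_from_documents(documents)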

To continue talking to Dosu, mention @dosu.

912100012 commented 2 months ago

As you suggested, but the line "from llama_index.packs.node_parser_semantic_chunking.base import SemanticChunker" raises an error. Which pip command do I need to fix it? @dosu

dosubot[bot] commented 2 months ago

To resolve the import error for SemanticChunker from llama_index.packs.node_parser_semantic_chunking.base, you need to install the llama pack that provides that module; it is published separately from the core package, so installing llama_index alone is not enough:

pip install llama-index
pip install llama-index-packs-node-parser-semantic-chunking

Additionally, make sure the embedding dependency used in the example is installed:

pip install llama-index-embeddings-openai

This should resolve the import error and allow you to use SemanticChunker [1][2].
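
A minimal way to confirm the installation worked is to try the import again:

# If this import succeeds, the pack is installed and the earlier example should run
from llama_index.packs.node_parser_semantic_chunking.base import SemanticChunker
print(SemanticChunker.__name__)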

To continue talking to Dosu, mention @dosu.