run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question]: How to handle Chinese text when using "SemanticSplitterNodeParser"? #15151

Closed. 912100012 closed this issue 1 month ago

912100012 commented 2 months ago

Question

It shows an error when I load a document filled with Chinese content, but it works well in English. How can I customize the regex so that Chinese documents work?

(screenshot: a pydantic validation error reporting that the embeddings field received a None value)

dosubot[bot] commented 2 months ago

To customize the regex used by the SemanticSplitterNodeParser so that it works with Chinese documents, you need to provide a custom sentence_splitter function. Here is an example of a sentence_splitter that uses a regex suited to Chinese punctuation:

import re
from typing import List

from llama_index.embeddings.openai import OpenAIEmbedding
# combine_sentences, calculate_cosine_distances, get_indices_above_threshold, and make_chunks
# are used below; they are assumed to live in the same pack module as SemanticChunker.
from llama_index.packs.node_parser_semantic_chunking.base import (
    SemanticChunker,
    combine_sentences,
    calculate_cosine_distances,
    get_indices_above_threshold,
    make_chunks,
)

def chinese_sentence_splitter(text: str) -> List[str]:
    # Define a regex pattern for splitting Chinese sentences
    pattern = re.compile(r'[^!?。]*[!?。]?')
    sentences = pattern.findall(text)
    # Filter out empty strings
    sentences = [sentence for sentence in sentences if sentence]
    return sentences

# Example usage with SemanticChunker: override the splitting step to use the Chinese splitter
class CustomSemanticChunker(SemanticChunker):
    def _split_text(self, text: str) -> List[str]:
        # Split the raw text into Chinese sentences
        sentences = chinese_sentence_splitter(text)
        # Group neighbouring sentences according to buffer_size
        combined_sentences = combine_sentences(
            [{"sentence": s} for s in sentences], self.buffer_size
        )
        # Embed all combined sentences in a single batch
        embeddings = self.embed_model.get_text_embedding_batch(
            [x["combined_sentence"] for x in combined_sentences]
        )
        for i, embedding in enumerate(embeddings):
            combined_sentences[i]["embedding"] = embedding
        # Find semantic breakpoints from the cosine distances between neighbours
        distances = calculate_cosine_distances(combined_sentences)
        indices_above_thresh = get_indices_above_threshold(
            distances, self.breakpoint_percentile_threshold
        )
        return make_chunks(combined_sentences, indices_above_thresh)

# Initialize the custom chunker
custom_chunker = CustomSemanticChunker(
    embed_model=OpenAIEmbedding(),  # Replace with your actual embedding model
    buffer_size=1,
    breakpoint_percentile_threshold=95.0
)

In this example, chinese_sentence_splitter is a custom function that uses a regex pattern to split Chinese text into sentences. The pattern r'[^!?。]*[!?。]?' captures each run of characters up to and including the next fullwidth Chinese punctuation mark: !, ?, or 。.
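
For a quick sanity check of the splitter on its own (using the function defined above; the sample sentence is only illustrative):

# Quick check: the splitter should break on the fullwidth punctuation marks
print(chinese_sentence_splitter("今天天气很好。你吃过饭了吗?我们走吧!"))
# Expected output: ['今天天气很好。', '你吃过饭了吗?', '我们走吧!']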

Make sure to replace OpenAIEmbedding() with the actual embedding model you are using. This should address the pydantic validation error indicating that the embeddings field received a None value, as the custom sentence_splitter will correctly split the Chinese text into sentences, allowing the SemanticChunker to process the text and generate embeddings [1][2][3][4].
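
If you prefer to stay with SemanticSplitterNodeParser itself (as in the title of this issue), it also accepts a sentence_splitter callable, so you can plug the same function in directly. This is only a sketch; the constructor arguments shown are assumptions to check against your installed version:

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

# Use the Chinese-aware splitter instead of the default English sentence tokenizer
parser = SemanticSplitterNodeParser(
    embed_model=OpenAIEmbedding(),  # replace with your embedding model
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    sentence_splitter=chinese_sentence_splitter,
)

documents = SimpleDirectoryReader("./data").load_data()  # "./data" is a placeholder path
nodes = parser.get_nodes_from_documents(documents)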

To continue talking to Dosu, mention @dosu.

912100012 commented 2 months ago

As you suggested, but the line "from llama_index.packs.node_parser_semantic_chunking.base import SemanticChunker" raises an error. Which pip command do I need to fix it? @dosu

dosubot[bot] commented 2 months ago

To resolve the import error for SemanticChunker from llama_index.packs.node_parser_semantic_chunking.base, you need to install the llama pack that provides that module; it is published separately from the core package, so installing llama_index alone is not enough:

pip install llama-index
pip install llama-index-packs-node-parser-semantic-chunking

Additionally, make sure the embedding dependency used in the example is installed:

pip install llama-index-embeddings-openai

This should resolve the import error and allow you to use SemanticChunker [1][2].
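
A minimal way to confirm the installation worked is to try the import again:

# If this import succeeds, the pack is installed and the earlier example should run
from llama_index.packs.node_parser_semantic_chunking.base import SemanticChunker
print(SemanticChunker.__name__)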

To continue talking to Dosu, mention @dosu.