To customize the regex in the SemanticSplitterNodeParser to work with Chinese documents, you need to modify the sentence_splitter function. Here is an example of how you can customize the sentence_splitter to use a regex that works with Chinese documents:
import re
from typing import List

from llama_index.embeddings.openai import OpenAIEmbedding
# SemanticChunker plus the helper utilities used below; the helpers are
# assumed to be importable from the same base module of the pack.
from llama_index.packs.node_parser_semantic_chunking.base import (
    SemanticChunker,
    calculate_cosine_distances,
    combine_sentences,
    get_indices_above_threshold,
    make_chunks,
)


def chinese_sentence_splitter(text: str) -> List[str]:
    # Regex for splitting Chinese sentences: match runs of text ending in a
    # sentence-ending mark (full-width 。！？ or half-width ! ?).
    pattern = re.compile(r'[^!?。！？]*[!?。！？]?')
    sentences = pattern.findall(text)
    # Filter out empty or whitespace-only strings
    return [sentence for sentence in sentences if sentence.strip()]


# Example usage with SemanticChunker
class CustomSemanticChunker(SemanticChunker):
    def _split_text(self, text: str) -> List[str]:
        sentences = chinese_sentence_splitter(text)
        combined_sentences = combine_sentences(
            [{"sentence": s} for s in sentences], self.buffer_size
        )
        embeddings = self.embed_model.get_text_embedding_batch(
            [x["combined_sentence"] for x in combined_sentences]
        )
        for i, embedding in enumerate(embeddings):
            combined_sentences[i]["embedding"] = embedding
        distances = calculate_cosine_distances(combined_sentences)
        indices_above_thresh = get_indices_above_threshold(
            distances, self.breakpoint_percentile_threshold
        )
        return make_chunks(combined_sentences, indices_above_thresh)


# Initialize the custom chunker
custom_chunker = CustomSemanticChunker(
    embed_model=OpenAIEmbedding(),  # Replace with your actual embedding model
    buffer_size=1,
    breakpoint_percentile_threshold=95.0,
)
In this example, chinese_sentence_splitter is a custom function that uses a regex pattern to split Chinese text into sentences. The pattern r'[^!?。！？]*[!?。！？]?' matches sequences of characters that end with a sentence-ending punctuation mark such as 。, ！, or ？ (the half-width ! and ? are covered as well).
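As a quick sanity check, you can run the splitter on a short passage and print the pieces it produces (the sample text below is just an illustrative placeholder, not text from the issue):

# Try the splitter on a small Chinese sample
sample_text = "今天天气很好。我们去公园散步吧！你想一起来吗？"
for sentence in chinese_sentence_splitter(sample_text):
    print(sentence)
# Each printed line should be one sentence ending with 。, ！, or ？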
Make sure to replace OpenAIEmbedding() with the actual embedding model you are using. This should address the pydantic validation error indicating that the embeddings field received a None value, as the custom sentence_splitter will correctly split the Chinese text into sentences, allowing the SemanticChunker to process the text and generate embeddings [1][2][3][4].
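If you are using SemanticSplitterNodeParser from llama_index.core directly (as in the original question) rather than the SemanticChunker pack, you may not need to subclass anything. A minimal sketch, assuming your installed llama-index version exposes a sentence_splitter parameter on SemanticSplitterNodeParser:

from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

# Sketch: plug the custom splitter in directly; assumes the installed
# llama-index version accepts a sentence_splitter callable here.
node_parser = SemanticSplitterNodeParser.from_defaults(
    embed_model=OpenAIEmbedding(),  # replace with your actual embedding model
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    sentence_splitter=chinese_sentence_splitter,
)
# documents: whatever Chinese documents you loaded with your reader
nodes = node_parser.get_nodes_from_documents(documents)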
As you told me, but the line "from llama_index.packs.node_parser_semantic_chunking.base import SemanticChunker" raises an error. How can I solve this with pip? @dosu
To resolve the import error for SemanticChunker from llama_index.packs.node_parser_semantic_chunking.base using pip, you need to install the llama_index package. You can do this by running the following command:

pip install llama_index

Additionally, ensure you have the necessary dependencies by installing:

pip install llama-index-embeddings-openai

This should resolve the import error and allow you to use SemanticChunker [1][2].
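If the import error persists after those installs, note that the module path llama_index.packs.node_parser_semantic_chunking is shipped as a separate llama pack distribution. The package name below follows the usual llama-index-packs-* naming convention, so double-check it on PyPI if pip cannot find it:

pip install llama-index-packs-node-parser-semantic-chunking

After installing, a quick import check confirms the pack is available:

# Verify the pack is importable
from llama_index.packs.node_parser_semantic_chunking.base import SemanticChunker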
Question
It shows an error when I load a document filled with Chinese content, but it works well with English. How can I customize the regex to make Chinese documents work?