gkamradt / langchain-tutorials

Overview and tutorial of the LangChain Library
6.63k stars 1.92k forks

TextSplitter in different languages #25

Closed goldengrape closed 1 year ago

goldengrape commented 1 year ago

https://github.com/gkamradt/langchain-tutorials/blob/main/data_generation/5%20Levels%20Of%20Summarization%20-%20Novice%20To%20Expert.ipynb

For the summarization methods above level 3, the best practice is to use TokenTextSplitter rather than RecursiveCharacterTextSplitter, because the number of tokens produced by a string of a given character length varies greatly from language to language.

from langchain.text_splitter import RecursiveCharacterTextSplitter, TokenTextSplitter

text_splitter_by_char = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=10000, chunk_overlap=500)  # chunk size measured in characters
text_splitter_by_token = TokenTextSplitter(chunk_size=3000, chunk_overlap=100)  # chunk size measured in tokens

If this is not taken into account, requests are likely to exceed the model's maximum token count when processing text in multiple languages.

I have tested the number of tokens used for the same family of patents, in different languages:

- English (US10901237B2): 21,823 tokens (100%)
- Simplified Chinese (CN112904591A): 30,901 tokens (142%)
- Traditional Chinese (TW201940135A): 36,530 tokens (167%)
- Korean (KR20190089752A): 42,644 tokens (195%)
- Japanese (JP2019128599A): 51,430 tokens (236%)
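The percentages can be recomputed from the raw counts; a small script using the figures measured above:

```python
# Token counts measured above for the same patent family in different languages.
counts = {
    "English (US10901237B2)": 21823,
    "Simplified Chinese (CN112904591A)": 30901,
    "Traditional Chinese (TW201940135A)": 36530,
    "Korean (KR20190089752A)": 42644,
    "Japanese (JP2019128599A)": 51430,
}
base = counts["English (US10901237B2)"]
for name, n in counts.items():
    # Percentage relative to the English text of the same patent
    print(f"{name}: {n} tokens ({n / base:.0%})")
```

A character-based chunk_size tuned for English would overshoot the token limit by more than 2x on the Japanese text.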

gkamradt commented 1 year ago

Thank you for this! I'll take this as a best practice

promoge commented 1 year ago

I had some problems using RecursiveCharacterTextSplitter where it exceeded Python's maximum recursion depth. This happened when my document set exceeded a small threshold of tokens (about 2000 in total). Has anybody else experienced similar issues?
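One way this kind of recursion can blow up is when the separator list (e.g. only ["\n\n", "\n"]) never manages to break a piece below chunk_size. Below is a minimal illustrative sketch of the recursive strategy (the function name split_text is hypothetical; this is not LangChain's actual implementation) with a hard character-level fallback so that text without matching separators cannot recurse indefinitely:

```python
def split_text(text, separators, chunk_size):
    """Sketch of recursive splitting: try each separator in order,
    recurse on the pieces, and fall back to a hard character cut
    when no separator can reduce the text below chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        if sep and sep in text:
            chunks = []
            for part in text.split(sep):
                chunks.extend(split_text(part, separators, chunk_size))
            return [c for c in chunks if c]
    # Hard fallback: no separator matched, so cut every chunk_size characters.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Without the fallback branch, a long run of text containing none of the listed separators would never shrink, which is consistent with the recursion-depth symptom described above.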