hwchase17 / langchain-hub

CharacterTextSplitter(chunk_size=1000, chunk_overlap=0) is creating chunks longer than the specified (1000) #46

Open sirio2013 opened 1 year ago

sirio2013 commented 1 year ago

Dear.

From this piece of code

from langchain.document_loaders import TextLoader
loader = TextLoader('cleaned_catalogue.txt')
documents = loader.load()

from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

I keep getting chunks longer than the specified size. Why?

SDcodehub commented 1 year ago

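If you do not specify a separator, CharacterTextSplitter uses its default, separator: str = '\n\n'. It only splits at that separator, so any stretch of text without a blank line stays in one piece, even when that piece is longer than chunk_size.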

You have to specify the correct separator for your text; otherwise you have to change the text splitter. A plain TextSplitter will work better if you only want to split text.
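To make the separator point concrete, here is a minimal pure-Python sketch of "split on a separator, then merge pieces up to chunk_size". It is a simplified model and not LangChain's actual implementation; the function name split_on_separator and the sample text are hypothetical, made up for illustration only.

```python
from typing import List


def split_on_separator(text: str, separator: str, chunk_size: int) -> List[str]:
    """Simplified model (an assumption, not LangChain's code):
    split on `separator`, then greedily merge pieces up to `chunk_size`.

    A piece that is itself longer than chunk_size is never cut further,
    so it comes out as one oversized chunk.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may already exceed chunk_size on its own
    if current:
        chunks.append(current)
    return chunks


# Hypothetical input: 30 lines of 50 characters joined by single newlines,
# so the text contains no blank line ("\n\n") anywhere.
text = "\n".join("x" * 50 for _ in range(30))

default_chunks = split_on_separator(text, "\n\n", chunk_size=1000)
newline_chunks = split_on_separator(text, "\n", chunk_size=1000)

print(len(default_chunks), max(len(c) for c in default_chunks))  # 1 1529
print(len(newline_chunks), max(len(c) for c in newline_chunks))  # 2 968
```

With '\n\n' the whole text is a single piece, so the only chunk is 1529 characters long. With '\n' every chunk stays under 1000. In LangChain terms this would correspond to passing a separator that actually occurs in your file, for example separator="\n", when constructing CharacterTextSplitter.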