langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

What units to use for threshold amount? #26171

Open mvirag2000 opened 2 months ago

mvirag2000 commented 2 months ago

URL

https://python.langchain.com/v0.2/docs/how_to/semantic-chunker/

Issue with current documentation:

It seems that the threshold amount for breakpoint_threshold_type = "percentile" is expressed out of a hundred, i.e., 85.0 rather than 0.85, but the documentation does not say so. The units are also unclear for the other threshold types, "gradient" and "interquartile".
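To illustrate why the scale matters, here is a minimal sketch that assumes the chunker places a breakpoint wherever the distance between consecutive sentences exceeds a percentile of all distances, with the percentile given on a 0-100 scale (the convention `numpy.percentile` uses). The `percentile` and `breakpoints` helpers and the distance values are hypothetical stand-ins written for this example, not LangChain code.

```python
def percentile(values, q):
    """Linear-interpolation percentile with q on a 0-100 scale."""
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def breakpoints(distances, amount):
    """Indices whose distance exceeds the given percentile (assumed behaviour)."""
    threshold = percentile(distances, amount)
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical cosine distances between consecutive sentences.
distances = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.50, 0.65, 0.90]

intended = breakpoints(distances, 85.0)  # 85th percentile: only the largest gaps
mistaken = breakpoints(distances, 0.85)  # 0.85th percentile: almost every gap

print(intended)       # [8, 9]
print(len(mistaken))  # 9
```

Under these assumptions, passing 0.85 instead of 85.0 sets the threshold near the minimum distance, so nearly every sentence boundary becomes a breakpoint. That would be consistent with very small chunks.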

Idea or request for content:

Also, Semantic Chunker really needs a min and max chunk size. I am getting chunks of a single word, and chunks that exceed the OpenAI limit. Thanks for all the great work on LangChain.
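Until such limits exist in the library, one possible workaround is to post-process the chunker's output. The sketch below is only an illustration: `enforce_chunk_sizes` is a hypothetical helper, sizes are measured in characters rather than tokens, and oversized chunks are cut at fixed character offsets without regard to sentence boundaries.

```python
def enforce_chunk_sizes(chunks, min_size, max_size):
    """Merge chunks shorter than min_size into a neighbour, then hard-split
    any chunk longer than max_size. Sizes are in characters."""
    merged = []
    for chunk in chunks:
        # Join with the previous chunk if either side is too short.
        if merged and (len(chunk) < min_size or len(merged[-1]) < min_size):
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)

    bounded = []
    for chunk in merged:
        # Cut oversized chunks into pieces of at most max_size characters.
        for start in range(0, len(chunk), max_size):
            bounded.append(chunk[start:start + max_size])
    return bounded


# Hypothetical chunker output containing single-word chunks.
chunks = ["Hi", "This is a longer sentence about chunking.", "Ok"]
result = enforce_chunk_sizes(chunks, min_size=10, max_size=30)

for piece in result:
    print(len(piece), repr(piece))
```

The maximum is always respected, but the final piece of a split chunk can still fall below `min_size`. A token-based limit (for example, to stay under an embedding model's input limit) would need a tokenizer in place of `len`.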

tibor-reiss commented 2 months ago

@mvirag2000 What do you think about the linked PR? Regarding your idea/request: I only introduced min_chunk_size, because the maximum chunk size can be adjusted by tuning breakpoint_threshold_amount to a reasonable value.