Open do-me opened 3 months ago
Thank you for this - all of these sound good.
I haven't had time to improve the tool recently but I'd love help on it if you're up for it.
On Tue, Aug 27, 2024 at 5:59 AM Dominik Weckmüller @.***> wrote:
Hey, super useful tool!
There's been some development in the chunking community. If you'd like to keep your app up to date, here are a few suggestions. Also, considering that all of the options struggle with correctly identifying sentence boundaries (quickly tested with some texts) and tend to chop off parts, it would be nice to have more choice.
Python
- https://github.com/benbrandt/text-splitter - Python API for a Rust package, at some point also available in JS via WebAssembly. It's my personal preference at the moment; it yields "human-like" chunks
- https://github.com/umarbutler/semchunk - claims to be faster, didn't test enough yet to evaluate
JS
- https://github.com/askorama/chunker - didn't test yet, looks like a very simplistic tool, no documentation afaik
- https://gist.github.com/hanxiao/3f60354cf6dc5ac698bc9154163b4e6a - JinaAI tokenizer. See LinkedIn post here https://www.linkedin.com/posts/hxiao87_based-%F0%9D%90%92%F0%9D%90%9E%F0%9D%90%A6%F0%9D%90%9A%F0%9D%90%A7%F0%9D%90%AD%F0%9D%90%A2%F0%9D%90%9C-%F0%9D%90%9C%F0%9D%90%A1%F0%9D%90%AE%F0%9D%90%A7%F0%9D%90%A4%F0%9D%90%A2%F0%9D%90%A7%F0%9D%90%A0-activity-7230113200833253376-66b1 and read first comment for some exceptions; didn't test yet.
Maybe another idea would be to include the option to allow for any regex like we did in SemanticFinder https://github.com/do-me/SemanticFinder. I tried to come up with a good regex for sentence boundaries but it's incredibly hard.
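To illustrate the "any regex" idea, here is a minimal sketch in Python. It is not how SemanticFinder or ChunkViz implement it; the function name `split_by_regex` and the pattern `NAIVE_SENTENCE_BOUNDARY` are made up for this example. It splits text at every match of a user-supplied pattern, and the sample shows why a simple sentence-boundary regex falls short: abbreviations like "Dr." get cut off.

```python
import re


def split_by_regex(text: str, pattern: str) -> list[str]:
    """Split text into chunks at every match of a user-supplied regex."""
    pieces = re.split(pattern, text)
    # re.split can return None or empty strings (e.g. when the pattern
    # contains capture groups), so drop those and trim whitespace.
    return [p.strip() for p in pieces if p and p.strip()]


# Hypothetical, deliberately naive pattern: split on whitespace that
# follows ".", "!" or "?".
NAIVE_SENTENCE_BOUNDARY = r"(?<=[.!?])\s+"

text = "Dr. Smith arrived at 5 p.m. He was late. Was it traffic?"
chunks = split_by_regex(text, NAIVE_SENTENCE_BOUNDARY)

for chunk in chunks:
    print(repr(chunk))
```

With this pattern "Dr." ends up as its own chunk, which is the kind of chopped-off part mentioned above. Letting users pass their own pattern at least lets them tune the boundary rule to their texts.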
View it on GitHub: https://github.com/gkamradt/ChunkViz/issues/4
-- Greg Kamradt Twitter https://twitter.com/GregKamradt, LinkedIn https://www.linkedin.com/in/gregkamradt/