umarbutler / semchunk

A fast and lightweight pure Python library for splitting text into semantically meaningful chunks.
MIT License

Request: offer generators and async generators #6

Open Goldziher opened 5 months ago

Goldziher commented 5 months ago

Hi there!

Thanks for this neat library. I'm giving it a go.

It would be great to have two variants of the chunkerify function that return a generator and an async generator respectively, as well as a version that is async.

Use cases:

The simplest (but non-performant) option for implementing async logic is to execute the sync version using something like anyio.to_thread.run_sync: https://anyio.readthedocs.io/en/stable/threads.html.

umarbutler commented 5 months ago

Offering a generator chunker and perhaps even support for lazy chunking is something I’m open to. I’ll start work on that shortly.

With regard to offering an asynchronous generator, I’m not too sure what value there would be in that when there isn’t anything I’m aware of in my chunker that is IO-bound. And seeing as synchronous functions and generators are already callable within asynchronous environments, making chunkers asynchronous would only seem to add more overhead. If there’s something I’m missing here, however, please let me know.

Goldziher commented 5 months ago

Using an async iterator / generator allows for streaming the source rather than loading it all into memory.
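A minimal sketch of that streaming idea, using only the standard library. Note that `toy_chunker`, `fake_source`, `stream_chunks`, and `collect` are hypothetical names invented for illustration and are not part of semchunk; `toy_chunker` merely stands in for whatever synchronous chunker `chunkerify` returns.

```python
import asyncio
from typing import AsyncIterator, Callable, List


def toy_chunker(text: str) -> List[str]:
    """Hypothetical stand-in for a synchronous chunker: one chunk per sentence."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]


async def fake_source() -> AsyncIterator[str]:
    """Hypothetical async source that yields one document at a time."""
    for doc in ("First doc. Second sentence.", "Another doc."):
        await asyncio.sleep(0)  # pretend we awaited a network or disk read
        yield doc


async def stream_chunks(
    texts: AsyncIterator[str],
    chunker: Callable[[str], List[str]],
) -> AsyncIterator[str]:
    """Yield chunks lazily, holding only one source document in memory at a time."""
    async for text in texts:
        for chunk in chunker(text):
            yield chunk


async def collect() -> List[str]:
    """Drain the stream into a list (only for demonstration)."""
    return [chunk async for chunk in stream_chunks(fake_source(), toy_chunker)]


if __name__ == "__main__":
    print(asyncio.run(collect()))
```

In this sketch the chunking itself is still synchronous; only the retrieval of each document is awaited, so the memory benefit comes from the wrapper rather than from an async chunker.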

umarbutler commented 5 months ago

So you imagine it being used to handle inputs that are async iterators, is that right? For example:

chunker = chunkerify(...)
texts = my_async_text_generator()

# Normally you'd do this:
chunks = [chunker(text) async for text in texts]

# But you'd like to be able to do this(?)
chunks = await chunker(texts)

Goldziher commented 5 months ago

For a stream, I would use an async iterator (e.g. an async generator).

But using async for chunking is purely for IO-bound situations, such as chunking inside an API. The advantage of

chunks = await chunker(texts)

is that this would be run in a worker thread rather than the main thread, and thus would not block the execution of other async tasks.

I can fake it by doing something like:

await anyio.to_thread.run_sync(chunker, texts)

But this is pretty suboptimal since it slows execution quite a bit.
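For reference, the thread-offloading workaround can be sketched with only the standard library, where `asyncio.to_thread` (Python 3.9+) plays a role similar to `anyio.to_thread.run_sync`. The names `slow_chunker`, `chunk_in_thread`, and `main` are hypothetical, and `slow_chunker` only simulates blocking work with `time.sleep`; it is not semchunk's chunker.

```python
import asyncio
import time
from typing import List, Tuple


def slow_chunker(text: str) -> List[str]:
    """Hypothetical stand-in for a slow synchronous chunker."""
    time.sleep(0.05)  # simulates blocking work
    return text.split()


async def chunk_in_thread(text: str) -> List[str]:
    """Run the synchronous chunker in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(slow_chunker, text)


async def main() -> Tuple[List[str], int]:
    ticks = 0

    async def heartbeat() -> None:
        # Counts how often the event loop gets to run while chunking is in progress.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.005)

    beat = asyncio.create_task(heartbeat())
    chunks = await chunk_in_thread("a b c")
    beat.cancel()
    # Wait for the cancellation to finish without raising CancelledError here.
    await asyncio.gather(beat, return_exceptions=True)
    return chunks, ticks


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The heartbeat keeps ticking while the chunker runs, which shows that the event loop is not blocked. On standard CPython, though, a worker thread does not make pure-Python CPU work any faster because of the global interpreter lock; it only keeps other tasks responsive.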