TimMikeladze opened 1 year ago
The API could be like this:
```ts
const stream = new HfInference().batch().textGeneration([...] | ...).textToImage([...] | ...);

for await (const output of stream) {
}
```
The `.batch()` call would be a clear delimiter between streaming & non-streaming endpoints, and the results would be yielded one by one as soon as they are available.
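For illustration, a minimal sketch of how such a stream could be consumed. The `BatchOutput` type and its `task`/`image` fields are assumptions made up for this example, not the actual API (only `generated_text` matches the existing `textGeneration()` output):

```ts
// Sketch only: the shape of the yielded items is an assumption, not the real API.
// Each output is tagged with the task that produced it, so a mixed batch can be
// handled in a single `for await` loop as results become available.
type BatchOutput =
  | { task: "textGeneration"; generated_text: string }
  | { task: "textToImage"; image: Blob };

async function consumeBatch(stream: AsyncIterable<BatchOutput>): Promise<void> {
  for await (const output of stream) {
    // Results arrive one by one, as soon as each inference finishes.
    if (output.task === "textGeneration") {
      console.log(output.generated_text);
    } else {
      console.log(`received an image of ${output.image.size} bytes`);
    }
  }
}
```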
We could also allow async iterables as parameters (rather than arrays), so there can be an upload stream too. We should also add a parameter to `batch` defining the number of parallel inferences:
```ts
const stream = new HfInference().batch({ concurrency: X });
```
so that it only sends X requests at a time, like the `concurrency` param of `promisesQueueStreaming`. The default value would need to suit both the Inference API and Inference Endpoints.
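As a rough sketch of the intended behavior (not the library's actual implementation, and not the real signature of `promisesQueueStreaming`), an async generator could keep at most `concurrency` requests in flight and yield each result as soon as it settles:

```ts
// Sketch only: a concurrency-limited pool that yields results as they settle,
// regardless of input order. `tasks` are thunks so nothing starts before the
// pool schedules it. Error handling is omitted for brevity.
async function* batchWithConcurrency<T>(
  tasks: Array<() => Promise<T>>,
  concurrency: number
): AsyncGenerator<T> {
  const inFlight = new Map<number, Promise<{ key: number; value: T }>>();
  let next = 0;

  const launch = () => {
    const key = next++;
    inFlight.set(key, tasks[key]().then((value) => ({ key, value })));
  };

  // Fill the pool up to the concurrency limit.
  while (next < tasks.length && inFlight.size < concurrency) launch();

  while (inFlight.size > 0) {
    // Wait for whichever in-flight request finishes first.
    const { key, value } = await Promise.race(inFlight.values());
    inFlight.delete(key);
    // Start the next request before yielding, to keep the pool full.
    if (next < tasks.length) launch();
    yield value;
  }
}
```

With that shape, `.batch({ concurrency: X })` could simply wrap each per-task call in a thunk and feed the list to such a pool.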
Maybe we should provide two functions:

- `batchOrdered`, to get all results in the same order as the requests;
- `batchUnordered`, to get results as soon as they are generated, regardless of order. In that case we need to give the caller some way to match inputs to outputs, e.g. an `id` param in addition to the other args, also returned in the response (see the sketch below).
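A minimal sketch of what `id`-based matching could look like for `batchUnordered`. The input/output shapes and the `id` field are assumptions for illustration; only `generated_text` matches the existing `textGeneration()` output:

```ts
// Sketch only: hypothetical input/output shapes for batchUnordered().
// The caller attaches an `id` to each input; the same `id` is echoed back with
// the corresponding output, so results can be matched even when they arrive
// out of order.
interface UnorderedTextGenerationInput {
  id: string;     // caller-chosen correlation id (hypothetical param)
  inputs: string; // the prompt, as in the regular textGeneration() args
}

interface UnorderedTextGenerationOutput {
  id: string;     // echoed from the matching input
  generated_text: string;
}

// Usage sketch: match outputs back to their inputs by id.
async function collectUnordered(
  stream: AsyncIterable<UnorderedTextGenerationOutput>,
  inputs: UnorderedTextGenerationInput[]
): Promise<Map<string, string>> {
  const promptById = new Map(inputs.map((input): [string, string] => [input.id, input.inputs]));
  const results = new Map<string, string>();
  for await (const output of stream) {
    console.log(`prompt "${promptById.get(output.id)}" -> ${output.generated_text}`);
    results.set(output.id, output.generated_text);
  }
  return results;
}
```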
@coyotte508 Hi! Is the stream/bulk feature implemented? Because I have an issue here: https://github.com/huggingface/api-inference-community/issues/194#issuecomment-1513409183
It's not implemented client-side, but it should be supported server-side. The issue you linked is filed in the correct place; you should get an answer in the coming days :)
I have the same issue: huggingface/api-inference-community#194. Do you know when this streaming feature is expected to work?
https://huggingface.co/docs/api-inference/parallelism#streaming
Important: a PRO account is required to use and test streaming. I began a partial implementation to add streaming support several months ago; leaving the patch below for future reference.