michaelfeil / infinity

Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting a wide range of text-embedding models and frameworks.
https://michaelfeil.eu/infinity/
MIT License

Move `.detach().cpu()` into `encode_core`, and option to use cuda streams #155

Open · jobright-jiyuan opened this issue 3 months ago

jobright-jiyuan commented 3 months ago

In `encode_post` of `SentenceTransformerPatched` we have:

```python
embeddings = out_features.detach().cpu().to(torch.float32)
```

On GPU, if I'm understanding correctly, `.cpu()` triggers a device-to-host synchronization, which has to wait for all the queued computation to finish and may therefore block for quite some time.

In the batch handler, however, `encode_post` runs inside the async function `_postprocess_batch`, which isn't a good place for blocking code.

Is there any reason for this implementation?
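
For illustration, here is a minimal self-contained sketch of the behaviour being described (hypothetical toy code, not infinity's; requires a CUDA device): the kernel launch returns almost immediately, while `.cpu()` blocks until the queued GPU work has finished.

```python
import time

import torch

# Toy illustration: kernel launches are asynchronous, but the
# device-to-host copy in .cpu() has to wait for them to finish.
x = torch.randn(8192, 8192, device="cuda")

t0 = time.perf_counter()
y = x @ x                  # returns as soon as the matmul is queued
t1 = time.perf_counter()

z = y.detach().cpu()       # blocks until the matmul has finished
t2 = time.perf_counter()

print(f"launch: {t1 - t0:.5f}s, .detach().cpu(): {t2 - t1:.5f}s")
```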

michaelfeil commented 3 months ago

Fair point, thanks for checking out the code so far.

`.to(torch.float32)` -> required if we want to keep doing operations on the tensor once it is on `.cpu()`

Well, if `.cpu()` were done in `encode_core`, the sync would prevent the next batch from running until the memory is synced to RAM / CPU, right? To ask it another way: why is `encode_post` running in an async function instead of `_postprocess_batch`? I tried adding additional threads in #57 - making the `encode*` functions run in an extra worker / os.fork, with 1 thread per function. So far I have not gotten a speedup that way.

Potentially: add one more queue + thread and do the additional postprocessing there? If you found potential GIL bottlenecks, let me know (this one should be okay).
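
A minimal sketch of that "one more queue + thread" idea (hypothetical code, not infinity's actual implementation; `post_queue` and `postprocess_worker` are made up for illustration):

```python
import queue
import threading

import torch

# Hypothetical sketch: a dedicated worker thread performs the blocking
# .detach().cpu() conversion, so the async batch handler never blocks on it.
post_queue: queue.Queue = queue.Queue()

def postprocess_worker() -> None:
    while True:
        item = post_queue.get()
        if item is None:             # sentinel: shut the worker down
            break
        out_features, future = item  # future: e.g. concurrent.futures.Future
        # the device-to-host sync is confined to this worker thread
        embeddings = out_features.detach().cpu().to(torch.float32)
        future.set_result(embeddings)

threading.Thread(target=postprocess_worker, daemon=True).start()
```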

jobright-jiyuan commented 3 months ago

Yeah, I think given the async nature of GPU computation, right now when `encode_core` returns, the GPU computation has actually only just been launched, and `encode_post` then blocks until it is done.

The fact that additional threads didn't help is probably because all the GPU computation is scheduled on the default cuda stream, which serializes the workloads submitted to it. This behavior might be desirable for large models that can reach high GPU utilization, but for small models, using multiple cuda streams can probably improve throughput and latency. I guess additional threads might lead to a speedup if each thread is submitting work to a different cuda stream.

BTW, something unrelated: does the cache implementation lead to a potential race condition? If I have something cached, but lots of items are submitted to the queue at once, we may start running computation on a cached item before it is retrieved from the cache.

michaelfeil commented 3 months ago

> Yeah, I think given the async nature of GPU computation, right now when `encode_core` returns, the GPU computation has actually only just been launched, and `encode_post` then blocks until it is done.

  1. I think you are very right here; there is no `cuda.synchronize` call there - that might actually not be desirable. I'll verify it with some benchmarks.
  2. Interesting, how would I be able to submit a forward pass to multiple cuda streams?
  3. You mean the diskcache feature which writes to sqlite? If active, the diskcache and the GPU queue actually do compete: whoever returns first returns the value to the embedding client. The other implementation will try/except and not return the future, as it's already marked as completed. So far, I have not used this feature extensively - do you find it useful? Might be solved with an external caching layer to remove complexity from the service and increase development velocity - what do you think?
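
Roughly, that try/except pattern comes down to something like the following (a sketch assuming asyncio-style futures, not the actual implementation):

```python
import asyncio

# Sketch of the described race: the cache lookup and the GPU queue both try to
# resolve the same future; whichever comes second hits InvalidStateError and
# simply drops its result.
def resolve_once(fut: asyncio.Future, value) -> None:
    try:
        fut.set_result(value)
    except asyncio.InvalidStateError:
        pass  # the other path already completed the future
```
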
jobright-jiyuan commented 3 months ago

> how would I be able to submit a forward pass to multiple cuda streams

I think it's something like this:

```python
import torch

stream = torch.cuda.Stream()  # create a non-default cuda stream
with torch.cuda.stream(stream):
    # model, inputs and batch_to_device come from the surrounding encode_core context
    model.forward(batch_to_device(inputs))
```

The first line creates a reference to a non-default cuda stream, so one approach is to have multiple threads running `encode_core`, each with its own thread-local stream to use for the model forward; or we maintain a pool of streams and select an idle one each time we run forward. Manual cuda synchronization is probably needed for multi-stream code to work correctly.
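
A rough sketch of the stream-pool variant (hypothetical names, and round-robin selection instead of picking an idle stream, for simplicity):

```python
import itertools
import threading

import torch

# Hypothetical sketch: hand each forward pass a stream from a small pool and
# synchronize that stream before the embeddings leave encode_core.
_streams = [torch.cuda.Stream() for _ in range(4)]
_next_stream = itertools.cycle(_streams)
_pick_lock = threading.Lock()

def forward_on_stream(model, features):
    with _pick_lock:                 # guard the shared round-robin iterator
        stream = next(_next_stream)
    with torch.cuda.stream(stream):  # queue the forward pass on this stream
        out = model.forward(features)
    stream.synchronize()             # manual sync before handing results on
    return out
```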

> Might be solved with an external caching layer to remove complexity from the service and increase development velocity

Yeah, this is the next thing in my plan; I was just trying to see how far I can get without adding an external cache as a dependency. I think the current diskcache race condition is okay for low-QPS cases, but when many requests arrive at once, the compute queue fills up quickly and unnecessary computation can happen quite frequently.

michaelfeil commented 3 months ago

Should be implemented.

I did not move forward with the cuda streams for now, as they require more VRAM.