Avsecz opened 5 years ago
@shabnamsadegh I have another idea for how to implement this that could also benefit the core batch writers: we could implement an `AsyncBatchWriter` class in `kipoi.writers` that takes another batch writer and makes it asynchronous. The background worker's loop should just be a while loop that immediately calls `batch_writer.batch_write()` on each queued batch.
```python
import warnings
from queue import Queue
from threading import Thread

from kipoi.writers import BatchWriter


class AsyncBatchWriter(BatchWriter):

    def __init__(self, batch_writer, max_queue_size=100):
        """
        Args:
          batch_writer: BatchWriter to wrap and run asynchronously
          max_queue_size: maximal queue size. If it gets larger, batch_write needs to wait
            until it can write to the queue again.
        """
        self.batch_writer = batch_writer
        self.max_queue_size = max_queue_size
        # instantiate the queue and start the worker
        # (a thread keeps the wrapped writer and its file handle in the same process)
        self.queue = Queue(maxsize=max_queue_size)
        self.worker = Thread(target=self._write_loop, daemon=True)
        self.worker.start()

    def _write_loop(self):
        # run until the sentinel (None) arrives, immediately writing each queued batch
        while True:
            batch = self.queue.get()
            if batch is None:
                break
            self.batch_writer.batch_write(batch)

    def batch_write(self, batch):
        """Write a single batch of data

        Args:
          batch: one batch of data (nested numpy arrays with the same axis 0 shape)
        """
        if self.queue.qsize() >= self.max_queue_size:
            # display a warning; put() below blocks until the queue is small enough again
            warnings.warn("AsyncBatchWriter queue is full - waiting for the writer to catch up")
        self.queue.put(batch)

    def close(self):
        """Close the file
        """
        # stop the worker, make sure the queue is empty, then close the file
        self.queue.put(None)
        self.worker.join()
        self.batch_writer.close()
```
With this approach we would just need to add that class to `kipoi.writers` and then change this line of code to:

```python
extra_writers = [SyncBatchWriter(AsyncBatchWriter(writer))]
```
Note that `SyncBatchWriter` is a very confusing name, since it actually converts the variant scores into the input usable by `BatchWriter`s.
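To illustrate the intended behaviour, here is a minimal usage sketch of the async wrapper on its own (the file name and toy batch are made up; in kipoi-veff the batches would come from the prediction loop and pass through `SyncBatchWriter` first):

```python
import numpy as np
from kipoi.writers import TsvBatchWriter

# batch_write() now only enqueues and returns quickly, while the background
# worker performs the actual disk writes
writer = AsyncBatchWriter(TsvBatchWriter("predictions.tsv"))

for _ in range(10):
    # toy batch: a dict of numpy arrays with the same axis-0 length
    batch = {"preds": np.random.rand(32, 4)}
    writer.batch_write(batch)

writer.close()  # drains the queue, joins the worker and closes the file
```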
- [x] buffer writes - https://github.com/kipoi/kipoi-veff/pull/21 (e.g. don't write predictions to disk on every batch but only every now and then; see the buffered-writer sketch after this list)
- [ ] use asynchronous writes
Here is the main loop that performs the predictions and writes:
https://github.com/kipoi/kipoi-veff/blob/master/kipoi_veff/snv_predict.py#L620-L658
- [ ] setup some standardized benchmarks to test the overhead
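To make the buffered-writes item concrete, here is a rough sketch of what such a wrapper could look like; it is not the implementation from PR #21, and `flush_every` is an illustrative parameter:

```python
from kipoi.writers import BatchWriter


class BufferedBatchWriter(BatchWriter):
    """Accumulate batches in memory and only forward them to the wrapped
    writer every `flush_every` batches, reducing the number of disk writes."""

    def __init__(self, batch_writer, flush_every=10):
        self.batch_writer = batch_writer
        self.flush_every = flush_every
        self.buffer = []

    def batch_write(self, batch):
        self.buffer.append(batch)
        if len(self.buffer) >= self.flush_every:
            self._flush()

    def _flush(self):
        # forward the buffered batches one by one; concatenating them into a
        # single large batch is another option if the writer prefers big writes
        for b in self.buffer:
            self.batch_writer.batch_write(b)
        self.buffer = []

    def close(self):
        self._flush()
        self.batch_writer.close()
```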
Tasks
- Work through the profiling notebook: https://github.com/kipoi/kipoi-veff/blob/write_buffer/notebooks/code-profiling.ipynb
- Finish the code on the write buffer PR by speeding up the writing so that it takes a minimal amount of time (see the timing sketch below).
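For the benchmarking item, a minimal timing harness along the following lines could compare the per-batch overhead of the plain, buffered and asynchronous writers (writer classes, file names and batch shapes are illustrative, not a standardized benchmark):

```python
import time
import numpy as np
from kipoi.writers import TsvBatchWriter


def time_writer(writer, n_batches=100, batch_size=32):
    """Return the seconds spent in batch_write() + close() for toy batches."""
    batch = {"preds": np.random.rand(batch_size, 4)}
    start = time.perf_counter()
    for _ in range(n_batches):
        writer.batch_write(batch)
    writer.close()
    return time.perf_counter() - start


print("plain   ", time_writer(TsvBatchWriter("plain.tsv")))
print("buffered", time_writer(BufferedBatchWriter(TsvBatchWriter("buffered.tsv"))))
print("async   ", time_writer(AsyncBatchWriter(TsvBatchWriter("async.tsv"))))
```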