maikel / senders-io

An adaptation of Senders/Receivers for async networking and I/O
Apache License 2.0

Building senders-io #51

Closed: mfbalin closed this issue 1 year ago

mfbalin commented 1 year ago

Great work! I am trying to experiment with the file_handle API in this library. I added the stdexec dependency using CPM but couldn't get the library to compile. What version of the compiler and stdexec do I need to build this library?

Also, what is the use of passing multiple buffers to the read API? From what I can see, the read happens from a single offset, and the buffers are filled one after another. It would be nice to have a batched read that takes a span of buffers and a span of offsets in one-to-one correspondence.

mfbalin commented 1 year ago

I also want to use stdexec along with the capabilities presented in this library to form a backend for my Graph Neural Network training framework for huge graphs that don't fit into RAM. What is important to me is to be able to use the full capacity of the SSD with batched random accesses while using senders for everything. I am open to collaborating so that the missing pieces I might need in the future are available sooner :).

maikel commented 1 year ago

> Great work! I am trying to experiment with the file_handle API in this library. I added the stdexec dependency using CPM but couldn't get the library to compile. What version of the compiler and stdexec do I need to build this library?

The best reference is currently the GitHub Actions workflow. I've built on top of the member-CPO branch that Ville originally started.

> Also, what is the use of passing multiple buffers to the read API? From what I can see, the read happens from a single offset, and the buffers are filled one after another. It would be nice to have a batched read that takes a span of buffers and a span of offsets in one-to-one correspondence.

If the vectored API allows that, then it is an oversight on my side. There is still some use for the current API, since each buffer could contain a different header, such as Ethernet, IP, TCP, etc., and you don't need to copy those header structs into a single contiguous buffer.
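
For reference, that is exactly what the underlying vectored syscall gives you: a single preadv reads from one offset and scatters the bytes across several caller-owned buffers in order. A minimal POSIX sketch (the file name and header sizes are made up for illustration):

```c++
#include <cstdio>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

// Illustrative header sizes only.
struct eth_header { unsigned char bytes[14]; };
struct ip_header  { unsigned char bytes[20]; };

int main() {
  int fd = ::open("packet.bin", O_RDONLY);  // hypothetical capture file
  if (fd < 0) return 1;

  eth_header eth{};
  ip_header  ip{};
  // One offset, two buffers: bytes [0, 14) land in eth, [14, 34) in ip,
  // without copying the headers out of a single contiguous staging buffer.
  iovec iov[2] = {
    { &eth, sizeof(eth) },
    { &ip,  sizeof(ip)  },
  };
  ssize_t n = ::preadv(fd, iov, 2, /*offset=*/0);
  std::printf("scattered %zd bytes into two header structs\n", n);
  ::close(fd);
  return 0;
}
```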

mfbalin commented 1 year ago

The API could look like read_batched(std::span<buffer_type> buffers, std::span<offset_type> offsets) with buffers.size() == offsets.size(). When the sender returned by read_batched is awaited, the buffers would contain the read items, specifically buffers[i][j] == file_data[offsets[i] + j] for all 0 <= j < buffers[i].size() and 0 <= i < buffers.size(). The buffers could be slices of one huge buffer, which can be registered with io_uring so that the copies can be performed without any extra buffers.
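
To pin the proposed semantics down, here is a small runnable reference model; it stands in for the real file with an in-memory byte array and "reads" with memcpy, so everything in it is illustrative rather than senders-io API:

```c++
#include <cassert>
#include <cstddef>
#include <cstring>
#include <span>
#include <vector>

// Synchronous reference model of the proposed read_batched semantics:
// after completion, buffers[i][j] == file_data[offsets[i] + j].
void read_batched_model(std::span<const std::byte> file_data,
                        std::span<std::span<std::byte>> buffers,
                        std::span<const std::size_t> offsets) {
  assert(buffers.size() == offsets.size());
  for (std::size_t i = 0; i < buffers.size(); ++i)
    std::memcpy(buffers[i].data(), file_data.data() + offsets[i],
                buffers[i].size());
}

int main() {
  std::vector<std::byte> file(1024);
  for (std::size_t i = 0; i < file.size(); ++i)
    file[i] = static_cast<std::byte>(i);

  // Two target buffers that are slices of one big (registrable) buffer.
  std::vector<std::byte> big(32);
  std::span<std::byte> views[] = { {big.data(), 16}, {big.data() + 16, 16} };
  std::size_t offsets[] = { 100, 700 };

  read_batched_model(file, views, offsets);
  assert(big[0]  == static_cast<std::byte>(100));
  assert(big[16] == static_cast<std::byte>(700 % 256));
}
```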

maikel commented 1 year ago

> I also want to use stdexec along with the capabilities presented in this library to form a backend for my Graph Neural Network training framework for huge graphs that don't fit into RAM. What is important to me is to be able to use the full capacity of the SSD with batched random accesses while using senders for everything. I am open to collaborating so that the missing pieces I might need in the future are available sooner :).

I'm happy to help and to be helped. I think I must focus on documentation and tests at this point. I'm still unhappy with some things around sequence senders, but those might not be necessary for you.

mfbalin commented 1 year ago

I was able to get it to compile after looking into the GitHub Actions as you suggested. Thanks! Closing this issue, but I would like to keep in touch in the future.

maikel commented 1 year ago

> The API could look like read_batched(std::span<buffer_type> buffers, std::span<offset_type> offsets) with buffers.size() == offsets.size(). When the sender returned by read_batched is awaited, the buffers would contain the read items, specifically buffers[i][j] == file_data[offsets[i] + j] for all 0 <= j < buffers[i].size() and 0 <= i < buffers.size(). The buffers could be slices of one huge buffer, which can be registered with io_uring so that the copies can be performed without any extra buffers.

This doesn't directly map onto a single async preadv call; read_batched is a composed algorithm. I think I can provide you with a simple implementation using sio::iterate and sio::async::read. Would this help?

mfbalin commented 1 year ago

It certainly would, so long as the overhead of the abstractions used to issue the reads into the queue is small and we can get full performance. I am talking about a lot of small reads: on average 400 bytes per buffer and thousands of reads per batch.

What is the exact difference between read and read_some? I am not fully used to the sender abstractions yet, so it is a bit hard to decode what the code is doing. From what I understand, read_some returns when the buffers are full or the file is exhausted, while read seems to always exhaust the file. However, if the buffers get full, does read return multiple times and give control back to the user with full buffers?

maikel commented 1 year ago

> It certainly would, so long as the overhead of the abstractions used to issue the reads into the queue is small and we can get full performance. I am talking about a lot of small reads: on average 400 bytes per buffer and thousands of reads per batch.

There is some freedom in the algorithm. Each offset needs a separate read operation submitted to io_uring's queue. Then it's a question of how many active operations you allow at once. Each active operation needs some storage. I have implemented an async memory pool to naturally limit the number of active ops dynamically at run time.
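
As a rough mental model of that limiting behavior (a thread-based analogy only, not sio's actual async pool), acquiring a counting semaphore can play the role of grabbing per-operation storage from the pool, so at most a fixed number of "operations" are in flight at once:

```c++
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <semaphore>
#include <thread>
#include <vector>

// At most kMaxInFlight operations hold their per-op "storage" at any moment.
constexpr std::ptrdiff_t kMaxInFlight = 8;
std::counting_semaphore<kMaxInFlight> slots{kMaxInFlight};

// Stand-in for submitting one read op for one offset and awaiting it.
void fake_read(std::size_t offset) {
  std::this_thread::sleep_for(std::chrono::milliseconds(1));
  std::printf("completed read at offset %zu\n", offset);
}

int main() {
  std::vector<std::jthread> ops;
  for (std::size_t off = 0; off < 64; off += 4) {
    slots.acquire();                 // blocks once kMaxInFlight ops are active
    ops.emplace_back([off] {
      fake_read(off);
      slots.release();               // frees a slot for the next op
    });
  }                                  // jthreads join on destruction
}
```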

> What is the exact difference between read and read_some? I am not fully used to the sender abstractions yet, so it is a bit hard to decode what the code is doing. From what I understand, read_some returns when the buffers are full or the file is exhausted, while read seems to always exhaust the file. However, if the buffers get full, does read return multiple times and give control back to the user with full buffers?

One submitted read operation can complete with fewer bytes than requested; that's how the syscall works. read_some submits exactly one read op and returns the number of bytes that have been read. The read algorithm submits consecutive reads until all bytes have been read.
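
So, in effect, read is a retry loop over read_some. A hedged synchronous sketch of that relationship on top of POSIX pread (the names mirror the library's algorithms, but this is a model, not its implementation):

```c++
#include <cstddef>
#include <span>
#include <unistd.h>  // pread

// One syscall; may legitimately return fewer bytes than requested.
ssize_t read_some_model(int fd, std::span<std::byte> buf, off_t offset) {
  return ::pread(fd, buf.data(), buf.size(), offset);
}

// Issues consecutive reads until the buffer is full, the file is exhausted
// (pread returns 0), or an error occurs (pread returns -1).
ssize_t read_model(int fd, std::span<std::byte> buf, off_t offset) {
  std::size_t total = 0;
  while (total < buf.size()) {
    ssize_t n = read_some_model(fd, buf.subspan(total), offset + total);
    if (n < 0) return n;                              // error
    if (n == 0) return static_cast<ssize_t>(total);   // end of file
    total += static_cast<std::size_t>(n);
  }
  return static_cast<ssize_t>(total);
}
```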

mfbalin commented 1 year ago

Cool, I am excited to see how much performance we can extract with a simple implementation. I will benchmark it as soon as I can get my hands on it. If it is really simple to implement with what already exists in the library, any hint on how to implement it so I can start experimenting?

> There is some freedom in the algorithm. Each offset needs a separate read operation submitted to io_uring's queue. Then it's a question of how many active operations you allow at once. Each active operation needs some storage. I have implemented an async memory pool to naturally limit the number of active ops dynamically at run time.

SSDs can be quite fast, and they are capable of many random reads at once. From what I know, the NVMe queue depth can be as large as 65k. Thus, we might need thousands of read operations in flight to extract full performance, especially when each individual read in a batch is only a few hundred bytes.

maikel commented 1 year ago

It's really just a combination of iterate, fork, let_value_each, and async read. I will push a test later.
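
For anyone following along, here is a rough shape of how those pieces might compose. Every spelling and signature below is an assumption about sio's sequence-sender API (only iterate, fork, let_value_each, and async read are named in this thread), so treat it as pseudocode rather than working code:

```c++
#include <cstddef>
#include <cstdint>
#include <ranges>
#include <span>

// Pseudocode sketch: all sio signatures here are assumed, not the real API.
auto read_batched(auto handle,
                  std::span<std::span<std::byte>> buffers,
                  std::span<const std::uint64_t> offsets) {
  // iterate: turn the index range into a sequence sender (assumed).
  return sio::iterate(std::views::iota(std::size_t{0}, buffers.size()))
       // fork: let the per-index items proceed concurrently (assumed).
       | sio::fork()
       // let_value_each: map each index to one async read op (assumed).
       | sio::let_value_each([=](std::size_t i) {
           return sio::async::read(handle, buffers[i], offsets[i]);
         });
  // A real implementation would still need to collapse the sequence into a
  // single sender that completes once every read has finished.
}
```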