sahlberg / libnfs

NFS client library

nfsv3 async write sometimes sends malformed packets #445

Closed amulyan13 closed 5 months ago

amulyan13 commented 6 months ago

Looks like there is a bug in the write code (nfs_pwrite_async) that causes the write request to be sent with garbage arguments. Different fields of the write request get garbled: filehandle, length, etc.

Error: Server responded: Garbage arguments

sahlberg commented 6 months ago

Hmmm. This is odd; this should be very stable, as many very large processing pipelines use this.

1. Can you show me the hex data for this packet, as well as the next few packets in this TCP session, in case I can see some pattern: whether this is pure random corruption or, say, packet data interleaved from different PDUs.

2. Is this current master? Can you try the latest release, tag 5.0.3? That is the old API and should be very stable. Since then, current master has gained both a major API change to allow zero-copy READs and some refactoring to add GSSAPIv1 support, so we can see whether or not this is a regression from those major changes in the new code.

Note that 5.0.3 and current master have different APIs and function signatures for the [p]read and [p]write calls.

3. Can you tell me about the application? It uses the async API, but is it single- or multi-threaded? By default libnfs is purely event-driven and does not support multithreading. Multithreading can be used on some platforms (Linux and win32) but requires libnfs to be built specifically with multithreading support. It also requires some extra library calls to spawn a worker thread for the socket before multithreading can be used.
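
For illustration, a minimal sketch of those extra calls when libnfs is built with multithreading support; nfs_mt_service_thread_start()/nfs_mt_service_thread_stop() are the service-thread calls in current master, but verify the exact names and signatures against the headers in your checkout:

```c
#include <nfsc/libnfs.h>

/* Sketch only: the extra calls needed for multithreaded use, assuming
 * libnfs was built with multithreading support (HAVE_MULTITHREADING). */
int start_multithreading(struct nfs_context *nfs)
{
        /* Spawn the worker thread that services the socket. */
        if (nfs_mt_service_thread_start(nfs) != 0)
                return -1;

        /* ... the sync API may now be called from multiple threads ... */

        nfs_mt_service_thread_stop(nfs);
        return 0;
}
```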

regards ronnie s

amulyan13 commented 6 months ago

The master I am using is a week old. I will try the latest master.

Below is the call that I am making:

```c
nfs_pwrite_async(nfscontexts[contextIdx % NUM_CONNECTIONS], hellofh,
                 WriteData[bufIdx]->buffer, bytes_read /* size */, offset,
                 write_callback, WriteData[bufIdx]);
```

This is a single-threaded application where I create multiple connections to the server through multiple nfs contexts, send the writes across these connections, and wait for all the writes to complete in the callback, roughly as sketched below.
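
A minimal, self-contained sketch of that dispatch pattern (NUM_CONNECTIONS, nfscontexts, hellofh and the chunk bookkeeping are illustrative stand-ins for my program's state; the nfs_pwrite_async argument order follows the call above):

```c
#include <stdio.h>
#include <stdint.h>
#include <nfsc/libnfs.h>

#define NUM_CONNECTIONS 4

/* Illustrative globals standing in for the program's state. */
static struct nfs_context *nfscontexts[NUM_CONNECTIONS];
static struct nfsfh *hellofh;
static int writes_in_flight;

static void write_callback(int status, struct nfs_context *nfs,
                           void *data, void *private_data)
{
        if (status < 0)
                fprintf(stderr, "pwrite failed: %s\n", nfs_get_error(nfs));
        writes_in_flight--;
}

/* Round-robin each chunk across the contexts/connections. */
static void dispatch(char **bufs, size_t chunk_len, int num_chunks)
{
        for (int i = 0; i < num_chunks; i++) {
                struct nfs_context *ctx = nfscontexts[i % NUM_CONNECTIONS];

                if (nfs_pwrite_async(ctx, hellofh, bufs[i], chunk_len,
                                     (uint64_t)i * chunk_len,
                                     write_callback, NULL) != 0)
                        break;
                writes_in_flight++;
        }
}
```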

I see this issue when sending a large amount of async writes, basically when trying to write, say, 1GB worth of data.

I will share the packet traces shortly.

amulyan13 commented 6 months ago

writecorrupt.zip

I have uploaded the traces of the corrupted write and its complete TCP stream. The xid of the failing request is 0x9a27dbec.

Thanks @sahlberg for looking at this.

amulyan13 commented 6 months ago

Hi, I am unable to attach the packet traces here, so I have uploaded them to Drive and am sharing the link: https://drive.google.com/file/d/1wxZ9IVg-0LVGXE1_J5iH08ewG_oI6a11/view?usp=drive_link

sahlberg commented 6 months ago

I will have a look.

Can you try using version 5.0.3? You can just do a "git checkout libnfs-5.0.3" and try it. Note that that version has a slight change to the [p]read and [p]write signatures, so you will need to make some trivial changes to the source code. But that version is very stable, and it is the cut-off before the big re-factoring to add zero-copy reads.

As you are running a single-threaded async application, can you make sure that you compile libnfs without multithreading support? If you are building with autotools that would be ./configure ... --disable-pthread; verify that HAVE_MULTITHREADING is commented out in config.h (that should be the default on Linux). On win32 I am not sure exactly how to disable it, but I think removing it from win32/libnfs/libnfs.vcxproj should do the trick. This removes the multithreading support completely, which you do not need in a single-threaded application, and it will increase performance very slightly since it removes the conditionals that perform locking.

Another question is, why are you using so many different contexts if you are single-threaded and using the async API? (nfscontexts[contextIdx % NUM_CONNECTIONS]) In a single-threaded async application you should be fine using just a single context, even with very high concurrency; there are people that do over a hundred thousand concurrent async I/Os on a single socket. You will basically be limited by NIC bandwidth and the read/write-to-socket system calls. Can you try running with just a single context?

Performance tip for later, once we fix the PDU corruption bug: to handle a large amount of concurrent I/O, libnfs internally uses a hash map of the xids of commands in flight. Each hash entry has a linked list of in-flight xids. The default is 4 hashes/chains, but this can be changed using rpc_set_hash_size(); see the docs for this. These hash chains are traversed in two situations. Every time we receive a reply, we traverse the chain that the reply xid hashed to; if the hash size is too small and you have an enormous amount of I/O in flight, this linked list will be long and scanning it to find a match will hurt performance. The other place where we traverse them is the once-a-second scan for PDU timeouts; if you have a huge number of chains but very low concurrency, we will still burn a lot of cycles stepping through a huge number of empty chains, which will show up in profiling.

It is hard to say what the optimum is, but maybe a hash size of 1/10 of the amount of I/O you expect to have in flight at any given time. I.e. if you expect to do 200000 concurrent I/Os at a time, a hash size of 20000 might be suitable. That is more for application tuning once we fix the bug with the corrupted PDUs, but it is important to mention: while a size of 4 is a good default for a casual application, it is way too small if you want to saturate your network. A sketch of the tuning follows below.
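
A minimal sketch of that tuning, assuming the raw-API accessor nfs_get_rpc_context() and an integer hash-size parameter for rpc_set_hash_size(); check libnfs-raw.h in your checkout for the exact signatures:

```c
#include <nfsc/libnfs.h>
#include <nfsc/libnfs-raw.h>

/* Rule of thumb from above: hash size ~ 1/10 of expected in-flight I/O,
 * e.g. 20000 hashes for 200000 concurrent I/Os. Signatures assumed. */
static void tune_xid_hash(struct nfs_context *nfs, int expected_in_flight)
{
        struct rpc_context *rpc = nfs_get_rpc_context(nfs);

        rpc_set_hash_size(rpc, expected_in_flight / 10);
}
```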

amulyan13 commented 6 months ago

Thanks @sahlberg. Attaching my sample program here: copylibnfs.txt

> Another question is, why are you using so many different contexts if you are single-threaded and using the async API?

We want to overcome the bandwidth limit of a single network connection; we want something similar to the nconnect option, where we can open multiple connections to the server.

> As you are running a singlethreaded async application, can you make sure that you compile libnfs without multithreading support?

All the writes that I make are done from a single thread, but in order to poll and dispatch the events I have another thread which keeps running to check for any events on the context. I had to do this because I noticed that the async calls I make are not sent over the network immediately and wait almost until I close the file handle. The polling thread is essentially the loop sketched below.
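
A minimal sketch of that polling thread, using the standard libnfs event-loop calls (the thread wrapper and the running flag are illustrative):

```c
#include <poll.h>
#include <nfsc/libnfs.h>

static volatile int running = 1;

/* Event-loop body for the polling thread: ask libnfs which events it
 * wants, poll, then let nfs_service() do the actual socket I/O. */
static void *service_loop(void *arg)
{
        struct nfs_context *nfs = arg;
        struct pollfd pfd;

        while (running) {
                pfd.fd      = nfs_get_fd(nfs);
                pfd.events  = nfs_which_events(nfs);
                pfd.revents = 0;
                if (poll(&pfd, 1, 100) < 0)
                        break;
                if (nfs_service(nfs, pfd.revents) < 0)
                        break;
        }
        return NULL;
}
```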

sahlberg commented 6 months ago

I see. I don't think this technique will have any performance benefit with a single-threaded client and libnfs. The difference with the Linux kernel NFS client is that the kernel client is both blocking and multithreaded. The way that works is that threads/processes from userspace call into the kernel, where they go through the VFS layer and end up in the kernel NFS client. At this point the (kernel) thread creates a PDU, queues it for dispatch, and then blocks until a reply is received back and the thread is woken up. All the actual I/O in the kernel NFS client is serviced via a dedicated kernel multiplexing thread that does all I/O to/from the socket. This multiplex thread takes PDUs that are queued for dispatch and writes them to the wire; then it reads the replies back from the socket, identifies which thread is blocked waiting for this xid, and wakes that thread.

What nconnect does is create multiple parallel TCP connections, with one dedicated multiplexing thread for each TCP connection, thus allowing multiple cores to read/write to a (different) socket concurrently, i.e. multiple concurrent multiplex threads. This can help throughput if you are bounded by the amount of data you can read/write to a socket from a single core.

With a single-threaded application and libnfs I doubt it will be beneficial, since with a single thread you can only read/write to one socket at a time anyway; multiple connections would not really benefit a single-threaded application. I am happy to be proven wrong, but I doubt that this approach will have any performance benefit.

What you can do is make the application multithreaded and give each thread its own thread-private libnfs context: N threads, each with its own private context. That would work, and it would allow your application to benefit from reading/writing to (different) sockets from concurrent threads. As long as the contexts are thread-private you would not even need to build libnfs with multithreading support; that is only required if you want to share one context across multiple concurrent threads. A sketch of this layout follows below.
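
A minimal sketch of the thread-private layout, using the sync API (the server, export and path strings are placeholders):

```c
#include <fcntl.h>
#include <pthread.h>
#include <nfsc/libnfs.h>

/* Each worker owns a private context, so no libnfs multithreading
 * support is needed. Server, export and path are placeholders. */
static void *worker(void *arg)
{
        struct nfs_context *nfs = nfs_init_context();
        struct nfsfh *fh;

        if (nfs == NULL)
                return NULL;
        if (nfs_mount(nfs, "10.10.10.11", "/data") == 0 &&
            nfs_open(nfs, "/f1/100mb", O_WRONLY, &fh) == 0) {
                /* ... nfs_pwrite() this thread's region of the file ... */
                nfs_close(nfs, fh);
        }
        nfs_destroy_context(nfs);
        return NULL;
}
```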

sahlberg commented 6 months ago

I see, I am pretty confident this is the issue here: your app is not single-threaded but has a separate thread to service the socket and invoke callbacks, so it is basically multithreaded as far as libnfs is concerned.

Most of the time, all socket I/O is done by the event loop calling rpc_service(). But there is one exception. As an optimization, if we are enqueueing a new RPC call and the outqueue is completely empty, then instead of just enqueueing it and waiting until we return to the event loop so that it can be written to the socket on the next iteration, we bypass the event loop and start writing it straight to the socket right away.

See here, in rpc_queue_pdu():

```c
        if (rpc->outqueue.head == pdu) {
                if (rpc->multithreading_enabled) {
                        nfs_mt_mutex_unlock(&rpc->rpc_mutex);
                }
                rpc_write_to_socket(rpc);    /* <======= */
        } else {
        ...
```

This is an optimisation to reduce latency on often/mostly idle contexts: the PDU is written straight to the socket instead of delaying the write by tens of microseconds or more. It is safe to do this in a pure single-threaded, event-driven application.

This can however race with the event loop if you run the event loop in a separate thread: the event loop invoking rpc_service() can collide with the call I point to above, leading to two RPC PDUs being written interleaved to the socket.

The event-driven async API was not designed to be used in this way. It was designed to be used only in pure single-threaded applications.

Try commenting out this call to rpc_write_to_socket(). This will avoid this kind of interleaved, colliding writes to the same socket, but it might just surface other issues with lack of locking around updates to critical regions. It might work; it should avoid the corrupted PDUs on the wire, but I suspect it might cause a crash in some other way instead. I did not envision a single/multi-threaded hybrid like this, so I am not super surprised that things might not work completely. If you switch to a pure single-threaded application then I am certain things will work for you. I don't know if that is viable for your application or not.

Let me think about this over the coming weekend. I should be able to add support for this specific use case: two threads, one running an event loop (which might queue new PDUs from the callbacks it invokes) and a master thread that just issues new PDUs. It should be fairly easy for me to add, but thread locking is delicate, so it is not something I can rush.

At least we know why the corrupted PDUs happen, so that is the first step to a solution, I guess.

Actually, looking at it more: for THIS particular program I think it will work if you just comment out the call to rpc_write_to_socket() I indicated above, because as far as I can tell you don't (yet) make any new _async() calls from within the callbacks that are invoked from the event loop. But that is a fairly restrictive limitation on dual-thread applications like this, so I probably need to make the dual-thread case safe regardless. Let me work on it this weekend. For now you can just try commenting out the call above.

amulyan13 commented 6 months ago

Hi @sahlberg, I tried commenting out the rpc_write_to_socket(rpc) call as suggested, but I still see the issue.

> With a single threaded application and libnfs I doubt it will be beneficial since with a single thread you can only read/write to one socket at a time anyway.

So what do you suggest is the right approach here? I basically want the writes to be sent over multiple connections without waiting for any write to complete. The idea is to send out the write PDU to the socket as soon as I call the write_async*() API, so that I have a good amount of writes waiting at the server to be processed; I want the write packets flushed out to the wire as soon as the write call is made from the application.

I was thinking of creating N threads for N connections, each with its own nfs context, and calling the write_async* API on them, but this call will not put the data on the wire until we poll for the event and then call nfs_service() in a loop. This makes it an almost sync call, which is not desired. Do you think using the raw APIs would help here? I appreciate your suggestions on this.

sahlberg commented 5 months ago

I have created and uploaded a simple example program to show the async interface for pwrite. examples/nfs-writefile.c

It is single-threaded and writes a file onto the server using async pwrite calls. By default it tries to do 1024 writes concurrently and keeps the pipe full at all times. The write size defaults to 64kB, but you can change both in the sources to test.

For very large files, maybe set the write size to 1MB and then just dial up the max concurrency until the network link is saturated. The core of the pattern is sketched below.

(The biggest cost to performance is likely the read() of the local file into a buffer that is then passed to libnfs. mmap()ing the file and using pointers straight into the mapped area would perform better, but would not be portable.)
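
A minimal sketch of the keep-the-pipe-full pattern (MAX_CONCURRENCY, in_flight and the buffer handling are illustrative names, not the actual variables in examples/nfs-writefile.c; the nfs_pwrite_async argument order follows current master):

```c
#include <stdint.h>
#include <nfsc/libnfs.h>

#define MAX_CONCURRENCY 1024
#define WRITE_SIZE      (64 * 1024)

/* Illustrative state; the real example keeps its own equivalents. */
static struct nfsfh *fh;
static char *buf;                 /* whole source file in memory */
static uint64_t next_offset, file_size;
static int in_flight;

static void fill_pipe(struct nfs_context *nfs);

static void write_cb(int status, struct nfs_context *nfs,
                     void *data, void *private_data)
{
        in_flight--;
        fill_pipe(nfs);           /* top the pipe back up */
}

static void fill_pipe(struct nfs_context *nfs)
{
        while (in_flight < MAX_CONCURRENCY && next_offset < file_size) {
                uint64_t len = file_size - next_offset;

                if (len > WRITE_SIZE)
                        len = WRITE_SIZE;
                if (nfs_pwrite_async(nfs, fh, buf + next_offset, len,
                                     next_offset, write_cb, NULL) != 0)
                        break;
                next_offset += len;
                in_flight++;
        }
}
```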

Example: I used a 100MByte file and copied it to the server using this command line: ./examples/nfs-writefile 100M nfs://10.10.10.11/data/SNAP-4/f1/100mb

amulyan13 commented 5 months ago

Hi @sahlberg, thanks for sharing the test program. With the above program I have just 1 connection going out to the server, whereas I want multiple connections to the server so that the network bandwidth is not the limit. I am attaching the sample program I am trying, which creates multiple connections using multiple nfs contexts. One main thread reads from the local file and queues the write requests, and I spawn 'x' other threads where each thread just monitors 1 nfs context and does the reads and writes to the socket for that connection. This way the actual dispatch to the socket is done by one thread for a given connection, and the actual queueing to that socket is done by the other thread. This should help us overcome the network bandwidth limitation. But this is causing the write failure issues I mentioned initially, where the writes fail with a 'garbage args' error or with a 'No bytes written' error, so I am unable to get this to work consistently.

> But that is a fairly restrictive limitation on the use of dual-thread applications like this so I probably need to try to make the dual-thread case safe regardless. Let me work on it this weekend.

Were you able to add this support? I want to give it a try if this issue is fixed.

copylibnfs.txt

sahlberg commented 5 months ago

I will send you an example application that uses n threads spread across m contexts to write a file to a server this coming weekend.

linuxsmiths commented 5 months ago

@sahlberg, I've created a PR with the fix for the RPC corruption issue discussed here.

https://github.com/sahlberg/libnfs/pull/448

With this fix, I was able to successfully copy multiple large files (100s of 100+GB files) using the nfs-writefile.c sample program. There was a memory leak in that program. Will send another PR for that.

Please review and merge as appropriate.

sahlberg commented 5 months ago

Please try current master. I fixed a bug in rpc_read_from_socket() where it used a global/static variable, which caused issues when using multiple different contexts from different threads.

See also the new examples/nfs-pthreads-writefile.c example. This program shows multithreading using multiple contexts, each with its own service thread, and then creates and uses threads to perform writes to a file using the sync API. This should be similar to how nconnect=4 works with the Linux kernel client: 4 different service/multiplex threads that mostly/only do reads/writes to the corresponding socket, plus normal worker threads that do synchronous I/O through them.

Create, say, a 1GB file and use this example to copy it to the server. By default it creates 4 service threads/contexts and rotates the worker threads across these 4 contexts. It creates one worker thread for each 10MB chunk of the file, so it will create 103 threads to write the full 1GB file, spread across the 4 contexts, roughly as sketched below.
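
Roughly the shape of the worker spawning, as a sketch (the chunk bookkeeping and names are illustrative, not the example's own code; see examples/nfs-pthreads-writefile.c for the real version):

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <nfsc/libnfs.h>

#define NUM_CONTEXTS 4
#define CHUNK_SIZE   (10 * 1024 * 1024ULL)

struct chunk { struct nfs_context *nfs; uint64_t offset; uint64_t len; };

static void *write_chunk(void *arg)
{
        struct chunk *c = arg;

        /* nfs_pwrite() c->len bytes of the source file at c->offset
         * through c->nfs. Body omitted in this sketch. */
        free(c);
        return NULL;
}

/* One worker thread per 10MB chunk, rotated across the contexts; each
 * context has its own service thread started elsewhere. */
static void spawn_workers(struct nfs_context **ctx, pthread_t *tids,
                          uint64_t file_size)
{
        int i = 0;

        for (uint64_t off = 0; off < file_size; off += CHUNK_SIZE, i++) {
                struct chunk *c = malloc(sizeof(*c));

                c->nfs    = ctx[i % NUM_CONTEXTS];
                c->offset = off;
                c->len    = (file_size - off < CHUNK_SIZE) ?
                            file_size - off : CHUNK_SIZE;
                pthread_create(&tids[i], NULL, write_chunk, c);
        }
}
```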

This example uses the sync API and pthreads. I did think about doing it with the async/event-driven API, but I am not sure about that approach: non-blocking async event-driven designs and thread-based sync designs are two orthogonal ways to increase concurrency, and I am not sure it is a good idea to mix them. I could look over the async interface if you want to use it instead of the sync interface with this multithreaded approach.

This should provide pretty good performance: the more worker threads you create, the higher the concurrency and the higher the throughput. How to decide on the number of contexts and service threads? These threads should never block or sleep aside from poll/select waiting for the socket to become readable/writeable, so they should be mostly bounded by CPU and memory bandwidth, reading/writing memory to/from the socket. So if you have instrumentation where you can see these threads' CPU utilization, that could help: if they are permanently stuck at near 100% you may need to increase the number of contexts/threads. My gut feeling is that each thread should be able to do several GByte/s of throughput before it maxes out on a modern CPU, so you are unlikely to need very many of them, but I am interested to see performance and CPU utilization numbers if you collect them.

sahlberg commented 5 months ago

Please have a look at the new example: nfs-pthreads-async-writefile

This uses 4 contexts to handle socket I/O and creates multiple worker threads, each copying a 10MByte region of the file to the server. Each worker thread sends all of its async pwrite calls with no flow control and then just waits until all the writes have completed and all the callbacks have been executed; the completion hand-off is sketched below.

This will create an absolutely enormous amount of concurrent I/O and should saturate any network link. I would be interested to hear about the highest sustained bandwidth you can keep with this, and at what CPU utilization.
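
A minimal sketch of the per-worker completion hand-off (struct region and the names are illustrative, not the example's own code; the callbacks run on the service thread, hence the mutex/condvar):

```c
#include <pthread.h>
#include <nfsc/libnfs.h>

/* Completion bookkeeping per worker/region. */
struct region {
        pthread_mutex_t mu;
        pthread_cond_t  cv;
        int             pending;   /* writes still in flight */
};

static void region_cb(int status, struct nfs_context *nfs,
                      void *data, void *private_data)
{
        struct region *r = private_data;

        pthread_mutex_lock(&r->mu);
        if (--r->pending == 0)
                pthread_cond_signal(&r->cv);
        pthread_mutex_unlock(&r->mu);
}

/* Worker: after issuing every nfs_pwrite_async() for its region with
 * region_cb/r as the callback, it simply waits for the last one. */
static void wait_region(struct region *r)
{
        pthread_mutex_lock(&r->mu);
        while (r->pending > 0)
                pthread_cond_wait(&r->cv, &r->mu);
        pthread_mutex_unlock(&r->mu);
}
```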

sahlberg commented 5 months ago

The async write issue has been resolved, and this thread has moved on to development of new features. Closing this issue, as the corrupted writes are resolved.

We can continue talking about further improvements via email.

I was thinking about maybe adding splice() support on Linux to eliminate the read/write to the socket for cases where we just want to copy data between an open file descriptor and an NFS file. That could potentially reduce the CPU load on the service threads quite a bit for a pure "copy file to/from NFS share" workload. If you are interested in that direction and a new API like nfs_pread_splice[_async](..., int fd, size_t count, ...), let me know.