servalproject / serval-dna

The Serval Project's core daemon that implements Distributed Numbering Architecture (DNA), MDP, VoMP, Rhizome, MeshMS, etc.
http://servalproject.org
Other
171 stars 80 forks source link

MDP sockets support #39

Closed rom1v closed 11 years ago

rom1v commented 11 years ago

It is not possible to use several "MDP sockets" with the API: there are no "sockets" exposed (in fact, there is only one per app, as mdp_client_socket is an external variable in mdp_client.c).

This is a problem, because the overlay_mdp_recv() function just retrieve the "first" packet, whatever it is. If it is not the right port, it is ignored (but the thread which listened to that port will never receive it, it is dropped).

Therefore, there is no way to "multiplex" services. If a service wants to listen on port x and another on port y, some packets from x will be "received" by the recv function of y (which will just drop it, but in that case this packet is lost for no reason and the recv() function considers it has received a packet which is invalid).

Moreover, functions like overlay_mdp_getmyaddr() or overlay_mdp_bind() do call overlay_mdp_send(), which can call overlay_mdp_recv(). The received response could be one packet from another device, which has nothing to do with the request, so it will be dropped. The easiest way to see the problem is to send a packet to yourself (to your own SID), and follow what happens in debug mode.

To fix it, I changed a bit the MDP API and implementation for providing "MDP sockets" support (commit f55e2b7):

int overlay_mdp_client_socket(void);
int overlay_mdp_client_close(int mdp_sockfd);
int overlay_mdp_client_poll(int mdp_sockfd, time_ms_t timeout_ms);
int overlay_mdp_recv(int mdp_sockfd, overlay_mdp_frame *mdp, int port, int *ttl);
int overlay_mdp_send(int mdp_sockfd, overlay_mdp_frame *mdp,int flags,int timeout_ms);

overlay_mdp_client_init() is replaced by overlay_mdp_client_socket(void). Instead of returning 0 or -1, it returns the socket descriptor or -1. overlay_mdp_client_done() is replaced by overlay_mdp_client_close(int mdp_sockfd).

Each function now takes a parameter mdp_sockfd. This mdp_sockfd is created by overlay_mdp_client_socket(). There is no global mdp_client_socket anymore in mdp_client.c: instead, each service creates and uses its own MDP socket.

Then, I adapted usage in commandline.c.

Before that, I needed to make a tiny change: make overlay_mdp_client_socket_path local to overlay_mdp_client_init() (now renamed overlay_mdp_client_socket() for consistency with socket API). It was easy, because the unlink in overlay_mdp_client_done() was useless (it was already done in overlay_mdp_client_init()). (commit 263232b)

Quite independently, for making it work, I also needed to make overlay_mdp_recv() blocking. (commit 9d44386)

What I tested after these changes:

What I did not tested:

Please give me any feedback.


Tip: for some functions, you should use git diff -w for ignoring whitespace differences (for example if I removed a if() {} condition, the indentation has changed).

lakeman commented 11 years ago

So each binding requires a separate socket connection? I guess that should work without too much impact to the existing code.

Long term I would have preferred to build a local port binding lookup table, with handler function for processing each payload. But that will require rewriting each command line operation to work in an event driven manner.

However most of the tests in the tests/dnaprotocol script are failing to resolve phone numbers.

rom1v commented 11 years ago

So each binding requires a separate socket connection? I guess that should work without too much impact to the existing code.

Yes, that way we can provide a socket-like API to the client easily.

Long term I would have preferred to build a local port binding lookup table, with handler function for processing each payload. But that will require rewriting each command line operation to work in an event driven manner.

What would be the benefits?

However most of the tests in the tests/dnaprotocol script are failing to resolve phone numbers.

Thank you for your feedback. I just committed a fix (4b273b4).

rom@rompc:~/Documents/serval-workspace/batphone@development/jni/serval-dna@mdpsock$ tests/dnaprotocol
1. Lookup by wildcard... PASS
2. Lookup by empty string... PASS
3. Lookup non-existent phone number... PASS
4. Lookup local phone number... PASS
5. Lookup remote phone number... PASS
6. Node info auto-resolves for local identities... PASS
7. Node info resolves remote identities... PASS
8. Lookup phone number three nodes reply... PASS
9. Lookup phone number two nodes reply... PASS
10. Lookup phone number one node replies... PASS
10 tests, 10 pass, 0 fail, 0 error
rom1v commented 11 years ago

I just added MDP JNI bindings native-part.

It provides the implementation of native methods of MeshSocket API. (java-part is here: https://github.com/servalproject/batphone/pull/51)

rom1v commented 11 years ago

I merged the current development branch.

https://groups.google.com/d/msg/serval-project-developers/bebngkEXesI/801JB958dZEJ

One minor concern I have is overlay_mdp_recv() becoming blocking. Because servald is designed to be single-threaded, we need to be able to support fully asynchronous operation. As servald may end up using MDP socket services as a client (eg to talk to other servald instances on other nodes to exchange various information), we need a non-blocking solution. It might just be a flag passed in to overlay_mdp_recv() to tell it to be non-blocking, so that we have that option for when it is needed.

Using the same pattern there was in mdp_client.c, the code would become (+ possibly a flag for enabling/disabling it) :

int overlay_mdp_send(int mdp_sockfd, overlay_mdp_frame *mdp, int flags, int timeout_ms)
{
  //...
  set_nonblock(mdp_sockfd);
  int result=sendto(mdp_sockfd, mdp, len, 0, (struct sockaddr *)&name, sizeof(struct sockaddr_un));
  set_block(mdp_sockfd);
  //...
}
int overlay_mdp_recv(int mdp_sockfd, overlay_mdp_frame *mdp, int port, int *ttl)
{
  //...
  set_nonblock(mdp_sockfd);
  ssize_t len = recvwithttl(mdp_sockfd,(unsigned char *)mdp, sizeof(overlay_mdp_frame),ttl,recvaddr,&recvaddrlen);
  set_block(mdp_sockfd);
  //...
}

Since my commits providing "MDP sockets", mdp_sockfd is given as a parameter (instead of being a global variable), so I think it is the responsability of the caller to mark the socket as blocking/nonblocking, and manage what to do when non blocking and errno == EAGAIN || errno == EWOULDBLOCK.

Possibly, the blocking/nonblocking mode could be a parameter of the MDP socket init function overlay_mdp_client_socket.

What do you think ?

quixotique commented 11 years ago

The best way for a library to support different main-loop strategies (blocking/non-blocking, threaded, coroutines, callbacks, etc.) is to simply get out of the way and let the caller set up what it needs to.

In this case, the MDP client library has no business calling set_nonblock() and set_block() on the socket. The caller should do that. The library already exposes the socket's file descriptor as the mdp_sockfd parameter, so this is trivial for the caller to do. The library could expose the set_block() and set_nonblock() calls (in fact, macros) as part of its API as a convenience for the caller, but the caller can always just use the fcntl(2) sys call directly.

What the MDP library should do is to make sure that it handles the EAGAIN (and EWOULDBLOCK if present) errors as normal conditions, ie, without logging an error or even warning. See the definitions of the _write_nonblock() and _read_nonblock() functions for an example of what this means. In fact, recvwithttl() appears to already do this. It does not treat EAGAIN or EWOULDBLOCK as errors, and handles the case of len == -1 the same as len == 0.

rom1v commented 11 years ago

I totally agree with your analysis.

I also noticed a little side effect: every time a socket is created, it creates a new file in /data/data/org.servalproject/var/serval-node/. But when the socket is closed, the file is not deleted. This is not very important because when a collision occurs, we delete the file before binding a new socket, but there is no reason for keeping so many files like this:

mdp-client-2113-d4b10929.socket
mdp-client-2113-a812c682.socket
mdp-client-2113-e689b1f8.socket
mdp-client-2113-0c0916fb.socket
mdp-client-2113-5807c93e.socket
mdp-client-2113-734d5e29.socket
mdp-client-2113-8a0a67fa.socket
mdp-client-2113-386b4def.socket
mdp-client-2113-7cf2c673.socket
...

Therefore, I just commited some changes for keeping the mapping mdp_sockfd/sun_path, and removing the file on socket close (commit 92a28a2). If you see a better way to do that, please do not hesitate.

quixotique commented 11 years ago

As well as deleting temporary socket files on close, it is important to also clean up stale temporary files on start or at regular intervals. Our code does not always follow this policy because it is of experimental or prototype quality, but we are deliberately moving towards production quality code, so we need to start insisting on this kind of design principle in new code.

In fact, I suggest that, as a matter of design policy, unlink-on-close should possibly never be performed, instead we should always rely on the clean-up logic to remove stale files. This would ensure that testing can easily reveal that it is working, instead of waiting for exceptional events to discover that it is not.

Our policy for operation on any platform should always be to explicitly design (and test for) for power-off/SIGSEGV/SIGKILL events and long-term, unattended operation. So we can't afford to let temporary files accumulate (or unused database rows accumulate, or log files accumulate, etc.).

rom1v commented 11 years ago

As well as deleting temporary socket files on close, it is important to also clean up stale temporary files on start or at regular intervals.

You're right. I just implemented a clean up of all socket files on servald start (commit db2fefc).

Maybe we should clean up more (what else?). Maybe it is dangerous to clean up all socket files (will we need a socket persistent across servald restarts?). But it seems ok.

In fact, I suggest that, as a matter of design policy, unlink-on-close should possibly never be performed, instead we should always rely on the clean-up logic to remove stale files.

In my opinion, we should also keep the unlink-on-close policy (commit 92a28a2), because we can have a lot of socket files during a servald "session": every single command creates a MDP socket (and closes it).

rom1v commented 11 years ago

@lakeman You closed the pull request?

lakeman commented 11 years ago

Not deliberately. I deleted "master" as part of cleaning up our branch names, so I guess I got the blame for making the pull impossible. I should be able to find time to look at this patch properly next week.

rom1v commented 11 years ago

I just merged development branch into mdpsock (so mdpsock is up to date).

Unfortunately, github does not see it, but I think it is not important.

rom1v commented 11 years ago

I added Java bindings for flags and QoS management. https://github.com/rom1v/serval-dna/commit/8454944653d0a2ab8dd833d93104d35bbc632f8a https://github.com/servalproject/batphone/pull/51#commit-cfb9be76188124a245170f545362f9f2035f6b51

rom1v commented 11 years ago

Due to this pull request unintentional close, I opened a new one: https://github.com/servalproject/serval-dna/pull/53