It would be unwise to have a local server without any authentication other than sending a client_id. I'd recommend at least simple authentication using a username and password. I know it's not that safe, as the connection is not encrypted and the data could easily be spoofed on the local LAN, but at least it prevents the easiest ways of doing damage.
In one of my projects, I got around this limitation by placing a passphrase known to the server on the device. Any data received by the device had a sha1 checksum appended to it; the checksum was sha1(data + passphrase). If the checksum failed (i.e. an attacker tried to send something to the device), it was discarded. This way, no authentication details were passed over clear text. You would need physical access to the device to get the passphrase off it, but by then you could just re-flash it with whatever you wanted, so I don't see that as a limitation really.
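A minimal sketch of that scheme (function names and framing are illustrative, not the actual project code; MicroPython would use uhashlib instead of hashlib):

```python
import hashlib

PASSPHRASE = b'secret-passphrase'  # Provisioned onto the device alongside the firmware

def seal(data):
    # Append sha1(data + passphrase); sha1 digests are 20 bytes
    return data + hashlib.sha1(data + PASSPHRASE).digest()

def unseal(packet):
    # Return the payload if the checksum verifies, else None (discard)
    data, checksum = packet[:-20], packet[-20:]
    if hashlib.sha1(data + PASSPHRASE).digest() == checksum:
        return data
    return None
```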
In my use case -- sending short messages related to PIR or ultrasonic sensors -- packet loss is OK, but crashes / lockups are unacceptable.
One worry I have is whether all of the wifi disconnect/reconnects will cause wear on the external flash.
My two cents worth -- implement a tiny, yet robust example of one client and one server -- highlight how to solve the socket/wifi issues, without obscuring those critical details with a lot of extra features that could be implemented any number of ways.
@craftyguy that's an interesting idea but maybe a bit of overkill on an esp8266 with its limited resources. It could certainly be implemented as an advanced feature though.
@bvwelch disconnect/reconnects don't write to the flash, so I don't see how they would wear it. I agree, the most important goal is to have minimal resilient code. That way the code can easily be adapted if another platform needs a different implementation, and it can be extended in a subclass adding new features which are the same for every resilient device implementation.
@kevinkk525 My overriding aim was to minimise RAM use on the ESP8266 and to simplify the code as far as possible. I accept that it is not secure against attackers who have gained access to the LAN, but I think the architecture is much more secure against internet attacks compared to running web-facing applications on the ESP8266. I wanted it to work without use of cross-compilation or frozen bytecode. I have not achieved the former. Using daily/release builds, free RAM is on the order of 6KB/10KB - I feel this RAM should be for application code rather than for adding too many features. In particular I'm conscious of the RAM requirements of some sensor device drivers, and also of the fact that FBC may be a step too far for some users.
I did investigate some of the things you suggested.
I'm afraid I don't follow some of your suggestions.
I think I need to write some more demos. I think flow control and qos==2 are readily achievable at application level, but until I write some code it's just hand-waving.
I was acutely aware of the risk of re-inventing resilient MQTT. We have MQTT :) My aim was to design something which is as minimal as I could make it. I wish I could simplify it further.
My comparison with resilient MQTT is perhaps unwise. Resilient MQTT remains as a solution. The code in this repo is primarily intended for people who want to run other protocols and is therefore protocol-agnostic. But you could run Paho on the server app if you wanted to.
So the big question is, do we add features or try to minimise further? Your suggestion of adding features by subclassing seems to me to be the answer.
I also think I need to improve the docs and demos to clarify how I envisage this library being used.
Thanks for your answer. Not having to use FBC would be nice but on the esp8266 that is a really tough goal.
The task cancellation use-case is where an outage occurs when the client application is spending time doing nothing. The outage is detected by the lack of a keepalive. The other coros must be interrupted to enable the recovery process to begin. In the absence of cancellation the tasks did not terminate until the user code attempted to communicate.
Re message loss I think you haven't fully grasped how the code works. If a long outage occurs, messages aren't routinely lost. The write method (on both client and server side) pauses until the outage ends. A single message can be lost if it is sent in the interval between the outage occurring and being detected. If write is called with pause=False it might be possible to lose more than one in that interval. In typical running (even in the presence of long outages) messages are rarely lost.
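A hedged illustration of that behaviour (the exact client API is an assumption here, not quoted from the repo): with pause=True the write coroutine simply waits out an outage, so the application needs no retry loop of its own.

```python
import ujson

async def report(client, sensor):
    # client.write is assumed to be a coroutine taking one line of text;
    # with pause=True it blocks across an outage as described above.
    while True:
        reading = await sensor.read()  # Hypothetical async sensor
        await client.write(ujson.dumps(reading) + '\n', pause=True)
```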
If the client is offline for hours and the server application has accumulated many messages for it, that is really an application-level problem. It could send them, or combine them in some way.
Again, if the server-side application supports multiple internet protocols the application will have to arbitrate or combine them. This complexity has to be handled somewhere. My contention is that this should be done on the server in CPython. The application must do this and produce messages for each client. The messages are then sent to the appropriate Connection instance. My library simply provides a resilient full-duplex stream interface.
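Purely as an illustration of that division of labour (the Connection class and its write coroutine are assumptions standing in for whatever the server module provides), the CPython application might route messages like this:

```python
import json

connections = {}  # client_id -> Connection instance, maintained by the server code

async def broadcast(topic, value):
    # Application-level arbitration: build one message and write it to each
    # client's Connection; each write pauses if that particular client is down.
    line = json.dumps({'topic': topic, 'value': value}) + '\n'
    for client_id, conn in connections.items():
        await conn.write(line)
```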
You are correct in stating that newlines in the middle of messages are problematic. A message with such a newline will be received as two messages. The line-oriented protocol was chosen because MicroPython supports a socket readline method: using this saves code on the ESP8266. In all the use scenarios I've considered, a message would consist of a JSON encoded Python object. As I understand it, such messages will not contain newlines:
>>> a = ujson.dumps([1, 'rats\nrats'])
>>> '\n' in a
False
>>> a
'[1, "rats\\nrats"]'
>>>
I dug a little deeper into the cancellation as I really wanted it gone :D What is wrong with this code compared to using cancellables? In my short tests it works as well as the cancellables but also frees 3kB of RAM as FBC after removing everything cancellable-related from "primitives.py". (Code is from client.py, line 84; the "@asyn.cancellable" decorators are removed too.)
async def _run(self, loop):
    s = self._sta_if
    while True:
        while not s.isconnected():  # Try until stable for 2*.timeout
            await self._connect(s)
        self.verbose and print('WiFi OK')
        self.sock = socket.socket()
        self.evfail.clear()
        try:
            self.sock.connect(self.server)
            self.sock.setblocking(False)
            await self._send(MY_ID)  # Can throw OSError
        except OSError:
            pass
        else:
            # loop.create_task(asyn.Cancellable(self._reader)())
            # loop.create_task(asyn.Cancellable(self._writer)())
            # loop.create_task(asyn.Cancellable(self._keepalive)())
            _reader = self._reader()
            loop.create_task(_reader)
            _writer = self._writer()
            loop.create_task(_writer)
            _keepalive = self._keepalive()
            loop.create_task(_keepalive)
            await self.evfail  # Pause until something goes wrong
            self.ok = False
            # await asyn.Cancellable.cancel_all()
            asyncio.cancel(_reader)
            asyncio.cancel(_writer)
            asyncio.cancel(_keepalive)
            await asyncio.sleep(1)  # Wait 1 sec so that all coros are removed from loop.waitq
            self.close()  # Close sockets
            self.verbose and print('Fail detected.')
        s.disconnect()
        await asyncio.sleep(1)
        while s.isconnected():
            await asyncio.sleep(1)
You are right about message loss; my concern was about getting more messages after a reconnect than can be processed, and therefore losing some at that point. I understand that the accumulation of many messages is a high-level app problem, just like the support for multiple protocols. Guess I should start writing application code then.
This is a different (better) approach than the one I tried and rejected which involved testing flags and quitting.
I'll study it further tomorrow but at a first glance it looks good and a worthwhile improvement. I just need to convince myself that there is no circumstance where, at the end of the 1 second delay, a cancelled coro might still be pending. The guarantee that all have actually terminated is the thing that cancel_all brings to the party.
Re message handling: the assumption on the client is that there is a user coro which continuously reads; messages on the ESP8266 aren't buffered. On the server side they are (I need to document this). But typical applications on both sides will have a coro which spends most of its time waiting for messages.
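Roughly what such a client-side coro might look like (readline is an assumed coroutine on the client object and handle is an application placeholder):

```python
import ujson

async def reader(client):
    # Runs forever, so incoming lines are consumed as soon as they arrive
    # and never accumulate on the ESP8266 (which does no buffering).
    while True:
        line = await client.readline()   # Assumed API: one message per line
        handle(ujson.loads(line))        # Application-specific handler
```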
Re qos: I've written some code and done some testing of an application which implements qos==2 by the algorithm proposed in the README. It's hard to prove a negative but I experienced no missing messages through numerous outages. Duplicates do occur (as expected) but are discarded. The fact that this algorithm looks OK was an unplanned side effect of the way I defined the interface but it's very much simpler than using ACK packets. I'll document and post the code, but unless there is a subtle bug I'm convinced it should be done at application level.
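For concreteness, the general shape of such a scheme (my paraphrase, not necessarily the README's exact algorithm): the sender numbers each message and repeats the most recent one after a reconnect; the receiver discards any message id it has already processed, so duplicates are harmless.

```python
import ujson

# Sender side: tag each message with an incrementing id and remember the last one,
# since the detection window around an outage is the only place it can be lost.
mid = 0
last = None

async def send(client, obj):
    global mid, last
    mid += 1
    last = ujson.dumps({'mid': mid, 'data': obj}) + '\n'
    await client.write(last)

async def resend_after_reconnect(client):
    if last is not None:
        await client.write(last)  # May produce a duplicate; the receiver discards it

# Receiver side: drop anything whose id has already been seen
# (in practice keeping only the last id would suffice and save RAM).
seen = set()

def on_message(line):
    msg = ujson.loads(line)
    if msg['mid'] not in seen:
        seen.add(msg['mid'])
        process(msg['data'])  # Application handler
```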
Now pushed an update including the improvement from @kevinkk525, also a bugfix to server_cp.py. Thank you Kevin.
Thanks a lot for the update! By the way: you have a spelling error in README line 142. Will close this issue now and open new ones if needed.
Hello Peter, thanks a lot for your efforts! I am not completely done with the code yet but I'd like to share some thoughts with you:
1. QoS / guaranteed message delivery: The connection might be resilient but, as you state in the README, there is no guaranteed message delivery. This would have to be implemented in the app, which in my opinion is not a good idea: you would have to implement it manually in every possible app, or write a general app on top anyway. The common use case is unlikely to accept message loss, and handling loss in every app takes more effort and results in code bloat. Therefore I'd suggest implementing QoS directly in the driver.
2. Authentication: It would be unwise to have a local server without any authentication other than sending a client_id. I'd recommend at least simple authentication using a username and password. I know it's not that safe, as the connection is not encrypted and the data could easily be spoofed on the local LAN, but at least it prevents the easiest ways of doing damage.
3. Session: This seems to be handled in the app. In MQTT you have the option to connect with a clean session or to reuse the previous session from before the WiFi outage, possibly resulting in a load of new messages. If the app does not handle this, the server would flood the client with new messages and crash it. This makes it necessary either to implement some session management, or to have the server dump messages, or to let the client take its chances. Another option would be number 4, at least for short outages; if the outage is longer, a clean session is needed anyway, as after a few hours there could be hundreds of messages.
4. Flow control: As the main targets of this driver are microcontrollers with limited RAM, it could be interesting to implement flow control. This could be done in a very simple way, basically like having a socket lock with a timeout (a rough sketch follows after this list).
5. Generic app interface: The current implementation seems to need one client object per app. This is OK if you only use one app, but how do you differentiate between MQTT messages, a "get" request and more? Spawning several clients has a huge overhead and would consume a lot of RAM. The solution would be to make the driver more generic by sending an additional header including an app ID. Then server and client both know where a message belongs and can use a single socket and driver instance for communication. An app then simply registers its callback for new messages and gets a basic set of functions for writing to the socket. This saves RAM and makes it very modular, so you could easily spawn a temporary app for some "GET" requests; the server would recognize the app name and spawn a new server app for it with a unique app ID. The exchanged messages would then be similar to MQTT, having a bytes header first, something like: app_name (numeric), app_id, message_id, length, qos, duplicate (a packing sketch also follows after this list).
6. Cancellables: I understand the use of cancellables and I have not tested the difference in RAM usage between using them and an alternative, but looking at the code I assume they use a lot of RAM. Having an alternative should make the driver less RAM-intensive and the code simpler. Again, this is just a thought, as I have not actually tested it yet.
7. Comparison to your mqtt_driver: In the README you state that this driver uses less code than the MQTT implementation, which is correct, but in my opinion it is a weak comparison as the MQTT implementation has a lot more features and is therefore naturally more complex.
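Regarding point 4, roughly what I mean by a socket lock with a timeout (a sketch only, written with standard asyncio names; the driver's write coroutine and whatever primitives uasyncio offers are assumptions here):

```python
import asyncio

class FlowControl:
    # One in-flight message at a time: the next write waits until the peer
    # has signalled "ready" or the timeout expires.
    def __init__(self, timeout=5):
        self._ready = asyncio.Event()
        self._ready.set()
        self._timeout = timeout

    async def write(self, conn, line):
        try:
            await asyncio.wait_for(self._ready.wait(), self._timeout)
        except asyncio.TimeoutError:
            pass                # Peer silent: send anyway rather than block forever
        self._ready.clear()
        await conn.write(line)  # conn.write is the driver's (assumed) send coroutine

    def peer_ready(self):
        # Called when the peer signals it has processed the previous message
        self._ready.set()
```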
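And for point 5, a rough illustration of such a header; only the field names come from the list above, the sizes and format are guesses (MicroPython would use ustruct):

```python
import struct

# app_name, app_id, message_id: one byte each; length: two bytes; qos, duplicate: one byte each
HDR = '!BBBHBB'

def pack_header(app_name, app_id, message_id, length, qos, duplicate):
    return struct.pack(HDR, app_name, app_id, message_id, length, qos, duplicate)

def unpack_header(buf):
    return struct.unpack(HDR, buf[:struct.calcsize(HDR)])
```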
I'd be interested in your thoughts about this. I'm not sure my feedback fits the direction you were headed with the driver. I'd happily contribute to the driver and write a server-side app for mqtt once the driver protocol is decided.