MQTT to Kronometrix concepts

sparvu commented 5 years ago

Core Concepts

Kronometrix MQTT Databus (MQTT KBUS)

A1. Kronometrix SID, TID, DSID and DEVID are all MQTT KBUS concepts. They have nothing to do with MQTT and must not be part of any MQTT topic.

A2. We should keep MQTT communication separate and not force anyone to change, create or alter MQTT topics based on SID, TID, DEVICEID etc., which are Kronometrix internal concepts.

A3. MQTT KBUS must deploy its own version of the MQTT client, preferably async and non-blocking, capable of publishing to one or many topics on an MQTT broker.

A4. The MQTT KBUS must be capable of receiving other clients' topics via MQTT as soon as the other clients publish something on the TCP line.

A5. The MQTT KBUS could theoretically publish some topics too, or at this stage simply receive data from the other clients, when available.

MQTT to Kronometrix Data Mapping

B1. MQTT KBUS must have a way to define and configure, the following items:

the MQTT broker address, IP and port number
the MQTT authentication settings, if the MQTT broker is using authentication or SSL
the Kronometrix platform, SID, TID where the MQTT data shall be published

B2. The Kronometrix DSID and DEVICEID must be detected from the MQTT clients or overwritten by the MQTT KBUS itself. The following cases are possible:

For each MQTT client we generate a new DSID to be used on the K platform. This is the default mode
For each MQTT client, we can ask the IP via MQTT and based on that compute the DSID
- As long as we know the DSID we can decide if there are devices associated with the MQTT client or not. In general a MQTT client has many topics and contains many physical devices

DTPopa commented 5 years ago

Let's discuss the timestamp. Some devices are not capable of sending a timestamp; others measure data and timestamp it themselves. What I have seen on other MQTT platforms is that they store the timestamp of the data and/or the timestamp when the data was received. My proposal is to store both and, in case the device cannot send a timestamp, to assign the time of receipt to the data. Please tell me if this is a valid proposal.
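The proposal above can be sketched as follows. This is only an illustration in Python (the databus itself is configured in Lua), and the function name and the two field names are hypothetical, not part of any Kronometrix API.

```python
import time
from typing import Optional


def stamp_message(device_ts: Optional[int], received_ts: Optional[int] = None) -> dict:
    """Return both timestamps for one message (field names are hypothetical)."""
    if received_ts is None:
        received_ts = int(time.time())  # time of receipt on the databus
    return {
        # Fall back to the receipt time when the device sent no timestamp.
        "device_timestamp": device_ts if device_ts is not None else received_ts,
        "received_timestamp": received_ts,
    }
```

Keeping both values means a device with a wrong clock can still be detected later by comparing the two fields.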

sparvu commented 5 years ago

About time: yes, in short these are the main lines and we will follow them. We first need to clarify some ground concepts before jumping to time

irimiab commented 5 years ago

Let's clarify what's a "client": an MQTT client is an app which connects to the MQTT broker on a topic (or multiple topics) and can publish or simply wait for other clients to publish.

So our "MQTT KBUS" is an MQTT client. It simply connects to a topic (the topic can be defined with wildcards, in which case you can say it subscribes to multiple topics) and just waits for messages from other clients.

The "MQTT KBUS" client receives from the broker, when a message is published, only the following information:

the topic under which the message has been published
the payload of the message (or, in other words, the message itself)

No other information is available: not the publisher's ID, nor its IP, nothing!

Now, when you say

For each MQTT client we generate a new DSID to be used on the K platform

what do you mean by "MQTT client"? You mean an instance of "MQTT BUS", or a sender? And, either way, how is the IP important?

sparvu commented 5 years ago

I will call shortly about these. We need to clarify:

what a data source id is in this case: MQTT client to Kronometrix DSID
what a device id is
what it means to see traffic from other MQTT clients

sparvu commented 5 years ago

So here are some ground rules I understood from people who have been using MQTT for some time. Might help us.

we should have clear topics, no hidden agendas or weird naming conventions. topics must be clear and have no ids, key logs etc
the payload could contain a JSON where we could use a clientid as needed by our system, in our case the databus
the payload could have a simple structure containing the metrics, some other information and, if needed, a ClientID which can be a SHA512 etc
the clients must always be authenticated and authorised. it is very important to have a broker which can do this and allow SSL for secure communication from day 0

In general the payload is the way to propagate extra information between clients if needed. SSL must always be used along with authentication

sparvu commented 5 years ago

So, to summarize:

on the broker - the most intelligent and powerful way would be to implement the databus on the broker itself, but that means we would have to build our own broker, which is neither simple nor applicable right now
on the client - the next immediate option is to build the databus outside the broker, as we are discussing now, on an MQTT client. This would require the body payload to always meet some minimum requirements:
- ClientID: "XXXX", a string which can be anything
- We can't enforce a rule saying that the ClientID must already be a SHA256; some devices might not be capable of producing this
- We can compute the DSID within the databus on our side, as simply as SHA256 or SHA512(ClientID, 'MQTT', 'SID')

But the real problems are these:

The only exception would be what happens if the ClientID is entirely missing in the body payload
Some clients might not want to change their payload format to JSON
Some clients might not want to change and add to their payload the ClientID

sparvu commented 5 years ago

The most logical and powerful way to handle this is in the broker itself, in the form of a plugin or module to handle the MQTT load. This way we have access to all clients, and we can easily produce and convert traffic to Kronometrix from MQTT. But I do not know of any Lua-based MQTT broker, and none of our team members knows Erlang or is familiar with MQTT broker concepts.

On the other side, on the client, we could always have some minimum requirements: no matter the payload type (JSON, XML, etc.), we ask for a ClientID field which must contain a unique string to identify the client.

irimiab commented 5 years ago

I don't think that modifying an MQTT broker would be wise; the MQTT itself is just a transport layer, it shouldn't do anything else than distributing messages between clients.

If we want to accommodate in Kronometrix various clients, let's find out about them: how do these clients communicate with each other, what protocol, what details.

From the little research I've done regarding other "analytics" platforms (or, rather, IoT platforms) that use MQTT, this is their architecture:

they use their own broker in order to implement authentication using the same credentials as on the platform; we can do this if we install Mosquitto near the "Auth" Redis database
they require a specific structure for the topic; most of them include the client ID in the topic path
some of them require sending each parameter to a specific topic, as a numerical value. For example, to send the air temperature, you would have to send the number to a topic like this: prefix/<client_id>/last/ta
other platforms use JSON payloads. You send a JSON payload with all the required details.
most of them attach the timestamp to the data automatically. The attached timestamp is the timestamp when the data has been received. The smarter ones allow sending a timestamp by yourself; if it is missing, the timestamp when the data has been received is attached. These are the ones using JSON
most of them don't require SSL

Don't imagine that I studied dozens of cases 😃 I just did a little research on a couple of them (Wia, Elastic Search, Watson and a few others I can't remember now).

sparvu commented 5 years ago

Remember the topic: we are discussing how one would capture data from n MQTT clients and convert it to Kronometrix messages for analysis.

If you have studied this then you already know that MQTT is a vast topic with unlimited use cases, types of payloads, clients etc. So there cannot be a single solution to cover almost all cases unless:

A. you attack the problem within the broker itself, where you can for example extend the functionality by allowing it to convert the messages towards our platform. You do not need to touch any MQTT functionality, but you add to it for our own purpose

B. you keep the logic in yet another MQTT client and use some recommendations as already mentioned above. The payload is the simplest aspect, which can be turned in our favour.

C. Security is a very important topic in MQTT which requires attention from day 1.

So we just need to review and choose A, B, C and carry on with the plan.

sparvu commented 5 years ago

Regarding the brokers, this I have reviewed a bit last year. Some recommendations and good alternatives were: https://vernemq.com https://github.com/emqx/emqx

The best according to my findings was Emqx, which is powerful enough and has a flexible monitoring part done in VueJS.

irimiab commented 5 years ago

Now I understand why you mentioned Erlang :) Both these brokers are written in Erlang.

Emqx seems nice, indeed.

Regarding the A, B, C options, I already stated my opinion: no point in modifying the "transport" layer (this is the MQTT broker); it makes more sense to implement the logic in an MQTT client (option B). To ensure security, I also proposed to use our own broker installation that integrates Redis AUTH database for authenticating Kronometrix users.

sparvu commented 5 years ago

We can, as simple as that, select B and dive into the payload requirements. Sounds good to me.

Regarding the broker: emqx is probably the best broker out there which supports authentication over SSL. All inside. No need for anything else.

sparvu commented 5 years ago

So let's focus on B, where we process and try to identify the MQTT clients using an MQTT client. Some clarifications:

MQTT clients can set, in the topic or in the payload, a custom string which defines the client id. There are no rules and cases might be different
There can be any string format which can define the ClientID
It can show up in the topic or in the body payload

So based on this our databus first must have:

a simple way to configure whether the clientid is identified from the topic or from the payload
a string which should be used to detect what the client id is. Can be ClientID="xxx", MotherboardID="xxx-xxx-xxxx", MachineID etc. We need somewhere to define the keyword from which we shall parse the clientid
when we know the clientid we compute the DSID

That's the first phase: MQTT clientid to DSID parsing

irimiab commented 5 years ago

Ok. To extract an ID from the topic we can use a regular expression. But to extract it from the payload is more complicated, because the payload can be multiline, can have various formats, can even be binary. How do you see the configuration for this "client ID" extraction? (You know we use JSON for settings.)

sparvu commented 5 years ago

we can always start with the topic, followed by the payload when we have a use case. We need a way to differentiate in the config, where we define how and from where we fetch the client id string.

On the payload, what is the hard part in searching for a string and finding its value? For the content, you just search the body of text, one line or multi-line etc. I don't see a problem with that.

irimiab commented 5 years ago

Ok. What are the action points for this?

sparvu commented 5 years ago

Build a simple configuration where we can define how we shall identify the client id.
Allow two options: topic | payload
If configured as payload, write an error string to the logs: not implemented yet
For topic, define a way to parse and find the client id using a string defined under configuration file and make the DS mapping based on the simplest method: DSID = SHA256 (ClientID, MQTT, SID)
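The action points above could be sketched roughly as below. This is an illustration in Python rather than the databus's own Lua; `find_client_id`, its parameters and the logger name are hypothetical, and the pattern here is a Python regular expression, not a Lua pattern.

```python
import logging
import re
from typing import Optional

log = logging.getLogger("mqtt_kbus")  # hypothetical logger name


def find_client_id(source: str, pattern: str, topic: str, payload: bytes) -> Optional[str]:
    """Return the client ID according to the configured source, or None."""
    if source == "topic":
        match = re.search(pattern, topic)
        # The first capture group of the configured pattern holds the client ID,
        # so the pattern is expected to contain exactly one group.
        return match.group(1) if match else None
    if source == "payload":
        # Second option is accepted by the configuration but only logs an error.
        log.error("client ID extraction from payload: not implemented yet")
        return None
    raise ValueError(f"unknown client ID source: {source!r}")
```

Rejecting unknown values of the source setting early keeps a typo in the configuration from silently dropping every message.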

irimiab commented 5 years ago

The SID will be set in the configuration too? And by MQTT, you mean the URL of the MQTT broker?

sparvu commented 5 years ago

yes.

https://github.com/kronometrix/mqtt/issues/4#issue-424883616

MQTT to Kronometrix Data Mapping

B1. MQTT KBUS must have a way to define and configure, the following items:

the MQTT broker address, IP and port number
the MQTT authentication settings, if the MQTT broker is using authentication or SSL
the Kronometrix platform, SID, TID where the MQTT data shall be published

irimiab commented 5 years ago

Done. This is the proposed configuration structure:

local kronometrix = {
    {
        host = "127.0.0.1",
        port = 80,
        path = "/api/private/send_data",
        sid = "9ee583c7d0a8b314c947dccfdcd922ca", -- Computer Performance
        tid = "d5e077bb7d043f5bd93391d283072e1d"
    }
}

local mqtt = {
    server = "37.187.106.16",
    topic = "krmx/+/send_data",
    client_id_source = "topic",
    client_id_regexp = "krmx/(%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x)/send_data"
}

Multiple Kronometrix destinations can be defined. To extract the client ID from the topic, you need to specify the regular expression (Lua-like) in the key client_id_regexp. The DSID is now generated using SHA256 based on the extracted client ID, the MQTT server URL and the SID.
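A minimal sketch of that DSID derivation, in Python for illustration. The thread does not say how the three inputs are combined, so the join order, the newline separator and the use of the full 64-character hex digest are all assumptions, and `compute_dsid` is a hypothetical name.

```python
import hashlib


def compute_dsid(client_id: str, mqtt_url: str, sid: str) -> str:
    """Hash the three inputs into a hex DSID (order and separator are assumptions)."""
    # A separator avoids collisions such as ("ab", "c") versus ("a", "bc").
    material = "\n".join([client_id, mqtt_url, sid]).encode("utf-8")
    return hashlib.sha256(material).hexdigest()


# Example with the server and SID from the configuration above and a made-up client ID.
dsid = compute_dsid("client09", "37.187.106.16", "9ee583c7d0a8b314c947dccfdcd922ca")
```

Because the function is deterministic, the same client on the same broker and subscription always maps to the same DSID, with no lookup table needed.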

sparvu commented 5 years ago

Rename the client_id_x keys to the simpler clientid_xxx.

client_id_regexp = "krmx/(%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x%x)/send_data"

what is this ?

irimiab commented 5 years ago

That's a regular expression. It means "32 hexadecimal digits". It can be any regular expression (Lua-like).

sparvu commented 5 years ago

Let's do it like this: please provide some sample examples if you plan to use technical regex expressions within configuration files. These do not sell. Nobody buys %x.

I have nothing against the use of regex, but please document it with 3-5 samples within the config. Otherwise change it into something which can be used for sales.

irimiab commented 5 years ago

Are you asking me to document Lua regular expressions? :) This is the documentation page: https://www.lua.org/pil/20.2.html A couple of examples:

clientid_regexp = "prefix/(.+)/suffix" -- all characters in the topic "path" between a prefix and a suffix
clientid_regexp = "prefix/(%w+)$" -- all alphanumeric characters in the topic "path" between a prefix and the end of the string
clientid_regexp = "prefix/(%d%d%d)$" -- three digits at the end of the topic string

If you find it too complicated and "will not sell", please propose something else.

sparvu commented 5 years ago

You could use 2 or 3 examples, short not long, showing how one would understand what to look for, like here: https://gist.github.com/nerdsrescueme/1237767 - For example, the first one would be looking for something as simple as ClientID="client09-machine", which will allow us to fetch client09-machine

The main idea is that the default values, or whatever you keep in the config by default, must look concise and simple, not ugly (even if that is a legit regex construct).
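The ClientID example mentioned above might look like this. The sketch uses Python's `re` module for illustration; the sample text is made up. This particular pattern string uses only constructs (a literal prefix, a negated set and `+`) that, to my knowledge, Lua patterns also support, so it should carry over to the config unchanged, but that is worth verifying.

```python
import re

# Captures everything between the double quotes that follow ClientID=
CLIENTID_PATTERN = r'ClientID="([^"]+)"'

# Hypothetical text in which the client ID appears.
sample = 'ClientID="client09-machine" ta=22.7 rh=52.7'

match = re.search(CLIENTID_PATTERN, sample)
client_id = match.group(1) if match else None
print(client_id)  # prints: client09-machine
```

A default like this reads almost like the text it is looking for, which is the kind of concise configuration value asked for here.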

irimiab commented 5 years ago

So my examples above are too complex?

sparvu commented 5 years ago

keep your examples followed by a practical example. use ClientID as a main example ...

irimiab commented 5 years ago

Hello. Can we make some progress on this?

irimiab commented 5 years ago

So after a long discussion about MQTT, here are some aspects as I see them:

we can make a parallel between HTTP and MQTT as transport layers used to manipulate arbitrary data between 2 endpoints. I consider them similar, with 2 main differences:
- MQTT is lighter, which makes it more fit for short messages and for small devices (low power)
- MQTT is of type "broadcast", whereas HTTP is peer-to-peer (better said client-server)
in my opinion, in relation to Kronometrix, the property of MQTT being "broadcast" is of no importance (because we listen on a single endpoint and we don't use this multi-point aspect in any way)
the aspect of MQTT being a lighter protocol could indeed mean it is used more on less smart devices (than HTTP); in this case, one could argue we should expect less intelligence from the device, so we should require less information from it. Moreover, these small devices, being less smart, are less accessible, so in case of a change how do you access them?
on the other hand, in my opinion, since MQTT devices are connected to the network, they aren't "less accessible" than HTTP clients. And including and updating some simple data in them (like SID, TID, DSID, MESSAGE ID etc.) should be straightforward, because any MQTT-able device should allow configuring some specific topics on which they send data. The topic could include all necessary information (like TID, maybe SID etc.). And this is indeed similar with what other platforms expect (see my comment above)
if we consider MQTT clients really dumb and "rigid" devices, we will try to introduce a lot of intelligence in the "databus". This, in my opinion, has a couple of disadvantages:
- because the MQTT "databus" communicates only one way with Kronometrix (from the databus to Kronometrix), it will not be able to react to changes in Kronometrix. What if I want to add a subscription in Kronometrix? How will the databus know of this subscription? It won't, manual intervention will be needed (to update the configuration of the "databus")
- how will we decide what devices will the MQTT "databus" be able to understand? From buses to weather stations, they all might be different. Shouldn't they be brought to some common ground if we want to integrate them to Kronometrix? Why should we "translate" them as they are and not imposing a set of "good understanding rules"? This, too, is what other platforms do (see again my comment above)

As a bottom line, in my opinion:

HTTP is a transport protocol that Kronometrix uses natively and for which it offers an API
MQTT is a transport protocol for which we would need a translator to HTTP, so that clients are able to use the same API (or a subset of it). No other intelligence should be included in the "databus". (This is exactly what we are doing in the AviMet case too)

Simple and straightforward.

irimiab commented 5 years ago

To be more specific, I would propose this format:

TID, SID, DSID, DEVID, MESSAGE ID in the topic: /krmx/5556789e2b06f2018859c0bc1d93bea1/b6ac411d5960dabfb804f94577a3cd0f/my_dsid/my_dev/iaqd-g01

payload as JSON:

{
"timestamp":1554129158,
"ta":22.7,
"rh":52.7,
"td":-3.9,
"co2":632,
"voc":480
}

We could move the "message id" from the topic to the payload.

In the payload, timestamp can be missing, in which case the current timestamp (when the message has been received) will be used.
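To make the proposed format concrete, here is a rough Python sketch of how a translator could split such a topic and decode the payload. The names `Route`, `parse_topic` and `parse_payload` are hypothetical, and the segment order (TID, SID, DSID, DEVID, message ID) is assumed from the order in which they are listed above.

```python
import json
import time
from typing import NamedTuple, Optional


class Route(NamedTuple):
    tid: str
    sid: str
    dsid: str
    devid: str
    message_id: str


def parse_topic(topic: str) -> Optional[Route]:
    """Split /krmx/<TID>/<SID>/<DSID>/<DEVID>/<MESSAGE ID> into its parts."""
    parts = topic.strip("/").split("/")
    if len(parts) != 6 or parts[0] != "krmx":
        return None  # not a topic in the proposed format
    return Route(*parts[1:])


def parse_payload(payload: bytes, received_ts: Optional[int] = None) -> dict:
    """Decode the JSON payload, using the receipt time if timestamp is missing."""
    data = json.loads(payload)
    if "timestamp" not in data:
        data["timestamp"] = received_ts if received_ts is not None else int(time.time())
    return data
```

With this split the topic carries only routing information and the payload carries only measurements, which matches the arguments that follow.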

Arguments:

the topic is used for saying where to send data - what subscription, in which user's account, under what datasource
the payload is used for sending the data itself

This is in line with what I've seen on other platforms and it makes sense for me like this.

irimiab commented 5 years ago

I consider my proposal to be a good starting point, quite easy to be adopted by various MQTT-able devices. When specific cases arrive, we might add different functionalities to the MQTT "databus". For instance, to integrate a big client, maybe we could do something more specific for them. But as a general feature set for MQTT, I find my proposal just good.

sparvu commented 5 years ago

ok, some first questions:

if you have 100 MQTT clients, which would be easier: to modify 100 client configurations, or one configuration?
how do you plan to keep track of DSID and DEVICE ids on each MQTT client?
if you have 100 MQTT clients online and functioning, and you add 50 new clients and remove 20, what would be easier: to have a single place where all clients meet and map to Kronometrix, or for each client to manage and handle the Kronometrix identifications (DSID, DEVID, etc.)?
how do you plan to protect the TID for solutions which cannot secure the MQTT communication?

irimiab commented 5 years ago

If the 100 MQTT clients are from the same buyer, then yes, it is simpler to administer them server-side. But if the 100 MQTT clients are sold to 100 people, then we will need to make 100 configurations on the server. And we will have to make these configurations ourselves. Whereas if the configuration is on the device, we can ask the buyers to make their own configuration (of course, with proper tools we need to offer).

As regarding the TID, this is indeed a valid concern which I thought of a little bit last night.

Security

In my opinion, we will need security end-to-end.

If we store the TID on the "databus", any client having access to the MQTT broker will be able to send data to Kronometrix. It's trivial to subscribe to some topics (using wildcards), see what's the protocol, then send whatever data to Kronometrix. In my opinion, this is not acceptable.

I think we need a way to prevent unauthorized clients (clients without a TID) to connect to the MQTT broker. This is in line with what other platforms do.

irimiab commented 5 years ago

Taking a look at EMQ X, I saw it has some nice capabilities regarding ACL and authentication. We can either use authentication (via Redis or via HTTP basic auth), or we can use the ACL to prevent the clients to "sniff" on other client's tokens, then validate the token when they publish.

sparvu commented 5 years ago

Look the big picture. You have 100 MQTT clients, one buyer, 30 whatever buyers etc

If the 100 MQTT clients are from the same buyer, then yes, it is simpler to administer them server-side. But if the 100 MQTT clients are sold to 100 people, then we will need to make 100 configurations on the server. And we will

It is hard to make the modifications in 100 places, no matter whether these 100 devices or clients come from 1 buyer or 100. You literally have to make 100 modifications to change something which anyway has nothing to do with MQTT. It is not logical.

Instead you can allow your MQTT client, part of the databus product, to subscribe to different topics and handle that in a single place, nice and easy.

Even in your example, having 100 buyers will substantially increase the risk that something gets broken when you want your 100 users to change the configs.

Dont you agree ?

irimiab commented 5 years ago

Having to manually provision every new "buyer" isn't a good idea, in my opinion. The buyer will have to make his own account on Kronometrix; why shouldn't he provision his own devices?

sparvu commented 5 years ago

I did not ask about buyers or provisioning. I just ask: what do you think is easier to handle, 100 modifications or only one? To me the answer is obvious. One. And that's on the databus itself. Don't you agree?

irimiab commented 5 years ago

Yes, it's easier to handle one modification. But it's even easier to handle zero modifications (and to let the owner of the devices make these modifications).

sparvu commented 5 years ago

ok, we are coming to some consensus. So yes, it is much easier to have in a single place the config for 2 or 100 MQTT clients or 10.000. Now our goal is to establish common grounds on what we are building. A MQTT Databus. Thats what we are after.

I will list here again the top considerations of what is a databus and how to do it.

includes a MQTT client which can subscribe to different topics, on a certain broker
the databus might offer a broker, but this won't be anytime soon, not in calendar 2019. Maybe in 2020 we can integrate a broker, once somebody is financing this activity. This is a separate track.
when data arrives from one or many MQTT clients, the MQTT client of our Kronometrix MQTT Databus should fetch the content from the other MQTT clients and pass that content, the payload(s), to the databus itself for Kronometrix conversion and transport towards platform analytics
the platform analytics will not speak MQTT. Nor DDS nor anything else than HTTP 1.0, and future HTTP 2.0. In fact HTTP 2.0 will be a very important part of future analytics for our platform.
so the databus is the bridge between the MQTT world and the Kronometrix world
to keep changes on the MQTT front (clients, etc.) to a minimum, our databus must offer several capabilities:
- to map MQTT clients to Kronometrix DSIDs
- could group or configure them to certain subscriptions if needed
- map MQTT payloads to Kronometrix data messages
- provision data
of course if required, the databus could in fact offer a REST API interface that allows this, offering support to configure via Web the MQTT clients to different Kronometrix subscriptions, etc

sparvu commented 5 years ago

Further clarifications about TID

this is a very important element which must be kept secured and private at all costs
its place must be on the databus, because that is the most secure place
we can't place the TID on the MQTT client, nor allow end users to configure it on their devices, for two reasons: simple management and security
we cannot guarantee that all MQTT solutions will always use a secure communication
we will not be able to offer our own broker anytime soon (not in 2019)
the databus is responsible for and manages:
- MQTT communication, using a MQTT client, fetching the MQTT messages
- Kronometrix internal operations: authentication and authorisation, DSID, the data message and provisioning

Therefore I would suggest that the TID, like most other information, should sit and be manageable on the databus itself. Again, if the management is a concern then we can offer a REST API for that.

sparvu commented 5 years ago

So lets review, summarise and conclude what options and path we take and why.

irimiab commented 5 years ago

I still don't understand how you will prevent anybody from sending data to Kronometrix via the MQTT broker.

sparvu commented 5 years ago

We will have a maximum number of DSIDs or MQTT clients allowed on the databus. This is an option which will allow us to say that no more than 200 MQTT clients are allowed. These 200 MQTT clients will then be mapped to K DSIDs and processed.

You can think, in your own time, about how we can implement this max control. The Databus must display its configuration and, in the logs during start, the number of clients allowed. We can re-use the platform.json as a form of 'licensing' or whatever else you want to call it, where we configure the max clients. I can formalize the databus.json config tonight.

Then we need a crypto way to enforce this limit.
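The capacity limit described above could be sketched like this, in Python for illustration. `ClientRegistry` and `admit` are hypothetical names, and the sketch covers only the counting part, not the cryptographic enforcement of the limit.

```python
from typing import Dict


class ClientRegistry:
    """Maps MQTT client IDs to DSIDs, up to a configured maximum."""

    def __init__(self, max_clients: int) -> None:
        self.max_clients = max_clients
        self._dsids: Dict[str, str] = {}

    def admit(self, client_id: str, dsid: str) -> bool:
        """Return True if the client is already mapped or a free slot was used."""
        if client_id in self._dsids:
            return True  # known client, does not consume a new slot
        if len(self._dsids) >= self.max_clients:
            return False  # capacity reached, message should not be processed
        self._dsids[client_id] = dsid
        return True
```

Note that a plain counter like this admits clients on a first-come basis, so it bounds the load on the databus but does not by itself tell genuine clients from unwanted ones.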

sparvu commented 5 years ago

MQTT traffic cannot reach Kronometrix without a databus. And a databus has a capacity and a cost. Like everything else in life.

sparvu commented 5 years ago

Let me know if anything is still unclear. We will take the points one by one. For some I can't answer how we will do it technically, but I have the concepts for the high level design. I would love for us to close the discussion and debates quickly and move on to the low level design and implementation.

irimiab commented 5 years ago

And what stops a malicious user from using another client's ID to send bogus data?

For example, I subscribe to the same broker to all topics. I see what's happening there, then I use an existing client ID and send bogus data. Or even flood. Or I create new bogus clients so the real clients will be rejected (due to the max clients limit).

Basically, you have no protection against "bad people" on MQTT. Anyone can render that subscription unusable, if he intends so.

sparvu commented 5 years ago

ok, some clarifications:

we are not here to fix MQTT. We need to work with it
we need to protect our databus at all costs (max number allowed at one time, maybe some other criteria in place as protection: like clients allowed on the databus based on certain pattern, ids, etc )
the databus must have very clear configuration(s) and a capacity set at start. that capacity should act as a upper limit allowing MQTT clients to be mapped and processed through databus

sparvu commented 5 years ago

and, I hope we understood what we are planning to make:

a MQTT client which subscribes to a MQTT broker
your questions and concerns apply mostly if we run on a broker which does not support authentication; then yes, we might receive more traffic from more clients
if the broker supports authentication and SSL, our client must support joining such a setup too

I hope this answers your questions. Let me know if anything is still unclear.

irimiab commented 5 years ago

Sure. Let's proceed as you see fit for the project.

irimiab commented 5 years ago

So what's the next step on this one?
