stevecope / mqtt-data-logger

Logs MQTT Data to a file
MIT License

Strategic Vision #5

Open · PythonLinks opened this issue 6 years ago

PythonLinks commented 6 years ago

Strategic Vision

There is a big problem in MQTT data logging. If you store your data in one file per topic, the disk heads jump around on every write. It does not scale well.

On the other hand, if you store all of your data in one file and then try to read the data back by topic, the disk heads jump around on every read. It does not scale well either.

I should also point out that there are two application areas. One is IoT data logging; the other is chat servers. I am interested in trees of chat servers, where the user wants the recent messages to load really fast. The needs of IoT may be different.

What is one to do? I have been scratching my head over this problem for a year now. I am the guy who created the first data-logger repository for Steve, in March of 2018. I think I finally figured out the answer.

Of course you can use the Kafka message broker. But that is for big data: lots of servers, lots of redundancy, people who do not want to lose a single byte of data. Me, I just want a simple, small solution for one server. Plus, maybe Kafka does not even know about topic hierarchy.

And I do not want to write lots of software. As far as possible I want to reuse solid stable existing packages. Like this one!

My key insight is that there are both hard-drive-based file systems and RAM-based file systems. On Linux, the /tmp directory is often a RAM-based file system (tmpfs), and tmpfs now swaps to disk when needed. Perfect.
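You can check for yourself whether /tmp is RAM-backed. Here is a tiny sketch, assuming Linux, where /proc/mounts lists the mounted file systems:

```python
# Check whether /tmp is a RAM-backed tmpfs (assumes Linux with /proc/mounts).
with open("/proc/mounts") as mounts:
    for line in mounts:
        device, mountpoint, fstype = line.split()[:3]
        if mountpoint == "/tmp":
            print(f"/tmp is mounted as {fstype}")  # "tmpfs" means RAM-backed
```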

So what should we do? One could write all of the messages to one file on the hard drive, and also write each message to a separate file by topic in the RAM-based file system.
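Here is a rough sketch of that dual write, just to make it concrete. All the paths and names are made up; none of this is part of the current repository:

```python
import os

BULK_LOG = "/var/log/mqtt/all_messages.log"  # hypothetical hard-drive file
TOPIC_DIR = "/tmp/mqtt_topics"               # hypothetical tmpfs directory

os.makedirs(TOPIC_DIR, exist_ok=True)

def log_message(topic, payload):
    """Append one message to the bulk log and to its per-topic RAM file."""
    line = f"{topic}\t{payload}\n"
    # Sequential append to the hard-drive file: no head-jumping on write.
    with open(BULK_LOG, "a") as bulk:
        bulk.write(line)
    # Append to the topic's own file in the RAM-based file system:
    # reading back by topic is then fast, and RAM does not care about seeks.
    topic_path = os.path.join(TOPIC_DIR, topic.replace("/", "_"))
    with open(topic_path, "a") as ram:
        ram.write(line)
```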

If the server crashes, no problem: just read the data back from the hard drive, repopulate the files in /tmp, and all is good.

What happens when your data files get too large? The current software already does log rotation, and you can extend the concept: when you rotate the hard-drive log, rotate the RAM logs at the same time, and then save the rotated RAM logs to the hard drive.
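Continuing the sketch above, rotation could look roughly like this (again, the archive path is made up):

```python
import shutil
import time

ARCHIVE_DIR = "/var/log/mqtt/topics"  # hypothetical archive for rotated RAM logs

def rotate_logs():
    """Rotate the bulk log and the RAM topic logs together."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    # Rotate the hard-drive bulk log.
    os.rename(BULK_LOG, f"{BULK_LOG}.{stamp}")
    # Copy each RAM topic log to the hard drive, then start it fresh.
    os.makedirs(ARCHIVE_DIR, exist_ok=True)
    for name in os.listdir(TOPIC_DIR):
        src = os.path.join(TOPIC_DIR, name)
        shutil.copy(src, os.path.join(ARCHIVE_DIR, f"{name}.{stamp}"))
        os.remove(src)  # the next message will recreate an empty topic log
```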

What if your server crashes just when that is happening? That would be a very rare event. You can always recreate the topic log files from the bulk log file.

I just do not think this is a big problem. I am not in the corporate Kafka space, where every piece of data is sacred. It does not take long to write out the temporary files. Also, one could first write the RAM logs, then rotate and rewrite them. The problem will not happen often; I do not think my ISP's servers have ever crashed, and not much data would be lost. I am okay with losing some chat data. And if you really care, one can always recreate the topic logs.

So what does this mean for this topic logger? We really should have one logger; currently we have one for logging to a single file and another for logging by topic.

The topic logger should be able to log to a single hard-drive file and, at the same time, to multiple per-topic RAM-based files.

On rotation, it should rotate both the hard-drive file and the topic files, and then save the topic RAM logs to the hard drive. Better yet, to protect against a crash, it could first save the topic logs, rotate, and then save them again.

What if some topic logs are very small? Maybe, on rotation, it could merge an old topic log with a newer one.

After a crash, it should be able to read the hard-drive file and recreate the RAM files.
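Recovery is just a replay of the bulk log. Continuing the same sketch:

```python
def rebuild_topic_logs():
    """Replay the bulk log to recreate the per-topic RAM files after a crash."""
    os.makedirs(TOPIC_DIR, exist_ok=True)
    with open(BULK_LOG) as bulk:
        for line in bulk:
            topic, _payload = line.rstrip("\n").split("\t", 1)
            topic_path = os.path.join(TOPIC_DIR, topic.replace("/", "_"))
            # Appending line by line is slow but simple; fine for recovery.
            with open(topic_path, "a") as ram:
                ram.write(line)
```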

For testing, it should be able to run without MQTT: generate random messages, then store and rotate them.
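A random generator for testing could be as simple as this (the topics are invented, and log_message is the hypothetical function from the sketch above):

```python
import random
import string

def random_messages(count=100):
    """Yield made-up (topic, payload) pairs so the logger can run without MQTT."""
    topics = ["sensors/temp", "sensors/humidity", "chat/room1"]  # invented topics
    for _ in range(count):
        payload = "".join(random.choices(string.ascii_lowercase, k=20))
        yield random.choice(topics), payload

# Exercise the logger without a broker:
for topic, payload in random_messages(10):
    log_message(topic, payload)
```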

I think it might even be nice if it could read the log file and send those messages to the MQTT broker.
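Using the standard paho-mqtt client, a replay could look roughly like this. This is a sketch, shown with the classic paho-mqtt 1.x constructor, and the paths are the same made-up ones as above:

```python
import paho.mqtt.client as mqtt

def replay_log(host="localhost", port=1883):
    """Read the bulk log back and publish every message to the broker."""
    client = mqtt.Client()  # classic paho-mqtt 1.x constructor
    client.connect(host, port)
    client.loop_start()  # run the network loop in a background thread
    with open(BULK_LOG) as bulk:
        for line in bulk:
            topic, payload = line.rstrip("\n").split("\t", 1)
            client.publish(topic, payload)
    client.loop_stop()
    client.disconnect()
```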

To summarize: I would love a logger that can read from MQTT, from a random generator, and from the hard drive. It should be able to write to the hard drive and to the RAM log files, maybe even to MQTT. And it should do rotation as described above.

I know that I am asking for a lot. I do appreciate this free software. Eventually I may write this. But maybe someone has a more urgent need than I do.

How does that sound? Did I miss anything? What do the IoT applications need? Your feedback would be most appreciated.

What if you need a quick fix? In the short run, there is one I know of; talk to me if you are interested. But in the long run, this is the solution I want.

kamocat commented 2 years ago

I think the best solution is to use a full-fledged database backend (such as PostgreSQL). Databases are optimized for random access, whereas file structures are not. If you still want to keep it file-based, how about SQLite?
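For example, a minimal SQLite sketch using Python's built-in sqlite3 module (the schema and names are just illustrative):

```python
import sqlite3

conn = sqlite3.connect("mqtt_log.db")  # file name is just an example
conn.execute(
    "CREATE TABLE IF NOT EXISTS messages (ts REAL, topic TEXT, payload BLOB)"
)
conn.execute("CREATE INDEX IF NOT EXISTS by_topic ON messages (topic, ts)")

def store_message(ts, topic, payload):
    """One sequential insert; the index makes reads by topic cheap."""
    conn.execute("INSERT INTO messages VALUES (?, ?, ?)", (ts, topic, payload))
    conn.commit()

# Reading the recent messages for one topic is then a single indexed query:
recent = conn.execute(
    "SELECT ts, payload FROM messages WHERE topic = ? ORDER BY ts DESC LIMIT 50",
    ("chat/room1",),
).fetchall()
```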