mmangione opened this issue 2 years ago
Hi @mmangione sorry about this, I understand your frustration. I have just checked and I'm not seeing any flush that we are doing in the production code. Are you referring to a specific piece in our code?
@tsteur Hello. I can’t say for sure. I was watching the monitor when I changed the password. It went from 230,000 queued events to 0. I didn’t change the database number (or anything else), so it has to be that those events were flushed from Redis. Usually, you can just remove the keys and it will delete all data. I assumed it was an explicit command, but it might be some other mechanism that accomplishes the same thing.
Did this happen the moment you changed the password in Redis, or when you changed it in Matomo? Was the password maybe also changed somewhere else in case the Redis is used somewhere else too? Just double checking to make sure it's caused by Matomo.
Did you also change any other setting in Matomo Queued Tracking?
When you save the settings, I can't see anything that would cause this. It only tries to connect to Redis and select the configured DB to test the entered credentials and that's all.
@tsteur We changed the password in Redis, and the workers stopped running. We went from 16 workers to 0. It stayed like this for a couple of minutes while we logged into the system; the monitor simply repeated the same event count over and over. When we changed the password in Matomo, the monitor paused, flashed, and the event count fell from 230,000 to 0 - and all of the worker counts dropped to 0. The queues functioned normally again after the reset.
@mmangione could you also answer the other questions, like whether you changed any other setting in Matomo Queued Tracking when you changed the password? Was Redis maybe restarted at some point and the data lost that way?
I haven't tried to reproduce it yet but so far I can't really imagine how Matomo would drop the data there.
Hello @tsteur!
It happened again and we just lost about 1.5M events over about 6 hours.
Here's what happened:
We are running 3 load-balanced servers and a load-balanced DB cluster with Redis queued tracking.
We changed the IP address for our primary Matomo DB in the config.php file.
We propagated that config file to our servers.
We restarted nginx, php-fpm, and stunnel4 (which we use for TLS tunneling for Redis).
After the restart, we began accumulating thousands and thousands of events in our Redis Queued Tracking queue. Normally, our servers stay on top of the processing and we maintain around 200 events in the queues at any given time, with 16 workers and 25-event batches.
Instead of processing the events in real time, we dropped down to maybe 20 events processed per minute, so we kept accruing thousands of events: the queues were filling up again, but they weren't draining.
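For anyone following along, the queue backlog can be watched from the command line with the QueuedTracking plugin's monitor command - a minimal sketch, with the install path as an example:

```
cd /var/www/matomo    # example install path, adjust to your setup

# Continuously prints how many requests are waiting in each queue,
# which is how a backlog like the one described above shows up.
./console queuedtracking:monitor
```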
Here is a set of printouts when we were monitoring the queues - BEFORE touching anything in the Queued Tracking plugin.
There were no screaming warning logs anywhere. The only errors we found were these messages in our php-fpm log:
[02-Jun-2022 03:38:54] WARNING: [pool www] child 416751 said into stderr: "NOTICE: PHP message: PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 131072 bytes) in /var/www/matomo/vendor/matomo/doctrine-cache-fork/lib/Doctrine/Common/Cache/RedisCache.php on line 52"
[02-Jun-2022 03:39:10] WARNING: [pool www] child 416748 said into stderr: "NOTICE: PHP message: PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 20480 bytes) in /var/www/matomo/vendor/matomo/doctrine-cache-fork/lib/Doctrine/Common/Cache/RedisCache.php on line 52"
[02-Jun-2022 03:39:27] WARNING: [pool www] child 416751 said into stderr: "NOTICE: PHP message: PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 20480 bytes) in /var/www/matomo/vendor/matomo/doctrine-cache-fork/lib/Doctrine/Common/Cache/RedisCache.php on line 52"
This is odd because our memory limit is set to -1 and our max post size is 50M. Nowhere do we set 128MB in any settings.
We finally decided to try changing the Queue Workers setting in the Matomo settings. We decreased it from 16 to 8. The queues were purged and we lost all saved events (again...). This is almost exactly what happened the last time I commented on this bug.
After the change, the queues immediately began filling up again. It was the exact same glacial processing, maybe 20 events/minute.
When we had this problem last time, we took all AUTH off of our internal Redis server. This time, we decided to reverse this process and add some AUTH back.
Adding AUTH and setting a password through the Matomo Settings UI resolved the issue.
As I write this, our systems are back to processing events and we are maintaining about 200 events in the queue at any given time. But, we found this out accidentally - without any logs or other guidance.
At this point, this is the second time we've lost a very significant number of events because this tracking plugin decided to stop working after a DB change. Last time, we changed the DB user. This time, we changed the IP address - and somehow that caused event processing to slow to a crawl, a memory leak to develop, and then stored events to be purged.
We looked everywhere for logs, but we could not find any that were relevant. Whatever was happening was occurring silently and in a way that we couldn't diagnose.
We are running in GCP in their MemoryStore service. I can confirm our Redis server was never restarted nor has it been touched since the last time I had to remove AUTH to get this working again. Nor was our main DB cluster restarted or touched at all.
@tsteur Any thoughts?
@tsteur Pinging again.
Nowhere do we set 128MB in any settings.
Maybe there is a memory limit in the config for the command line? PHP definitely says there is a memory limit. You could try running, for example, php -i | grep memory on the command line.
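For example, something along these lines - 134217728 bytes is exactly 128M, so some php.ini, FPM pool config, or ini_set() call must be overriding the -1 you expect (the paths below are only common examples):

```
# Memory limit seen by the CLI binary (each SAPI reads its own configuration):
php -i | grep -i memory_limit

# Search the usual configuration locations for an explicit limit
# (example paths for a Debian-style layout, adjust as needed):
grep -Ri "memory_limit" /etc/php/ /etc/php-fpm.d/ 2>/dev/null
```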
At this point, this is the second time we've lost a very significant number of events because this tracking plugin decided to stop working after a DB change. Last time, we changed the DB user. This time, we changed the IP address - and somehow that caused event processing to slow to a crawl, a memory leak to develop, and then stored events to be purged.
@mmangione are you also terminating and re-launching all the queued tracking workers after this change? I would recommend also clearing the Matomo caches after such a change (./console cache:clear), then terminating all running workers and launching them again.
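A rough sketch of that sequence - the worker supervision commands depend entirely on your setup and are only placeholders here, while queuedtracking:process is the plugin's worker command:

```
cd /var/www/matomo    # example install path

# 1. Clear Matomo's caches so nothing keeps using stale connection settings.
./console cache:clear

# 2. Stop all running queue workers, however they are supervised
#    (cron, systemd, supervisor, ...). The unit name is only a placeholder:
# systemctl stop 'matomo-queue-worker@*'

# 3. Launch the workers again so they reconnect with the new configuration.
./console queuedtracking:process
```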
I'm personally not working on this feature anymore so I can't really reproduce this issue easily unfortunately.
@tsteur We diagnosed and remediated the issue. It turned out to be part of a larger problem: after upgrading to 4.13.0 we developed a really bad instability where the queues started randomly resetting. Our queues looked absolutely atrocious - we were losing about 30-40% of events to these resets. Here is a graph of the queues resetting during normal activity. This is under light load.
We enabled the Matomo logging and found this error:
ERROR Piwik\FrontController[2022-12-20 23:55:56 UTC] [4001129] Uncaught exception: /matomo/plugins/QueuedTracking/Queue/Processor.php(244): Rolled back during processing as we no longer have lock or the lock was never acquired. So far tracker processed 1249 requests [Query: , CLI mode: 1]
ERROR Piwik\FrontController[2022-12-20 23:57:59 UTC] [4006353] Uncaught exception: /matomo/plugins/QueuedTracking/Queue/Processor.php(244): Rolled back during processing as we no longer have lock or the lock was never acquired. So far tracker processed 368 requests [Query: , CLI mode: 1]
The main issue turned out to be that we were using the same Redis server for both caching and for the queues. We, of course, used different databases for each service. We do this with many of our apps, and we have many apps on a single Redis server without issue.
In this case, there must be an overly broad flush command either in RedisCache or in this QueuedTracking plugin. Once we spun up two separate Redis servers, all of our issues with configuration, resetting queues, purging queues, and every other form of instability stopped. I hope this helps someone in the future, because we spent months on this, then even purchased Premium Support... all to no avail.
Then the only thing left to do with this issue would maybe be to add some info about this to the documentation?
Thanks @mmangione That must have been hard to find. So just to confirm: the issue got resolved by splitting the cache and the queued tracking onto two different Redis servers, and not just two different DBs? Do both DBs and configs share the same credentials, and do they have access to both DBs? I'm not too much into Redis, so I'm just trying to better understand where things may happen.
Indeed, while the cache does a flushdb, this plugin does a flushall, which would explain the problem.
We basically need to change https://github.com/matomo-org/plugin-QueuedTracking/blob/4.x-dev/Queue/Backend/Redis.php#L250 to flushDB
Actually, that method is only used in tests and not in production code, so this won't fix it.
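For reference, the difference between the two commands is why sharing one Redis server across applications is risky: FLUSHDB empties only the currently selected database, while FLUSHALL empties every database on the server. A quick redis-cli illustration (DB numbers and key names are arbitrary examples):

```
redis-cli -n 1 SET cache:example "cached value"          # "cache" data in DB 1
redis-cli -n 2 LPUSH queue:example "tracking request"    # "queue" data in DB 2

redis-cli -n 1 FLUSHDB     # empties DB 1 only; the key in DB 2 survives
redis-cli -n 1 FLUSHALL    # empties every DB on the server, DB 2 included
```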
@tsteur We use a single Redis server and separate our applications by using different database numbers. The credentials are the same and the auth is the same. The only thing that changes is the database number each application references.
@tsteur We suspect the issue has been going on since we first set up our instance in 2020. Our visits, visitors, and PPC campaign results were off by (+/-)20% on some days (compared against GA). We could never quite figure out why. We did notice that our queues would hover around 200 - 300 actions to process, occasionally drop to 0 - 10, then climb back up immediately.
With 4.13.0, we saw a 10,000% increase in the network traffic of our Redis server. That sudden load stressed every piece of our infrastructure and showed us how many areas needed to be tuned properly.
We saw the queues behaving strangely but thought it might be normal; we didn't recognize it as an issue until 4.13.0 caused huge buildups and dumps of the queues. Without that sudden, significant increase in load, I don't think this issue would have been noticeable.
It was only after we systematically addressed all of the other possible causes and the problem persisted that we thought to reach out to Matomo Support. It was actually a fortuitous, offhand remark by Richard, who was surprised that we were "using the same Redis server for both caching and queued tracking", that got us to question our architecture at all. We got very lucky here. It wasn't even on our radar otherwise.
I guess a good lesson here is to stress test all production systems periodically. We have definitely learned that lesson.
Hopefully, this explanation of the process of identifying this issue will be helpful to future devs. If you're reading this in the future, go stress test your setup. Send more and more traffic at it until it fails, then identify what failed and mitigate it. Do it now. If you are having any intermittent issue, this should be your strategy to find the root cause.
@mmangione are you maybe also using Redis for sessions?
I couldn't locate anything yet that would do any flush or explain this. There was one flush, but it actually only runs in tests.
@heurteph-ei I'll update an FAQ for now
@tsteur Sorry for the delay! I didn't see this until I received the notification that you had reopened the issue.
We were using Redis for both Matomo's caching and queued tracking, on the same Redis server with a different database for each system. The problem went away completely after we stood up a new Redis server and edited our configurations.
We just lost 230,000 events of customer data because we changed our Redis queue password from the settings. I can't imagine why one would ever intentionally issue a FLUSH command after a setting is updated. Or, if you knew about this, why is it not called out in all caps on that settings card?
We have to periodically cycle passwords for security. We need to be able to do that from our Matomo instance WITHOUT having our system purge our queued events.
That has to be a bug. That behavior is extremely harmful. We are trying to use this system at the scale of 10M hits/day. It's going to take days for us to try to reconstruct the lost data that the queueing system just purged unexpectedly.