If Event table is empty during a MySQL restart, auto_increment is reset and Huginn stops working

jplitza commented 7 years ago

Problem

On a very low-volume instance, at some point, events simply stop to be processed. They are generated, but not propagated along links, even if the links are re-created.

Cause

Links and Agents store which event ID they last processed. On a low-volume instance with events being deleted after some time, it may happen that the events table becomes empty. If the MySQL server is then restarted and the table uses InnoDB (the default?), the auto_increment column starts at 1 again. Thus, many events that are subsequently created are ignored by links and agents because they think they already processed them. Only if enough events are generated (before the table is empty and the server is restarted again), they will be noticed again.

Solutions

Never delete the event with the highest ID
Put a warning next to the field where you choose the retention period
Avoid/discourage the use of InnoDB for the events table

dsander commented 7 years ago

Hi @jplitza,

thanks for reporting the issue. That sounds like very 'interesting' behavior of InnoDB. I think we should migrate the events table to MyISAM, does anyone know any downsides?

cantino commented 7 years ago

Wow, that's crazy. You're totally correct. I had no idea. In general, InnoDB is more reliable than MyISAM, and has better transaction support. It'd be a shame to lose that.

It's a hack, but maybe we could check ID when we add a new Event?

e.g.

def build_event(event)
  event = events.build(event) if event.is_a?(Hash)
  event.agent = self
  event.user = user
  event.expires_at ||= new_event_expiration_date
  event.id = Agent.maximum(:last_checked_event_id) + 1 if Event.exists?
  event
end

jplitza commented 7 years ago

@cantino You probably know better, but the code snippet you posted sounds like problems could still occur when the current (originating?) agent's last_checked_event_id is lower than the receiving agent's.

I haven't had a look at the code yet, but couldn't you "simply" add a fake event that will not really be processed by anything ever, just before pruning the events table? It may have an expiration date of +1m or something, but won't be purged by the current run (and the next will add its own fake event).

dsander commented 7 years ago

@jplitza Agent.maximum(:last_checked_event_id) is the largest Event id of all Agents in the database, I think that might work.

@cantino Wouldn't we need to that especially for new events? As far as I remember setting the id like that does not work. Not sure if the model only needs to be persisted first or if only update_attribute works, but I remember having issues messing with the id. Probably because one normally should not do it 😄

An other option would be to update the last_checked_event_id column of all Agents after purging the events.

  def self.cleanup_expired!
    affected_agents = Event.expired.group("agent_id").pluck(:agent_id)
    Event.transaction do
      Event.expired.delete_all
      max_id = Event.maximum(:id)
      Agent.where(id: affected_agents)
           .update_all ["last_checked_event_id = ?,events_count = (select count(*) from events where agent_id = agents.id)", max_id]
    end
  end

cantino commented 7 years ago

@dsander, I think that would only be safe if you were emptying the Event table, right? Otherwise you could cause an Agent to skip over events it hadn't seen yet?

I'm not sure about setting id. I'd hope it would work, as it's valid from the DB's perspective.

MingShyanWei commented 7 years ago

Keep the unprocessed events in the event table only.

When "propagate!" or "immediately propagate". Select all event belong this agent from event table , process, and then move to events_log.

Agent no count the maximum event id.

dsander commented 7 years ago

@cantino Right, the query would have to be specific for every Agent. It could probably be done in SQL as well but would be more complex. We could also change cleanup_expired! to only expire events that have been processed.

  scope :expired, lambda {
    where("expires_at IS NOT NULL AND expires_at < ? AND id > ?", Time.now, Agent.active.min(:last_checked_event_id))
  }

But that would change unrelated logic

@MingShyanWei While in general that could work it is not really practical for Huginn because it would require a lot of changes.

cantino commented 7 years ago

We'd want to only look at last_checked_event_id on Agents that could receive the Event. It is complex. Maybe we should have a special placeholder event with some flag on it that we never delete or propagate? What a pain.

MingShyanWei commented 7 years ago

@cantino see eye to eye with sb.

dsander commented 7 years ago

@cantino We would need to move that event to a new id every time cleanup expired is run, right? Feels a bit hacky, but definitely more efficient then updating every Agent's last_checked_event_id

What do you all think about switching the manual installation guide and the multi-process docker image to postgres?

cantino commented 7 years ago

I'd be totally fine with switching to Postgres.

cantino commented 7 years ago

I also ran Huginn without issue on Mysql for years, so, while this is a problem, it's not a frequent one.

knu commented 7 years ago

What about always keeping the latest event?

diff --git a/app/models/event.rb b/app/models/event.rb
index 410bc8b8..d8809cea 100644
--- a/app/models/event.rb
+++ b/app/models/event.rb
@@ -25,7 +25,7 @@ class Event < ActiveRecord::Base
   after_create :possibly_propagate

   scope :expired, lambda {
-    where("expires_at IS NOT NULL AND expires_at < ?", Time.now)
+    where("expires_at IS NOT NULL AND expires_at < ?", Time.now).where.not(id: maximum(:id))
   }

   scope :with_location, -> {

cantino commented 7 years ago

I think that might work for the case of automatic cleanup, which would be a big step forward.

There are still other ways that the Event with the highest ID could get deleted, like if the user delets the Agent that created the highest Event, or if they delete the Event manually in the UI, but your solution would still fix the common case of automatic event deletion.

Should we add this?

knu commented 7 years ago

OK, so the correct workaround would be to detect when the events table is empty as timely as possible and reset last_checked_event_id of all agents. Fortunately, there are only a few places in Huginn where events may get empty, so it won't be hard to do that, if I'm not missing something.

diff --git a/app/controllers/agents_controller.rb b/app/controllers/agents_controller.rb
index 1ae90763..8205f06a 100644
--- a/app/controllers/agents_controller.rb
+++ b/app/controllers/agents_controller.rb
@@ -210,6 +210,7 @@ class AgentsController < ApplicationController

   def destroy_undefined
     current_user.undefined_agents.destroy_all
+    Event.reset_agents_if_empty

     redirect_back "All undefined Agents have been deleted."
   end
diff --git a/app/models/agent.rb b/app/models/agent.rb
index aea2e11b..bdfd58c5 100644
--- a/app/models/agent.rb
+++ b/app/models/agent.rb
@@ -42,6 +42,7 @@ class Agent < ActiveRecord::Base
   before_save :unschedule_if_cannot_schedule
   before_create :set_last_checked_event_id
   after_save :possibly_update_event_expirations
+  after_destroy { Event.reset_agents_if_empty }

   belongs_to :user, :inverse_of => :agents
   belongs_to :service, :inverse_of => :agents, optional: true
diff --git a/app/models/event.rb b/app/models/event.rb
index 410bc8b8..ee19c9a9 100644
--- a/app/models/event.rb
+++ b/app/models/event.rb
@@ -23,6 +23,7 @@ class Event < ActiveRecord::Base

   after_create :update_agent_last_event_at
   after_create :possibly_propagate
+  after_destroy { Event.reset_agents_if_empty }

   scope :expired, lambda {
     where("expires_at IS NOT NULL AND expires_at < ?", Time.now)
@@ -74,9 +75,16 @@ class Event < ActiveRecord::Base
   def self.cleanup_expired!
     affected_agents = Event.expired.group("agent_id").pluck(:agent_id)
     Event.expired.delete_all
+    Event.reset_agents_if_empty
     Agent.where(:id => affected_agents).update_all "events_count = (select count(*) from events where agent_id = agents.id)"
   end

+  def self.reset_agents_if_empty
+    if !exists?
+      Agent.where.not(last_checked_event_id: nil).update_all(last_checked_event_id: nil)
+    end
+  end
+
   protected

   def update_agent_last_event_at
diff --git a/lib/agent_runner.rb b/lib/agent_runner.rb
index 815f0692..bac84ee8 100644
--- a/lib/agent_runner.rb
+++ b/lib/agent_runner.rb
@@ -35,6 +35,8 @@ class AgentRunner
   end

   def run
+    Event.reset_agents_if_empty
+
     @running = true
     run_workers

cantino commented 7 years ago

It looks like MySQL effectively sets the auto-increment counter to SELECT MAX(ai_col) FROM table_name on restart. So, imagine that the Events table has a bunch of Events in it, then the newest set of events get expired and removed, but some old Events still remain (because their creating Agent isn't set to expire). Now, if the server is restarted, the events table won't be empty, but the auto-increment value would decrease to the current max(id) in the table. This would break last_checked_event_id values in the same way as when the auto-increment goes to 0 on an empty table, right?

So, I think the line of reasoning from your original solution may be a better one.

dsander commented 7 years ago

@knu I like your first suggestion.

Isn't your reset_agents_if_empty function really reset_agents_if_not_empty? I think that suffers the same race condition @cantino described here

jplitza commented 7 years ago

I think the correct thing would be the reset_agents_if_empty approach, but setting the last_checked_event_id to maximum(:id). Right?

dsander commented 7 years ago

@jplitza We can't to that because it can lead to Events not being processed by the receiving Agent. Imagine an Event being emited, then we clean up the expired Events (and set last_checked_event_id) now the receiving Agent will never process the new Event.

jplitza commented 7 years ago

minimum(maximum(:id), :last_checked_event_id) then.

cantino commented 7 years ago

This... is hard to think about. 😄 @jplitza, do you see any issues with @knu's current solution in #1974?

dsander commented 7 years ago

@jplitza Sadly it's not that easy we discussed it here: https://github.com/cantino/huginn/issues/1940#issuecomment-287651996 https://github.com/cantino/huginn/issues/1940#issuecomment-287662409 https://github.com/cantino/huginn/issues/1940#issuecomment-288034168

jplitza commented 7 years ago

No, my proposal doesn't cause agents to skip events. I'm not confident with Rails, so let me give you plain SQL (probably not with the right column and table names though):

UPDATE agents SET last_checked_event_id = MIN(last_checked_event_id, (SELECT MAX(id) FROM events));

This will

never increase the last_checked_event_id, thus never skip any already present, yet unprocessed events.
lower the last_checked_event_id – if necessary – to the highest event id present in the database, thus not ignoring any newly added events.

dsander commented 7 years ago

@jplitza Imagine the Event cleanup runs after an Event was created but before the receiving Agent was able to process it. The SQL query will update the Agents last_checked_event_id and never process the Event that was created before the cleanup.

jplitza commented 7 years ago

No, it won't. Example:

The agent has last_checked_event_id=123
An event that should be received by this agent is created with ID 170
The cleanup runs, deleting all events up to including for example 157
- It sets last_checked_event_id=min(123,max(id))=min(123,170)=123 for our example agent
The agent runs and processes event 170

On the other hand:

The latest event has ID 123 and is about to expire
The agent has last_checked_event_id=123
The cleanup runs, deleting the event with ID 123
- It sets last_checked_event_id=min(123,max(id))<123
The next event might have ID 124 if the MySQL server isn't restarted, or something lower than that if it was restarted. However, it will always have a bigger ID than last_checked_event_id.

dsander commented 7 years ago

@jplitza I overlooked the min,max logic. I think you are right and your solution should cover all situations properly. There is the potential for a race condition when the event cleanup runs at the exact time when Agent#receive! is somewhere here, but I don't think it would cause problems. The Events processed in receive! will always be recent enough to not be deleted by the event cleanup.

@knu @cantino Could you double check? If it indeed fixes the issue the I feel it would be a clean solution than #1974

knu commented 7 years ago

The problem is where to hook and execute the UPDATE statement. Should we do it every time a database connection is established, and is that enough?

A rather minor implementation issue I found out is that min(a, b) is not a valid SQL expression. 😆 There's least(a, b) that's available in both MySQL and PostgreSQL, but they differ in how to deal with null values, so you should use the case when construct here instead, which is part of SQL standards.

erebor commented 4 years ago

@knu Should this issue be closed now that #2014 is merged?

huginn / huginn