peterprib / node-red-contrib-kafka-manager

GNU General Public License v3.0

"on offsetOutOfRange Error: OffsetOutOfRange" #21

Closed: InSupport closed this issue 4 years ago

InSupport commented 4 years ago

I have an application running in Docker on my machine, producing data. When I restart that application, I get the "on offsetOutOfRange Error: OffsetOutOfRange" error on my group consumer, and I have to manually restart the node from the Node-RED UI.

An entry in the topic typically looks like this:

Offset: 48 Key: 1588669544871 Message: {"type":"END_OF_INPUT_RECORD"} Timestamp: 2020-05-05 11:05:44

The error is thrown every time I restart the application. Could you add a way to catch the error and resume programmatically?

peterprib commented 4 years ago

The problem is not in my code but in kafka-node (see https://github.com/SOHU-Co/kafka-node/issues/1210). If you have a set of actions that overcomes the problem and that I can automate, I will, but it may not work in all cases. When you say "manually restart", what is the actual action being performed? Note that some of these problems may be fixed by a configuration setting, as suggested in https://github.com/SOHU-Co/kafka-node/issues/1190, which seems to be related to the scenario you raised.

InSupport commented 4 years ago

I've set Out Of Range Offset to Latest and that didn't do anything.

By "manually restart" I mean restarting all the flows from the web UI (Deploy->Restart Flows). I've created a temporary workaround: a trigger node fires a web request when it doesn't receive anything for 15s. The web request calls http://localhost:1880/flows with the reload command, so all flows are restarted. This works when that is the only flow in the system, but it wouldn't work if there are other flows around.
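The workaround described above can be sketched outside Node-RED roughly as follows. This is a minimal sketch, not the actual flow: `createWatchdog`, `buildReloadRequest` and `reloadFlows` are hypothetical names, the empty JSON body is an assumption, and it presumes an unsecured Node-RED instance whose Admin API accepts `POST /flows` with the `Node-RED-Deployment-Type: reload` header. `reloadFlows` also assumes a Node.js version with a global `fetch`.

```javascript
// Hypothetical helper: describe the Admin API call that reloads all flows.
// Assumes no adminAuth is configured; the "{}" body is an assumption.
function buildReloadRequest(baseUrl) {
  return {
    url: baseUrl.replace(/\/+$/, "") + "/flows",
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Node-RED-Deployment-Type": "reload",
    },
    body: "{}",
  };
}

// Hypothetical helper: send the reload request (needs a global fetch).
async function reloadFlows(baseUrl) {
  const { url, ...init } = buildReloadRequest(baseUrl);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`reload failed: HTTP ${res.status}`);
}

// Hypothetical watchdog: calls onSilence when feed() has not been called
// for timeoutMs. The timers argument only exists so it can be swapped out.
function createWatchdog(timeoutMs, onSilence, timers = { setTimeout, clearTimeout }) {
  let handle = null;
  function feed() {
    if (handle !== null) timers.clearTimeout(handle);
    handle = timers.setTimeout(() => {
      handle = null;
      onSilence();
    }, timeoutMs);
  }
  function stop() {
    if (handle !== null) timers.clearTimeout(handle);
    handle = null;
  }
  return { feed, stop };
}

// Example wiring (left commented out so nothing runs on load):
// const watchdog = createWatchdog(15000, () => {
//   reloadFlows("http://localhost:1880").catch(console.error);
// });
// watchdog.feed(); // call once at start and again for every consumed message
```

As noted in the comment, a reload restarts every flow, so this is only suitable when nothing else on the instance would be disturbed by it.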

peterprib commented 4 years ago

Sounds like a cache issue. I'm not sure of the significance of more than one node. I will try to put in some logic that reloads kafka-node on this failure to clear out the cache, which should mirror your action. It may cause other issues and be complex, as I have to understand the impact on all nodes, since this is a shared resource at the manager node level.

peterprib commented 4 years ago

Before I launch into a lot of effort: I see you have tried latest, but have you tried earliest?
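For reference, a sketch of how the two settings look as a kafka-node `ConsumerGroup` options object. The host, group id and topic are placeholders rather than values from this thread, and it is an assumption that the node's "Out Of Range Offset" field is passed through as `outOfRangeOffset`.

```javascript
// Options in the shape accepted by kafka-node's ConsumerGroup.
// kafkaHost and groupId are placeholder values.
const consumerGroupOptions = {
  kafkaHost: "localhost:9092",
  groupId: "example-group",
  // Starting point when the group has no committed offset yet.
  fromOffset: "latest",
  // Starting point when the committed offset is out of range:
  // "earliest", "latest" or "none".
  outOfRangeOffset: "earliest",
};

// With kafka-node installed, the options would be used like this
// ("example-topic" is a placeholder):
// const { ConsumerGroup } = require("kafka-node");
// const consumer = new ConsumerGroup(consumerGroupOptions, ["example-topic"]);
```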

InSupport commented 4 years ago

Now that you mention it, that does seem to work. It continues to consume but the error message isn't reset. It's stuck on "OffsetOutOfRange (PAUSED)" even though it continues to consume. The error message is reset upon flow restart.

Could you look into the error message part?

peterprib commented 4 years ago

I'm slightly confused, because if the code gets OffsetOutOfRange it pauses the node. You can then resume from the GUI, which gets rid of this message. If you restart the flow, it should in theory put the node into yellow status, so I don't understand why it behaves the way it does.

I have now changed the code to separate out the driver per node and to enable reconnect. I will leave this in in case it is useful, although it consumes a tad more resources. I was going to get you to test reconnect and then, if that worked, automate a single reconnect when OffsetOutOfRange is signalled. I was pondering whether this is good behaviour: I assume it would be better to use resume after such a failure when the cause is known to be an application restart, and otherwise the OffsetOutOfRange should be looked at before making a decision. I have to work out what a restart of the flow actually does and why it doesn't rebuild the nodes; I may need to handle the event. It implies you have flushed messages, which I would have thought poor behaviour, and that the message that caused the issue was accepted and flushed.
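The "reconnect once when OffsetOutOfRange is signalled" idea could look something like the sketch below. It is not the code in this module: `reconnectOnOutOfRange` and the `reconnect` callback are hypothetical, and a plain `EventEmitter` stands in for the consumer so the example runs without a broker. It assumes the consumer emits an `offsetOutOfRange` event, as the error text in this issue suggests.

```javascript
const { EventEmitter } = require("events");

// Hypothetical guard: call reconnect at most maxAttempts times when the
// consumer signals offsetOutOfRange; later signals are left for a person
// to look at (the node would stay paused).
function reconnectOnOutOfRange(consumer, reconnect, maxAttempts = 1) {
  let attempts = 0;
  consumer.on("offsetOutOfRange", (err) => {
    if (attempts >= maxAttempts) return; // already tried, do not loop
    attempts += 1;
    reconnect(err);
  });
  return {
    attempts: () => attempts,
    reset: () => { attempts = 0; }, // e.g. after a successful consume
  };
}

// Stand-in consumer, used here only to show the behaviour.
const consumer = new EventEmitter();
const log = [];
const guard = reconnectOnOutOfRange(consumer, (err) => {
  log.push(`reconnect after: ${err.message}`);
});

consumer.emit("offsetOutOfRange", new Error("OffsetOutOfRange"));
consumer.emit("offsetOutOfRange", new Error("OffsetOutOfRange"));
console.log(log.length); // prints 1
```

Limiting the number of attempts keeps a persistent out-of-range condition from turning into a reconnect loop; calling `guard.reset()` after messages flow again would allow one more automatic attempt on the next application restart.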