rabbitmq / discussions

Please use RabbitMQ mailing list for questions. Issues that are questions, discussions or lack details necessary to investigate them are moved to this repository.
3 stars 4 forks source link

MQTT: client ID Raft machine fails with a "missing_segment_header" during plugin activation #4

Closed corvusmod closed 5 years ago

corvusmod commented 5 years ago

Hi,

With the latest update the service doesn't start. I did some test and the problem seems related with the mqtt plugin. If I disable all plugins the service start flawlessly:

2019-11-11 13:59:19.424 [info] <0.8.0> Log file opened with Lager
2019-11-11 13:59:24.922 [info] <0.8.0> Feature flags: list of feature flags found:
2019-11-11 13:59:24.922 [info] <0.8.0> Feature flags:   [ ] implicit_default_bindings
2019-11-11 13:59:24.922 [info] <0.8.0> Feature flags:   [ ] quorum_queue
2019-11-11 13:59:24.923 [info] <0.8.0> Feature flags:   [ ] virtual_host_metadata
2019-11-11 13:59:24.923 [info] <0.8.0> Feature flags: feature flag states written to disk: yes
2019-11-11 13:59:31.195 [info] <0.296.0> ra: meta data store initialised. 1 record(s) recovered
2019-11-11 13:59:31.196 [info] <0.301.0> WAL: recovering ["/var/lib/rabbitmq/mnesia/rabbit@ip-172-31-10-228/quorum/rabbit@ip-172-31-10-228/00000108.wal"]
2019-11-11 13:59:31.200 [info] <0.305.0> 
 Starting RabbitMQ 3.8.1 on Erlang 22.1.7
 Copyright (c) 2007-2019 Pivotal Software, Inc.
 Licensed under the MPL 1.1. Website: https://rabbitmq.com
2019-11-11 13:59:31.203 [info] <0.305.0> 
 node           : rabbit@ip-172-31-10-228
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : VcOVvu2ZTjHbYbayr0NOdg==
 log(s)         : /var/log/rabbitmq/rabbit@ip-172-31-10-228.log
                : /var/log/rabbitmq/rabbit@ip-172-31-10-228_upgrade.log
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@ip-172-31-10-228
2019-11-11 13:59:31.232 [info] <0.305.0> Running boot step pre_boot defined by app rabbit
2019-11-11 13:59:31.232 [info] <0.305.0> Running boot step rabbit_core_metrics defined by app rabbit
2019-11-11 13:59:31.233 [info] <0.305.0> Running boot step rabbit_alarm defined by app rabbit
2019-11-11 13:59:31.250 [info] <0.320.0> Memory high watermark set to 782 MiB (820319027 bytes) of 1955 MiB (2050797568 bytes) total
2019-11-11 13:59:31.279 [info] <0.336.0> Enabling free disk space monitoring
2019-11-11 13:59:31.280 [info] <0.336.0> Disk free limit set to 50MB
2019-11-11 13:59:31.285 [info] <0.305.0> Running boot step code_server_cache defined by app rabbit
2019-11-11 13:59:31.285 [info] <0.305.0> Running boot step file_handle_cache defined by app rabbit
2019-11-11 13:59:31.286 [info] <0.347.0> FHC read buffering:  OFF
2019-11-11 13:59:31.286 [info] <0.347.0> FHC write buffering: ON
2019-11-11 13:59:31.286 [info] <0.346.0> Limiting to approx 32671 file handles (29401 sockets)
2019-11-11 13:59:31.293 [info] <0.305.0> Running boot step worker_pool defined by app rabbit
2019-11-11 13:59:31.294 [info] <0.306.0> Will use 2 processes for default worker pool
2019-11-11 13:59:31.294 [info] <0.306.0> Starting worker pool 'worker_pool' with 2 processes in it
2019-11-11 13:59:31.294 [info] <0.305.0> Running boot step database defined by app rabbit
2019-11-11 13:59:37.296 [info] <0.305.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2019-11-11 13:59:37.296 [info] <0.305.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2019-11-11 13:59:37.320 [info] <0.305.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2019-11-11 13:59:37.321 [info] <0.305.0> Peer discovery backend rabbit_peer_discovery_classic_config does not support registration, skipping registration.
2019-11-11 13:59:37.321 [info] <0.305.0> Running boot step database_sync defined by app rabbit
2019-11-11 13:59:37.321 [info] <0.305.0> Running boot step feature_flags defined by app rabbit
2019-11-11 13:59:37.321 [info] <0.305.0> Running boot step codec_correctness_check defined by app rabbit
2019-11-11 13:59:37.321 [info] <0.305.0> Running boot step external_infrastructure defined by app rabbit
2019-11-11 13:59:37.321 [info] <0.305.0> Running boot step rabbit_registry defined by app rabbit
2019-11-11 13:59:37.321 [info] <0.305.0> Running boot step rabbit_auth_mechanism_cr_demo defined by app rabbit
2019-11-11 13:59:37.321 [info] <0.305.0> Running boot step rabbit_queue_location_random defined by app rabbit
2019-11-11 13:59:37.321 [info] <0.305.0> Running boot step rabbit_event defined by app rabbit
2019-11-11 13:59:37.321 [info] <0.305.0> Running boot step rabbit_auth_mechanism_amqplain defined by app rabbit
2019-11-11 13:59:37.322 [info] <0.305.0> Running boot step rabbit_auth_mechanism_plain defined by app rabbit
2019-11-11 13:59:37.322 [info] <0.305.0> Running boot step rabbit_exchange_type_direct defined by app rabbit
2019-11-11 13:59:37.322 [info] <0.305.0> Running boot step rabbit_exchange_type_fanout defined by app rabbit
2019-11-11 13:59:37.322 [info] <0.305.0> Running boot step rabbit_exchange_type_headers defined by app rabbit
2019-11-11 13:59:37.322 [info] <0.305.0> Running boot step rabbit_exchange_type_topic defined by app rabbit
2019-11-11 13:59:37.322 [info] <0.305.0> Running boot step rabbit_mirror_queue_mode_all defined by app rabbit
2019-11-11 13:59:37.322 [info] <0.305.0> Running boot step rabbit_mirror_queue_mode_exactly defined by app rabbit
2019-11-11 13:59:37.322 [info] <0.305.0> Running boot step rabbit_mirror_queue_mode_nodes defined by app rabbit
2019-11-11 13:59:37.322 [info] <0.305.0> Running boot step rabbit_priority_queue defined by app rabbit
2019-11-11 13:59:37.322 [info] <0.305.0> Priority queues enabled, real BQ is rabbit_variable_queue
2019-11-11 13:59:37.322 [info] <0.305.0> Running boot step rabbit_queue_location_client_local defined by app rabbit
2019-11-11 13:59:37.323 [info] <0.305.0> Running boot step rabbit_queue_location_min_masters defined by app rabbit
2019-11-11 13:59:37.323 [info] <0.305.0> Running boot step kernel_ready defined by app rabbit
2019-11-11 13:59:37.323 [info] <0.305.0> Running boot step rabbit_sysmon_minder defined by app rabbit
2019-11-11 13:59:37.323 [info] <0.305.0> Running boot step rabbit_epmd_monitor defined by app rabbit
2019-11-11 13:59:37.327 [info] <0.378.0> epmd monitor knows us, inter-node communication (distribution) port: 25672
2019-11-11 13:59:37.328 [info] <0.305.0> Running boot step guid_generator defined by app rabbit
2019-11-11 13:59:37.333 [info] <0.305.0> Running boot step rabbit_node_monitor defined by app rabbit
2019-11-11 13:59:37.333 [info] <0.382.0> Starting rabbit_node_monitor
2019-11-11 13:59:37.334 [info] <0.305.0> Running boot step delegate_sup defined by app rabbit
2019-11-11 13:59:37.335 [info] <0.305.0> Running boot step rabbit_memory_monitor defined by app rabbit
2019-11-11 13:59:37.335 [info] <0.305.0> Running boot step core_initialized defined by app rabbit
2019-11-11 13:59:37.335 [info] <0.305.0> Running boot step upgrade_queues defined by app rabbit
2019-11-11 13:59:37.358 [info] <0.305.0> Running boot step rabbit_connection_tracking defined by app rabbit
2019-11-11 13:59:37.358 [info] <0.305.0> Running boot step rabbit_connection_tracking_handler defined by app rabbit
2019-11-11 13:59:37.358 [info] <0.305.0> Running boot step rabbit_exchange_parameters defined by app rabbit
2019-11-11 13:59:37.358 [info] <0.305.0> Running boot step rabbit_mirror_queue_misc defined by app rabbit
2019-11-11 13:59:37.359 [info] <0.305.0> Running boot step rabbit_policies defined by app rabbit
2019-11-11 13:59:37.360 [info] <0.305.0> Running boot step rabbit_policy defined by app rabbit
2019-11-11 13:59:37.360 [info] <0.305.0> Running boot step rabbit_queue_location_validator defined by app rabbit
2019-11-11 13:59:37.360 [info] <0.305.0> Running boot step rabbit_quorum_memory_manager defined by app rabbit
2019-11-11 13:59:37.360 [info] <0.305.0> Running boot step rabbit_vhost_limit defined by app rabbit
2019-11-11 13:59:37.360 [info] <0.305.0> Running boot step recovery defined by app rabbit
2019-11-11 13:59:37.364 [info] <0.419.0> Making sure data directory '/var/lib/rabbitmq/mnesia/rabbit@ip-172-31-10-228/msg_stores/vhosts/6L19J0ZI9130CN2MP4A516L7S' for vhost '/vanilla' exists
2019-11-11 13:59:37.370 [info] <0.419.0> Starting message stores for vhost '/vanilla'
2019-11-11 13:59:37.371 [info] <0.423.0> Message store "6L19J0ZI9130CN2MP4A516L7S/msg_store_transient": using rabbit_msg_store_ets_index to provide index
2019-11-11 13:59:37.376 [info] <0.419.0> Started message store of type transient for vhost '/vanilla'
2019-11-11 13:59:37.377 [info] <0.426.0> Message store "6L19J0ZI9130CN2MP4A516L7S/msg_store_persistent": using rabbit_msg_store_ets_index to provide index
2019-11-11 13:59:37.381 [info] <0.419.0> Started message store of type persistent for vhost '/vanilla'
2019-11-11 13:59:37.409 [info] <0.485.0> Making sure data directory '/var/lib/rabbitmq/mnesia/rabbit@ip-172-31-10-228/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L' for vhost '/' exists
2019-11-11 13:59:37.413 [info] <0.485.0> Starting message stores for vhost '/'
2019-11-11 13:59:37.414 [info] <0.489.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_transient": using rabbit_msg_store_ets_index to provide index
2019-11-11 13:59:37.419 [info] <0.485.0> Started message store of type transient for vhost '/'
2019-11-11 13:59:37.420 [info] <0.492.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": using rabbit_msg_store_ets_index to provide index
2019-11-11 13:59:37.424 [info] <0.485.0> Started message store of type persistent for vhost '/'
2019-11-11 13:59:37.663 [info] <0.305.0> Running boot step empty_db_check defined by app rabbit
2019-11-11 13:59:37.663 [info] <0.305.0> Running boot step rabbit_looking_glass defined by app rabbit
2019-11-11 13:59:37.663 [info] <0.305.0> Running boot step rabbit_core_metrics_gc defined by app rabbit
2019-11-11 13:59:37.663 [info] <0.305.0> Running boot step background_gc defined by app rabbit
2019-11-11 13:59:37.663 [info] <0.305.0> Running boot step connection_tracking defined by app rabbit
2019-11-11 13:59:37.663 [info] <0.305.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@ip-172-31-10-228'
2019-11-11 13:59:37.664 [info] <0.305.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@ip-172-31-10-228'
2019-11-11 13:59:37.664 [info] <0.305.0> Running boot step routing_ready defined by app rabbit
2019-11-11 13:59:37.664 [info] <0.305.0> Running boot step pre_flight defined by app rabbit
2019-11-11 13:59:37.664 [info] <0.305.0> Running boot step notify_cluster defined by app rabbit
2019-11-11 13:59:37.664 [info] <0.305.0> Running boot step networking defined by app rabbit
2019-11-11 13:59:37.671 [info] <0.569.0> started TCP listener on [::]:5672
2019-11-11 13:59:37.671 [info] <0.305.0> Running boot step cluster_name defined by app rabbit
2019-11-11 13:59:37.671 [info] <0.305.0> Running boot step direct_client defined by app rabbit
2019-11-11 13:59:38.772 [notice] <0.107.0> Changed loghwm of /var/log/rabbitmq/rabbit@ip-172-31-10-228.log to 50
2019-11-11 13:59:39.327 [info] <0.8.0> Server startup complete; 0 plugins started.

But once I enable the mqtt plugin it crashes:

2019-11-11 14:03:43.328 [info] <0.839.0> MQTT retained message store: rabbit_mqtt_retained_msg_store_dets
2019-11-11 14:03:43.344 [info] <0.860.0> started MQTT TCP listener on [::]:1883
2019-11-11 14:03:59.394 [error] <0.866.0> ** State machine mqtt_node terminating
** When server state  = {undefined,"ra_server_proc:format_status/2 crashed"}
** Reason for termination = error:{badmatch,{error,missing_segment_header}}
** Callback mode = undefined
** Stacktrace =
**  [{ra_log,'-recover_range/2-fun-1-',2,[{file,"src/ra_log.erl"},{line,1034}]},{ra_log,'-recover_range/2-lists^foldl/2-1-',3,[{file,"src/ra_log.erl"},{line,1032}]},{ra_log,recover_range,2,[{file,"src/ra_log.erl"},{line,1032}]},{ra_log,init,1,[{file,"src/ra_log.erl"},{line,143}]},{ra_server,init,1,[{file,"src/ra_server.erl"},{line,234}]},{ra_server_proc,init,1,[{file,"src/ra_server_proc.erl"},{line,237}]},{gen_statem,init_it,6,[{file,"gen_statem.erl"},{line,714}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]
2019-11-11 14:03:59.395 [error] <0.866.0> CRASH REPORT Process <0.866.0> with 0 neighbours crashed with reason: no match of right hand value {error,missing_segment_header} in ra_log:'-recover_range/2-fun-1-'/2 line 1034
2019-11-11 14:03:59.395 [error] <0.865.0> Supervisor {<0.865.0>,ra_server_sup} had child mqtt_node started with ra_server_proc:start_link(#{await_condition_timeout => 30000,broadcast_time => 100,cluster_name => mqtt_node,friendly_name => ...,...}) at undefined exit with reason {badmatch,{error,missing_segment_header}} in context start_error
2019-11-11 14:03:59.396 [error] <0.836.0> CRASH REPORT Process <0.836.0> with 0 neighbours exited with reason: no match of right hand value {error,{shutdown,{failed_to_start_child,mqtt_node,{badmatch,{error,missing_segment_header}}}}} in rabbit_mqtt:start/2 line 29 in application_master:init/4 line 138
2019-11-11 14:03:59.396 [info] <0.43.0> Application rabbitmq_mqtt exited with reason: no match of right hand value {error,{shutdown,{failed_to_start_child,mqtt_node,{badmatch,{error,missing_segment_header}}}}} in rabbit_mqtt:start/2 line 29
2019-11-11 14:03:59.396 [info] <0.860.0> stopped MQTT TCP listener on [::]:1883
2019-11-11 14:03:59.396 [info] <0.43.0> Application amqp_client exited with reason: stopped

To be sure that it's not a problem of configuration, I commented all the mqtt configuration from the rabbitmq.conf file.

Any solution? Thanks

michaelklishin commented 5 years ago

It has nothing to do with the MQTT plugin. The Raft library fails to recover a log range because a segment is missing its header.

Was the node shut down cleanly? Can you share this node's data directory or at least the rabbit@{hostname}/quorum directory under it?

If you need to start the node and don't care about preserving its data you can move the rabbit@{hostname}/quorum directory.

michaelklishin commented 5 years ago

Actually maybe it is specific to the MQTT plugin since it's the MQTT client ID tracking machine that reports it. But we cannot be 100% sure. If you don't use quorum queues then you can safely remove the quorum data directory since what MQTT plugin uses Raft for is inherently transient data (a set of connection client IDs).

We'd need a data directory that exhibits this behavior in order to investigate.

corvusmod commented 5 years ago

Thanks @michaelklishin Looking at the logs I found that must be related with something there and renamed the quorum folder, and now its working. No data seems to be lost.

I can share with you the quorum folder but I don't feel ok uploading here. If you can share with me a email or some place where upload it I will send the files to you.

Thanks for your fast reply.

corvusmod commented 5 years ago

@michaelklishin Sent the file to your email. For me it's working so change the state of the issue to whatever you want.

michaelklishin commented 5 years ago

I've got the directory, thanks. @corvusmod the quorum data used by the MQTT plugin is transient in nature so you wouldn't notice anything lost, indeed.

michaelklishin commented 5 years ago

We have identified a segment file that is 0 bytes. Will file a more specific issue around the handling of those.

michaelklishin commented 5 years ago

We have reviewed the data dir and logs provided by @corvusmod (thanks!) and so far concluded that

johanrhodin commented 4 years ago

We ran into a version of this (3.8.2). Is the data directory of interest?

michaelklishin commented 4 years ago

Together with logs as we need to know the sequence of events and node names.

On Tue, Jan 14, 2020 at 12:13 AM Johan Rhodin notifications@github.com wrote:

We ran into a version of this (3.8.2). Is the data directory of interest?

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/rabbitmq/discussions/issues/4?email_source=notifications&email_token=AAAAIQRCMLLISGQ4SGVE36DQ5TKOXA5CNFSM4JLWDVO2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEI2JYMI#issuecomment-573873201, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAAIQVJFBBZGKP6E5JRJPTQ5TKOXANCNFSM4JLWDVOQ .

-- MK

Staff Software Engineer, Pivotal/RabbitMQ