matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

An event was sent twice down `/sync` (and down application service txns?) consistently across clients; ghost notifications #14539

Open MichaelSasser opened 1 year ago

MichaelSasser commented 1 year ago

Description

Ghost notifications

Since synapse v1.70 or v1.71, I've noticed randomly occurring "ghost notifications" on element-web, element-desktop, element-android and hydrogen. They usually come in small, fast bursts of two to four notifications (I hear an audible sound telling me about new events in a room). When I check for new events, there aren't any; nothing has changed. I have the options Developer mode and Show hidden events in timeline enabled in element, so I would assume I'd see any changes to a room.

Event duplication

Around the same time, I noticed some events, some from weeks or months ago, being randomly duplicated and pushed as new ones. This happens independently of the "ghost notification" issue. At first, I only noticed this in one particular room, where the same messages were duplicated a few times over a couple of days (example: see "Image 1" below), possibly leading to an unstable room (see "Possible implications" below). Today, the same thing happened in our mjolnir control room, where mjolnir (configured as a client, not as an appservice) executed the duplicated command (example: see "Image 2" below). I'm actually not certain my homeserver is the one the issues originate from, as the duplicated events are visible on other instances too, including matrix.org and maybe others.

Image 1: Duplicated messages in the python room

event_duplicated_python_room

Image 2: Mjolnir bot reacts to duplicated events

The "failed to leave" error from mjolnir is unrelated (and already fixed). The duplication errors happened before the room, it was trying to leave, had issues, too, and might have led to the issue, where the room was becoming unstable on my homeserver. event_duplicated_mjolnir

Possible implications

After this happened a few times in the Python room #python:matrix.org, the room became unstable on my end, and I was unable to send events to that room any more because of missing state events. I also wasn't able to remove the room with the admin API, so I had to manually remove the room from the database to regain access:

```sql
BEGIN;
SET CONSTRAINTS ALL DEFERRED;

DELETE FROM appservice_room_list WHERE room_id = '!roomid:matrix.org';
DELETE FROM batch_events WHERE room_id = '!roomid:matrix.org';
DELETE FROM current_state_events WHERE room_id = '!roomid:matrix.org';
DELETE FROM blocked_rooms WHERE room_id = '!roomid:matrix.org';
DELETE FROM current_state_delta_stream WHERE room_id = '!roomid:matrix.org';
DELETE FROM destination_rooms WHERE room_id = '!roomid:matrix.org';
DELETE FROM device_lists_changes_in_room WHERE room_id = '!roomid:matrix.org';
DELETE FROM e2e_room_keys WHERE room_id = '!roomid:matrix.org';
DELETE FROM event_auth_chain_to_calculate WHERE room_id = '!roomid:matrix.org';
DELETE FROM event_auth WHERE room_id = '!roomid:matrix.org';
DELETE FROM event_backward_extremities WHERE room_id = '!roomid:matrix.org';
DELETE FROM event_failed_pull_attempts WHERE room_id = '!roomid:matrix.org';
DELETE FROM event_json WHERE room_id = '!roomid:matrix.org';
DELETE FROM event_forward_extremities WHERE room_id = '!roomid:matrix.org';
DELETE FROM event_edges WHERE room_id = '!roomid:matrix.org';
DELETE FROM event_labels WHERE room_id = '!roomid:matrix.org';
DELETE FROM event_push_actions WHERE room_id = '!roomid:matrix.org';
DELETE FROM event_push_summary WHERE room_id = '!roomid:matrix.org';
DELETE FROM event_reports WHERE room_id = '!roomid:matrix.org';
DELETE FROM event_search WHERE room_id = '!roomid:matrix.org';
DELETE FROM event_txn_id WHERE room_id = '!roomid:matrix.org';
DELETE FROM event_edges WHERE event_id IN (SELECT event_id FROM events WHERE room_id = '!roomid:matrix.org');
DELETE FROM event_auth_chains WHERE event_id IN (SELECT event_id FROM events WHERE room_id = '!roomid:matrix.org');
DELETE FROM event_relations WHERE event_id IN (SELECT event_id FROM events WHERE room_id = '!roomid:matrix.org');
DELETE FROM events WHERE room_id = '!roomid:matrix.org';
DELETE FROM federation_inbound_events_staging WHERE room_id = '!roomid:matrix.org';
DELETE FROM insertion_event_edges WHERE room_id = '!roomid:matrix.org';
DELETE FROM insertion_event_extremities WHERE room_id = '!roomid:matrix.org';
DELETE FROM insertion_events WHERE room_id = '!roomid:matrix.org';
DELETE FROM local_current_membership WHERE room_id = '!roomid:matrix.org';
DELETE FROM partial_state_events WHERE room_id = '!roomid:matrix.org';
DELETE FROM partial_state_rooms_servers WHERE room_id = '!roomid:matrix.org';
DELETE FROM pusher_throttle WHERE room_id = '!roomid:matrix.org';
DELETE FROM receipts_linearized WHERE room_id = '!roomid:matrix.org';
DELETE FROM room_account_data WHERE room_id = '!roomid:matrix.org';
DELETE FROM room_aliases WHERE room_id = '!roomid:matrix.org';
DELETE FROM room_depth WHERE room_id = '!roomid:matrix.org';
DELETE FROM room_memberships WHERE room_id = '!roomid:matrix.org';
DELETE FROM room_retention WHERE room_id = '!roomid:matrix.org';
DELETE FROM room_stats_current WHERE room_id = '!roomid:matrix.org';
DELETE FROM room_stats_earliest_token WHERE room_id = '!roomid:matrix.org';
DELETE FROM room_stats_state WHERE room_id = '!roomid:matrix.org';
DELETE FROM state_groups_state WHERE room_id = '!roomid:matrix.org';
DELETE FROM rooms WHERE room_id = '!roomid:matrix.org';
DELETE FROM user_directory WHERE room_id = '!roomid:matrix.org';
DELETE FROM room_tags WHERE room_id = '!roomid:matrix.org';
DELETE FROM room_tags_revisions WHERE room_id = '!roomid:matrix.org';
DELETE FROM state_groups WHERE room_id = '!roomid:matrix.org';
DELETE FROM state_events WHERE room_id = '!roomid:matrix.org';
DELETE FROM users_in_public_rooms WHERE room_id = '!roomid:matrix.org';
DELETE FROM users_who_share_private_rooms WHERE room_id = '!roomid:matrix.org';
DELETE FROM stream_ordering_to_exterm WHERE room_id = '!roomid:matrix.org';
DELETE FROM receipts_graph WHERE room_id = '!roomid:matrix.org';
DELETE FROM threads WHERE room_id = '!roomid:matrix.org';
DELETE FROM partial_state_rooms WHERE room_id = '!roomid:matrix.org';
COMMIT;

REINDEX (VERBOSE) DATABASE synapse;
REINDEX (VERBOSE) SYSTEM synapse;
VACUUM FULL VERBOSE ANALYZE;
```

(Because I had to roll back a few times while I was trying to figure out how to remove the room, I accidentally deleted the logs as well.)
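As an aside: before attempting this kind of manual surgery, it can help to ask Postgres itself which tables reference a room at all. A minimal sketch using psycopg2, with placeholder connection details; note that tables keyed only by event_id (event_relations, event_auth_chains, event_edges) will not show up here and need the subquery approach used above:

```python
# Minimal sketch: list every public table with a room_id column, to
# sanity-check coverage before manual deletion. Connection details are
# placeholders -- substitute your own.
import psycopg2

conn = psycopg2.connect(
    host="matrix-postgres", dbname="synapse", user="synapse", password="..."
)
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT table_name
        FROM information_schema.columns
        WHERE column_name = 'room_id' AND table_schema = 'public'
        ORDER BY table_name
        """
    )
    for (table,) in cur.fetchall():
        print(table)
```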

Steps to reproduce

Homeserver

matrix.org; michaelsasser.org

Synapse Version

matrixdotorg/synapse; tags: v1.71.0, v1.72.0, possibly v1.70.0

Installation Method

Docker (matrixdotorg/synapse)

Database

postgres:15.0-alpine on the same machine. It was recently restored to upgrade to v15 (not sure if the issue happened before that), and restored again while I was trying to remove the room; the issues happened before.

Workers

Multiple workers

Platform

Environment: CX31 Hetzner Cloud, using the Ansible playbook spantaleev/matrix-docker-ansible-deploy
Debian: Linux matrix 5.10.0-19-amd64 #1 SMP Debian 5.10.149-2 (2022-10-21) x86_64 GNU/Linux
Docker: Docker version 20.10.21, build baeda1f
Containerd: containerd containerd.io 1.6.10 770bd0108c32f3fb5c73ae1264f7e503fe7b2661

Configuration

Playbook

Using the Ansible playbook [spantaleev/matrix-docker-ansible-deploy]; the configuration spans multiple files (result below):

Playbook/config

```yaml
---
# hookshot
matrix_hookshot_enabled: true # default
matrix_hookshot_gitlab_enabled: true
matrix_hookshot_permissions:
  # Allow all users to send commands to existing services
  - actor: "*"
    services:
      - service: gitlab
        level: commands
  # Allow users on this domain to log in to gitlab.
  - actor: "!foo:michaelsasser.org"
    services:
      - service: gitlab
        level: login
  # Allow this specific user to do any action
  - actor: "@foo:michaelsasser.org"
    services:
      - service: "*"
        level: admin
---
# Mjolnir
matrix_bot_mjolnir_enabled: true
matrix_bot_mjolnir_access_token: "foo"
matrix_bot_mjolnir_management_room: "!foo:michaelsasser.org"
matrix_bot_mjolnir_configuration_extension_yaml: |
  backgroundDelayMS: 50
  fasterMembershipChecks: true
  recordIgnoredInvites: true
  commands:
    additionalPrefixes:
      - mjolnir
      - bot
  automaticallyRedactForReasons:
    - spam
    - advertising
    - advertisement
    - ad
    - scam
  protections:
    wordlist:
      words:
        - foo
        - more foo
      minutesBeforeTrusting: 1234 # 1234 is not the actual time
---
# ssl
matrix_ssl_lets_encrypt_support_email: foo@MichaelSasser.org
---
# element
# set integration manager
matrix_client_element_integrations_ui_url: "https://scalar.vector.im/"
matrix_client_element_integrations_rest_url: "https://scalar.vector.im/api"
matrix_client_element_integrations_widgets_urls: ["https://scalar.vector.im/api"]
# set server room directory
matrix_client_element_roomdir_servers:
  - michaelsasser.org
  - matrix.org
# Extra themes
matrix_client_element_themes_enabled: false
# Default Template
matrix_client_element_default_theme: "dark"
matrix_client_element_configuration_extension_json: |
  {
    "sso_redirect_options": {
      "immediate": true
    }
  }
---
# hydrogen
# matrix_client_hydrogen_enabled: true
---
matrix_grafana_enabled: true
matrix_grafana_anonymous_access: false
matrix_grafana_default_admin_user: foo # Do not change
matrix_grafana_default_admin_password: foo
matrix_grafana_container_extra_arguments:
  - -e GF_AUTH_GENERIC_OAUTH_NAME="Single Sign-On"
  - -e GF_AUTH_GENERIC_OAUTH_ENABLED="true"
  - -e GF_AUTH_GENERIC_OAUTH_CLIENT_ID="matrix-grafana"
  - -e GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET="foo"
  - -e GF_AUTH_GENERIC_OAUTH_SCOPES="openid"
  - -e GF_AUTH_GENERIC_OAUTH_AUTH_URL="https://foo.michaelsasser.org/realms/MichaelSasser/protocol/openid-connect/auth"
  - -e GF_AUTH_GENERIC_OAUTH_TOKEN_URL="https://foo.michaelsasser.org/realms/MichaelSasser/protocol/openid-connect/token"
  - -e GF_AUTH_GENERIC_OAUTH_API_URL="https://foo.michaelsasser.org/realms/MichaelSasser/protocol/openid-connect/userinfo"
  - -e GF_AUTH_GENERIC_OAUTH_ALLOW_SIGN_UP="true"
  - -e GF_AUTH_GENERIC_OAUTH_ROLE_ATTRIBUTE_PATH="contains(roles[*], 'Admin') && 'Admin' || contains(roles[*], 'Editor') && 'Editor' || contains(roles[*], 'Viewer') && 'Viewer'|| ''"
  - -e GF_AUTH_GENERIC_OAUTH_ROLE_ATTRIBUTE_STRICT="true"
  # Disable Login Form
  - -e GF_AUTH_DISABLE_LOGIN_FORM="true"
  - -e GF_AUTH_OAUTH_AUTO_LOGIN="true"
  - -e GF_AUTH_BASIC_ENABLED="false"
---
# email
matrix_mailer_sender_address: "foo@MichaelSasser.org"
matrix_mailer_relay_use: true
matrix_mailer_relay_host_name: "foo.michaelsasser.org"
matrix_mailer_relay_host_port: 587
matrix_mailer_relay_auth: true
matrix_mailer_relay_auth_username: "foo@michaelsasser.org"
matrix_mailer_relay_auth_password: "foo"
---
matrix_synapse_rust_synapse_compress_state_find_rooms_command_wait_time: 1200
matrix_synapse_rust_synapse_compress_state_compress_room_time: 68400
matrix_synapse_rust_synapse_compress_state_psql_import_time: 68400
# Base Domain Serving
matrix_nginx_proxy_base_domain_serving_enabled: true
matrix_nginx_proxy_connect_timeout: 120
matrix_nginx_proxy_send_timeout: 120
matrix_nginx_proxy_read_timeout: 120
matrix_nginx_send_timeout: 120
# Fix issue where the admin api is not reachable anymore
matrix_nginx_proxy_proxy_matrix_client_redirect_root_uri_to_domain: ""
matrix_nginx_proxy_proxy_matrix_client_api_forwarded_location_synapse_admin_api_enabled: true
---
matrix_postgres_connection_password: "foo"
matrix_postgres_container_postgres_bind_port: 5432
matrix_postgres_process_extra_arguments: [
  "-c max_connections=120",
  "-c shared_buffers=512MB",
  "-c effective_cache_size=1536MB",
  "-c maintenance_work_mem=128MB",
  "-c checkpoint_completion_target=0.9",
  "-c wal_buffers=16MB",
  "-c default_statistics_target=100",
  "-c random_page_cost=1.1",
  "-c effective_io_concurrency=200",
  "-c work_mem=2184kB",
  "-c min_wal_size=1GB",
  "-c max_wal_size=4GB",
  "-c max_worker_processes=2",
  "-c max_parallel_workers_per_gather=1",
  "-c max_parallel_workers=2",
  "-c max_parallel_maintenance_workers=1"
]
# Postgres Backups
matrix_postgres_backup_enabled: true
matrix_postgres_backup_keep_months: 12
---
# grafana prometheus
matrix_prometheus_enabled: true
matrix_prometheus_node_exporter_enabled: true
matrix_prometheus_postgres_exporter_enabled: false
matrix_prometheus_postgres_exporter_database_username: "matrix_prometheus_postgres_exporter"
matrix_prometheus_postgres_exporter_database_password: "foo"
matrix_prometheus_node_exporter_process_extra_arguments:
  - "--collector.disable-defaults"
  - "--collector.cpu"
  - "--collector.filesystem"
  - "--collector.os"
  - "--collector.stat"
  - "--collector.meminfo"
  - "--collector.loadavg"
  - "--collector.time"
  - "--collector.uname" # Mandatory, otherwise the data vanishes from grafana
---
# synapse
# Federation
matrix_synapse_federation_enabled: true
matrix_synapse_allow_public_rooms_over_federation: true
# Controls whether accessing the server's public rooms directory can be done
# without authentication.
# For private servers, you most likely wish to require authentication,
# unless you know what list of rooms you're publishing to the world and
# explicitly want to do it.
matrix_synapse_allow_public_rooms_without_auth: true
# Matrix cache size (ram) default: 0.5
matrix_synapse_caches_global_factor: 1.5 # 0.5, is default
matrix_synapse_event_cache_size: "100K"
# OpenID (Keycloak) (rest see below in matrix_synapse_configuration_extension_yaml)
matrix_nginx_proxy_proxy_matrix_client_api_forwarded_location_synapse_oidc_api_enabled: true
# Extra Synapse homeserver.yaml config
matrix_synapse_configuration_extension_yaml: |
  suppress_key_server_warning: true # defaults to matrix.org
  ####################
  # room complexity
  ####################
  limit_remote_rooms:
    enabled: true
    complexity: 0.5
    complexity_error: "This room is too complex."
    admins_can_join: true
  ####################
  # blacklist
  ####################
  ip_range_blacklist:
    - "123.123.123.123/32" # foo.xyz
    - "123.123.123.123/32" # matrix.foo.com
    - "123.123.123.123/32" # matrix.foo.sh
    - "123.123.123.123/32" # matrix.foo.net
    - "123.123.123.123/32" # matrix.foo.biz (CNAME)
  ####################
  # retention
  ####################
  redaction_retention_period: 30d
  retention:
    enabled: true
    default_policy:
      min_lifetime: 4w
      max_lifetime: 8w
    allowed_lifetime_min: 1d
    allowed_lifetime_max: 24w
    purge_jobs:
      - longest_max_lifetime: 3d
        interval: 12h
      - shortest_max_lifetime: 3d
        interval: 1d
  ####################
  # server notice
  ####################
  server_notices:
    system_mxid_localpart: server
    system_mxid_display_name: "Server Notices"
    system_mxid_avatar_url: "mxc://{{ matrix_domain }}/foo"
    room_name: "Server Notices"
  ####################
  # OIDC
  ####################
  oidc_providers:
    - idp_id: keycloak
      idp_name: "Single Sign-On"
      issuer: "https://foo.michaelsasser.org/realms/MichaelSasser"
      client_id: "synapse"
      client_secret: "foo"
      scopes: ["openid", "profile"]
      allow_existing_users: true
      user_mapping_provider:
        config:
          localpart_template: "{% raw %}{{ user.preferred_username }}{% endraw %}"
          display_name_template: "{% raw %}{{ user.name }}{% endraw %}"
# Media retention
matrix_synapse_media_retention_local_media_lifetime: 8w
matrix_synapse_media_retention_remote_media_lifetime: 1w
# workers
matrix_synapse_workers_enabled: true
matrix_synapse_workers_preset: michaelsasser_org
# https://github.com/matrix-org/synapse/blob/develop/docs/workers.md
matrix_synapse_workers_presets:
  michaelsasser_org:
    generic_workers_count: 1
    pusher_workers_count: 1
    federation_sender_workers_count: 1
    media_repository_workers_count: 1
    appservice_workers_count: 1
    user_dir_workers_count: 1
    background_workers_count: 1
    stream_writer_events_stream_workers_count: 0
    stream_writer_typing_stream_workers_count: 0
    stream_writer_to_device_stream_workers_count: 0
    stream_writer_account_data_stream_workers_count: 0
    stream_writer_receipts_stream_workers_count: 0
    stream_writer_presence_stream_workers_count: 0
# A secret used to protect access keys issued by the server.
# You can put any string here, but generating a strong one is preferred
# (e.g. `pwgen -s 64 1`).
matrix_synapse_macaroon_secret_key: "foo"
matrix_homeserver_generic_secret_key: "{{ matrix_synapse_macaroon_secret_key }}"
matrix_synapse_max_upload_size_mb: 256
matrix_synapse_auto_join_rooms:
  - "#support:{{ matrix_domain }}"
# matrix_synapse_use_presence: true # Disable on high load
# Shared Secret Provider
matrix_synapse_ext_password_provider_shared_secret_auth_enabled: false # disabled to make sure no m.password authentication is available
matrix_synapse_ext_password_provider_shared_secret_auth_m_login_password_support_enabled: false # disabled to make sure no m.password authentication is available
matrix_synapse_ext_password_provider_shared_secret_auth_shared_secret: "foo"
---
devture_timesync_ntpd_package: systemd-timesyncd
devture_timesync_ntpd_service: systemd-timesyncd
# A shared secret (between Coturn and Synapse) used for authentication.
# You can put any string here, but generating a strong one is preferred
# (e.g. `pwgen -s 64 1`).
matrix_coturn_turn_static_auth_secret: "foo"
---
# Enable location sharing
matrix_well_known_matrix_client_configuration_extension_json: |-
  {
    "org.matrix.msc3488.tile_server": {
      "map_style_url": "https://api.maptiler.com/maps/streets/style.json?key=foo"
    },
    "m.tile_server": {
      "map_style_url": "https://api.maptiler.com/maps/streets/style.json?key=foo"
    },
    "setting_defaults": {
      "MessageComposerInput.showStickersButton": false
    },
    "features": {
      "feature_latex_maths": true,
      "feature_pinning": true,
      "feature_thread": true,
      "feature_html_topic": true
    },
    "admins": [
      {
        "matrix_id": "@foo:michaelsasser.org",
        "email_address": "foo@michaelsasser.org",
        "role": "admin"
      },
      {
        "email_address": "foo@michaelsasser.org",
        "role": "security"
      }
    ]
  }
```

Results in:

homeserver.yaml

```yaml
account_threepid_delegates:
  msisdn: ''
alias_creation_rules:
- action: allow
  alias: '*'
  room_id: '*'
  user_id: '*'
allow_guest_access: false
allow_public_rooms_over_federation: true
allow_public_rooms_without_auth: true
allowed_avatar_mimetypes:
- image/png
- image/jpeg
- image/gif
app_service_config_files:
- /hookshot-registration.yml
auto_join_rooms:
- '#foo:michaelsasser.org'
autocreate_auto_join_rooms: true
background_updates: null
caches:
  global_factor: 1.5
  per_cache_factors: null
cas_config: null
daemonize: false
database:
  args:
    cp_max: 10
    cp_min: 5
    database: synapse
    host: matrix-postgres
    password: foo
    port: 5432
    user: synapse
  name: psycopg2
  txn_limit: 0
default_room_version: '9'
email:
  app_name: Matrix
  client_base_url: https://element.michaelsasser.org
  enable_notifs: true
  invite_client_location: https://app.element.io
  notif_for_new_users: true
  notif_from: Matrix
  require_transport_security: false
  smtp_host: matrix-mailer
  smtp_port: 8025
enable_media_repo: false
enable_metrics: true
enable_registration: false
enable_registration_captcha: false
enable_registration_without_verification: false
enable_room_list_search: true
encryption_enabled_by_default_for_room_type: 'off'
event_cache_size: 100K
federation_rr_transactions_per_room_per_second: 50
federation_sender_instances:
- matrix-synapse-worker-federation-sender-0
form_secret: foo
include_profile_data_on_invite: true
instance_map: {}
ip_range_blacklist:
- 123.123.123.123/32 # not actually 123...
- 123.123.123.123/32
- 123.123.123.123/32
- 123.123.123.123/32
- 123.123.123.123/32
limit_profile_requests_to_users_who_share_rooms: false
limit_remote_rooms:
  admins_can_join: true
  complexity: 0.5
  complexity_error: This room is too complex.
  enabled: true
listeners:
- bind_addresses:
  - 0.0.0.0
  port: 9100
  type: metrics
- bind_addresses:
  - '::'
  port: 8008
  resources:
  - compress: false
    names:
    - client
  tls: false
  type: http
  x_forwarded: true
- bind_addresses:
  - '::'
  port: 8048
  resources:
  - compress: false
    names:
    - federation
  tls: false
  type: http
  x_forwarded: true
- bind_addresses:
  - 0.0.0.0
  port: 9093
  resources:
  - names:
    - replication
  type: http
log_config: /data/matrix.michaelsasser.org.log.config
macaroon_secret_key: foo
manhole_settings: null
max_spider_size: 10M
max_upload_size: 256M
media_instance_running_background_jobs: matrix-synapse-worker-media-repository-0
media_retention:
  local_media_lifetime: 8w
  remote_media_lifetime: 1w
media_storage_providers: []
media_store_path: /matrix-media-store-parent/media-store
metrics_flags: null
modules: []
notify_appservices_from_worker: matrix-synapse-worker-appservice-0
oembed: null
oidc_providers:
- allow_existing_users: true
  client_id: synapse
  client_secret: foo
  idp_id: keycloak
  idp_name: Single Sign-On
  issuer: https://foo.michaelsasser.org/realms/MichaelSasser
  scopes:
  - openid
  - profile
  user_mapping_provider:
    config:
      display_name_template: '{{ user.name }}'
      localpart_template: '{{ user.preferred_username }}'
old_signing_keys: null
opentracing: null
password_config:
  localdb_enabled: false
  pepper: ''
  policy: null
pid_file: /homeserver.pid
presence:
  enabled: true
public_baseurl: https://matrix.michaelsasser.org/
purge_jobs:
- interval: 12h
  longest_max_lifetime: 3d
- interval: 1d
  shortest_max_lifetime: 3d
push:
  include_content: true
pusher_instances:
- matrix-synapse-worker-pusher-0
rc_admin_redaction:
  burst_count: 50
  per_second: 1
rc_federation:
  concurrent: 3
  reject_limit: 50
  sleep_delay: 500
  sleep_limit: 10
  window_size: 1000
rc_invites:
  per_issuer:
    burst_count: 10
    per_second: 0.3
  per_room:
    burst_count: 10
    per_second: 0.3
  per_user:
    burst_count: 5
    per_second: 0.003
rc_joins:
  local:
    burst_count: 10
    per_second: 0.1
  remote:
    burst_count: 10
    per_second: 0.01
rc_login:
  account:
    burst_count: 3
    per_second: 0.17
  address:
    burst_count: 3
    per_second: 0.17
  failed_attempts:
    burst_count: 3
    per_second: 0.17
rc_message:
  burst_count: 10
  per_second: 0.2
rc_registration:
  burst_count: 3
  per_second: 0.17
recaptcha_private_key: ''
recaptcha_public_key: ''
redaction_retention_period: 30d
redis:
  enabled: true
  host: matrix-redis
  password: null
  port: 6379
registration_requires_token: false
registration_shared_secret: foo
report_stats: false
require_auth_for_profile_requests: false
retention:
  allowed_lifetime_max: 24w
  allowed_lifetime_min: 1d
  default_policy:
    max_lifetime: 8w
    min_lifetime: 4w
  enabled: true
room_list_publication_rules:
- action: allow
  alias: '*'
  room_id: '*'
  user_id: '*'
room_prejoin_state: null
run_background_tasks_on: matrix-synapse-worker-background-0
saml2_config:
  sp_config: null
  user_mapping_provider:
    config: null
send_federation: false
server_name: michaelsasser.org
server_notices:
  room_name: Server Notices
  system_mxid_avatar_url: mxc://michaelsasser.org/foo
  system_mxid_display_name: Server Notices
  system_mxid_localpart: server
signing_key_path: /data/matrix.michaelsasser.org.signing.key
spam_checker: []
sso: null
start_pushers: false
stats: null
stream_writers: {}
suppress_key_server_warning: true
templates: null
tls_certificate_path: null
tls_private_key_path: null
trusted_key_servers:
- server_name: matrix.org
turn_allow_guests: false
turn_shared_secret: foo
turn_uris:
- turn:matrix.michaelsasser.org?transport=udp
- turn:matrix.michaelsasser.org?transport=tcp
ui_auth: null
update_user_directory_from_worker: matrix-synapse-worker-user-dir-0
url_preview_accept_language:
- en-US
- en
url_preview_enabled: true
url_preview_ip_range_blacklist:
- 127.0.0.0/8
- 10.0.0.0/8
- 172.16.0.0/12
- 192.168.0.0/16
- 100.64.0.0/10
- 192.0.0.0/24
- 169.254.0.0/16
- 192.88.99.0/24
- 198.18.0.0/15
- 192.0.2.0/24
- 198.51.100.0/24
- 203.0.113.0/24
- 224.0.0.0/4
- ::1/128
- fe80::/10
- fc00::/7
- 2001:db8::/32
- ff00::/8
- fec0::/10
user_directory: null
user_ips_max_age: 28d
worker_app: synapse.app.homeserver
```

Workers

```yaml
# $cat worker.matrix-synapse-worker-*
################################################
worker_app: synapse.app.generic_worker
worker_name: matrix-synapse-worker-appservice-0
worker_daemonize: false
worker_log_config: /data/matrix.michaelsasser.org.log.config
worker_replication_host: matrix-synapse
worker_replication_http_port: 9093

################################################
worker_listeners:
  - type: metrics
    bind_addresses: ['0.0.0.0']
    port: 19300

worker_app: synapse.app.generic_worker
worker_name: matrix-synapse-worker-background-0
worker_daemonize: false
worker_log_config: /data/matrix.michaelsasser.org.log.config
worker_replication_host: matrix-synapse
worker_replication_http_port: 9093

################################################
worker_listeners:
  - type: metrics
    bind_addresses: ['0.0.0.0']
    port: 19700

worker_app: synapse.app.federation_sender
worker_name: matrix-synapse-worker-federation-sender-0
worker_daemonize: false
worker_log_config: /data/matrix.michaelsasser.org.log.config
worker_replication_host: matrix-synapse
worker_replication_http_port: 9093

################################################
worker_listeners:
  - type: metrics
    bind_addresses: ['0.0.0.0']
    port: 19400

worker_app: synapse.app.generic_worker
worker_name: matrix-synapse-worker-generic-0
worker_daemonize: false
worker_log_config: /data/matrix.michaelsasser.org.log.config
worker_replication_host: matrix-synapse
worker_replication_http_port: 9093
worker_main_http_uri: http://matrix-synapse:8008

################################################
worker_listeners:
  - type: http
    bind_addresses: ['::']
    x_forwarded: true
    port: 18111
    resources:
      - names: ["client", "federation"]
  - type: metrics
    bind_addresses: ['0.0.0.0']
    port: 19111

worker_app: synapse.app.media_repository
worker_name: matrix-synapse-worker-media-repository-0
worker_daemonize: false
worker_log_config: /data/matrix.michaelsasser.org.log.config
worker_replication_host: matrix-synapse
worker_replication_http_port: 9093

################################################
worker_listeners:
  - type: http
    bind_addresses: ['::']
    x_forwarded: true
    port: 18551
    resources:
      - names: ["media"]
  - type: metrics
    bind_addresses: ['0.0.0.0']
    port: 19551

worker_app: synapse.app.pusher
worker_name: matrix-synapse-worker-pusher-0
worker_daemonize: false
worker_log_config: /data/matrix.michaelsasser.org.log.config
worker_replication_host: matrix-synapse
worker_replication_http_port: 9093

################################################
worker_listeners:
  - type: metrics
    bind_addresses: ['0.0.0.0']
    port: 19200

worker_app: synapse.app.generic_worker
worker_name: matrix-synapse-worker-user-dir-0
worker_daemonize: false
worker_log_config: /data/matrix.michaelsasser.org.log.config
worker_replication_host: matrix-synapse
worker_replication_http_port: 9093

################################################
worker_listeners:
  - type: http
    bind_addresses: ['::']
    x_forwarded: true
    port: 18661
    resources:
      - names: ["client"]
  - type: metrics
    bind_addresses: ['0.0.0.0']
    port: 19661
```

Relevant log output

```log
# Synapse + workers during the time the duplication, see "Image 2" happened:
-- Journal begins at Sat 2022-11-12 20:25:09 CET, ends at Wed 2022-11-23 18:42:30 CET. --
Nov 23 11:33:23 matrix matrix-synapse[1029358]: 2022-11-23 10:33:23,569 - synapse.handlers.device - 1132 - WARNING - _maybe_retry_device_resync-770 - Failed to handle device list update for @kosyne:kaosteam.net: Failed to send request: DNSLookupError: DNS lookup failed: no results for hostname lookup: matrix.kaosteam.net.
Nov 23 11:35:23 matrix matrix-synapse[1029358]: 2022-11-23 10:35:23,700 - synapse.handlers.device - 1132 - WARNING - _maybe_retry_device_resync-774 - Failed to handle device list update for @dries.staelens:rubicon.biz: Failed to send request: DNSLookupError: DNS lookup failed: no results for hostname lookup: rubicon.biz.
Nov 23 11:39:17 matrix matrix-synapse-worker-federation-sender-0[1030801]: 2022-11-23 10:39:17,029 - synapse.federation.sender.per_destination_queue - 368 - WARNING - federation_transaction_transmission_loop-21408 - TX [sameteem.com] Failed to send transaction: Failed to send request: TimeoutError: User timeout caused connection failure.
Nov 23 11:40:07 matrix matrix-synapse-worker-federation-sender-0[1030801]: 2022-11-23 10:40:07,211 - synapse.federation.sender.per_destination_queue - 368 - WARNING - federation_transaction_transmission_loop-21582 - TX [chat.jacobsen.app] Failed to send transaction: Failed to send request: TimeoutError: User timeout caused connection failure.
Nov 23 11:40:22 matrix matrix-synapse-worker-federation-sender-0[1030801]: 2022-11-23 10:40:22,671 - synapse.federation.sender.per_destination_queue - 368 - WARNING - federation_transaction_transmission_loop-21588 - TX [chat.perafree.ml] Failed to send transaction: Failed to send request: TimeoutError: User timeout caused connection failure.
Nov 23 11:45:53 matrix matrix-synapse[1029358]: 2022-11-23 10:45:53,782 - synapse.handlers.device - 1132 - WARNING - _maybe_retry_device_resync-795 - Failed to handle device list update for @dries.staelens:rubicon.biz: Failed to send request: DNSLookupError: DNS lookup failed: no results for hostname lookup: rubicon.biz.
```

Anything else that would be useful to know?

Full log during the time the duplication (see "Image 2") happened: message_duplication.log

All times & dates are UTC+1.

DMRobertson commented 1 year ago

Thank you for your thorough report.

On ghost notifications: it's difficult to say what the cause could be without timestamps or logs to start digging into. I think if someone edits a message you are named or @named in, then you will hear an audible notification but not receive an obvious prompt as to why. I'm not sure if that explains what you've seen. If it happens again, please submit client logs (e.g. /rageshake on Element Web) and mention this issue in the description.

On event duplication: it's not immediately obvious to me that something has gone wrong in image 1. What makes you say that the two messages from 10th October 2022 are duplicated, rather than appearing out of order? Are you seeing the same message/state event take place with two different event IDs?

It's possible that:

- bekon232 and Szwendacz are hosted on another homeserver (X) to your own (Y).
- The two homeservers temporarily lost connection. Connectivity was later restored.
- Something triggered your homeserver Y to retrieve missed messages from the other homeserver X. If so, the UX hasn't done a good job of communicating this.

I also note that Element-Web and the mobile client (Element Android?) show a consistent timeline, which suggests that Synapse is presenting the same timeline via /sync to the two clients.

I don't fully follow what's happened in image 2 either. Are Michael and Administrator on the same homeserver as the client taking the screenshot?

After this happened a few times in the Python room #python:matrix.org, the room became unstable on my end, and I was unable to send events to that room any more because of missing state events. I also wasn't able to remove the room with the admin API, so I had to manually remove the room from the database to regain access:

(Because I had to roll back a few times while I was trying to figure out how to remove the room, I accidentally deleted the logs as well.)

That's very unfortunate. Please reach out to us if you end up in that situation; it's generally very unsafe to manually operate on Synapse's database (I hope you found all the tables that reference a room ID).

MichaelSasser commented 1 year ago

Before I get into detail, I want to thank you for your time and effort. I also want to mention that I think the "duplicates" and "ghost notifications" issues are somehow related and have the same origin, which is why I only created one issue.

I think if someone edits a message you are named or @named in then you will hear an audible notification but not receive an obvious prompt as to why. I'm not sure if that explains what you've seen.

That was my first thought, too. But with the options Show hidden events in timeline and Developer mode enabled, I should see a { "type": "m.room.message" } event in the timeline, where I can see the event redacting the other, which wasn't the case.
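(For reference: an edit arrives as a new m.room.message event pointing at the original via an m.replace relation, so it should show up in the timeline as roughly the following shape; the IDs and bodies here are made up for illustration:)

```python
# Rough shape of an edit event as delivered down /sync (illustrative only;
# event IDs and message bodies are made up).
edit_event = {
    "type": "m.room.message",
    "content": {
        "msgtype": "m.text",
        "body": "* corrected text",  # fallback body, conventionally prefixed with "*"
        "m.new_content": {"msgtype": "m.text", "body": "corrected text"},
        "m.relates_to": {
            "rel_type": "m.replace",           # marks this event as an edit
            "event_id": "$original_event_id",  # the event being replaced
        },
    },
}
```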

If it happens again, please submit client logs (e.g. /rageshake on Element Web) and mention this issue in the description.

Right, I should have done that. I'll remember it for next time.

[I]t's not immediately obvious to me that something has gone wrong in image 1. What makes you say that the two messages from 10th October 2022 are duplicated, rather than appearing out of order?

"Duplicated" is probably the wrong word, as those events have the same event ids as those shown previously in the timeline. I used the term "duplicated" because they were shown multiple times at different locations in the timeline, at the same time.

bekon232 and Szwendacz are hosted on another homeserver (X) to your own (Y).

Yes, this is correct. bekon232 and Szwendacz are on the same homeserver, but not on mine. I don't think it should make a difference, but it is maybe worth mentioning that there are > 10k other users in the room, from 524 different servers. The room is not bridged. As far as I know, these are the only messages bekon232 and Szwendacz created in any room any of my users are in. "As far as I know" because I have retention enabled in the homeserver.yaml, and I could only find their m.room.member state event and the m.room.message events, as seen in the screenshot.

The two homeservers temporarily lost connection. Connectivity was later restored.

I haven't checked the logs specifically for their homeserver, but I do regular updates, and I have updated and restarted synapse during that process.

Something triggered your homeserver Y to retrieve missed messages from the other homeserver X. If so, the UX hasn't done a good job of communicating this.

I haven't looked into the details of federation on Matrix yet, but I can say that the first occurrence of those messages in the timeline was fairly close to the origin_server_ts of the events. I also didn't see anything about missing events in the logs (at least not for the room in question) before I had issues with the room, which was weeks later.

I also note that Element-Web and the mobile client (Element Android?) show a consistent timeline, which suggests that Synapse is presenting the same timeline via /sync to the two clients.

Yes, the timeline seemed to be consistent between element-web, element-desktop, element-android (yes, the mobile client was element-android) and hydrogen. Even after clearing the cache on element-web and the mobile client, I could still see those messages in the timeline, at least in the case where image 1 was taken. But I have also seen those messages vanish after clearing the cache.

I don't fully follow what's happened in image 2 either. Are Michael and Administrator on the same homeserver as the client taking the screenshot?

I should have mentioned that. Michael (me), Administrator (mjolnir, a moderation bot) and the person who took the screenshot (also me, from the "Michael" account) are on the same homeserver. They are in another room with 6 more users, some from other homeservers.

On Sunday, one or two days before the screenshot was taken, Michael sent a message to the room at 13:24 ("mjolnir room remove [...]"), and Administrator answered with the message "⚠ | Failed to leave [...]". Almost fifty minutes later, I restarted Administrator (mjolnir), and it sent four additional messages to the room at 14:09. Around the time I took the screenshot, one or two days after, and after a few more messages had been sent to the room, the same events appeared again in the timeline. As before, they have the same event IDs as the ones already in the room. Those "duplicates" in the timeline (same event IDs) are in the rectangle with the red border marked "1". For every message, a notification was created at 11:46 within a few milliseconds on element-android and element-web (I haven't checked other clients), and yes, notifications are set to "all messages" for that room. Thereafter (also at 11:46), Administrator executed the "duplicated" command (the second rectangle with the red border, marked "2"). Because the room is federated with matrix.org, I was able to check the timeline there, which was consistent with what I saw on my server. This means the original messages were there when they were sent on Sunday, and their "duplicates", with the same event IDs, appeared one or two days later, even after clearing the cache on both ends.

That's very unfortunate. Please reach out to us if you end up in that situation; it's generally very unsafe to manually operate on Synapse's database (I hope you found all the tables that reference a room ID).

I think so. Everything is up and running again, and I could join the room again afterwards. Besides some missing prev_events, everything looks alright. The room was stored in 55 tables, some tricky to find, as they don't have a room_id field, like event_relations, event_auth_chains and event_edges. It would have been nice, though, if the Delete Room API were able to purge a room from the database even though users couldn't leave.

DMRobertson commented 1 year ago

"As far as I know" because I have retention enabled in the homeserver.yml

That's an amber flag: retention is known to be buggy (e.g. vector-im/element-web#13476). Are you still seeing the duplicate event pattern today (even if sporadically)? If so, I'd suggest disabling retention to see if that stops the problem.

Because the room is federated with matrix.org, I was able to check the timeline there, which was consistent with what I saw on my server. This means the original messages were there when they were sent on Sunday, and their "duplicates", with the same event IDs, appeared one or two days later.

That is extremely surprising (and somewhat alarming). Are you able to share the room ID so we can cross-reference with matrix.org's logs and database? I can be reached privately at @dmrobertson:matrix.org if you would prefer not to post it publicly.

It actually would have been nice, though, if the Delete Room API was able to purge a room from the database, even though users couldn't leave.

It has a force_purge option for this, IIRC.
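For anyone finding this later, a sketch of such a call against the v2 delete room admin API; the base URL and token are placeholders:

```python
# Sketch: delete and force-purge a room via the Synapse admin API.
# The base URL and access token are placeholders.
import requests

resp = requests.delete(
    "https://matrix.example.org/_synapse/admin/v2/rooms/!roomid:matrix.org",
    headers={"Authorization": "Bearer <admin_access_token>"},
    json={"block": False, "purge": True, "force_purge": True},
)
resp.raise_for_status()
print(resp.json())  # returns a delete_id that can be polled for status
```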

MichaelSasser commented 1 year ago

That's an amber flag: retention is known to be buggy [...]

I've noticed minor issues pretty quickly after I turned it on, a year ago or so.

If so I'd suggest disabling retention to see if that stops the problem.

I've thought about that, too. However, I first wanted to try whether the Purge History API could fix/mitigate the problem, like it usually does with retention-related issues, which it actually did. Since purging the room history down to the second-to-last event, I haven't noticed any of the issues I described. Out of caution, I've disabled retention now.
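(For context, the call in question looks roughly like this sketch; base URL, token and event ID are placeholders:)

```python
# Sketch: purge room history up to a given event via the Purge History
# admin API. Base URL, access token, room and event IDs are placeholders.
import requests

resp = requests.post(
    "https://matrix.example.org/_synapse/admin/v1/purge_history/!roomid:matrix.org",
    headers={"Authorization": "Bearer <admin_access_token>"},
    json={
        "delete_local_events": False,  # keep events this server sent
        "purge_up_to_event_id": "$second_last_event_id",
    },
)
resp.raise_for_status()
print(resp.json())  # returns a purge_id that can be polled for status
```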

Are you able to share the room ID so we can cross-reference with matrix.org's logs and database?

Sure. The room in image 1 is #python:matrix.org (!iuyQXswfjgxQMZGrfQ:matrix.org; public, non-e2ee room), and the other room (image 2) is !ReeNuqUuyQhMKliZVe:michaelsasser.org (invite-only, non-e2ee room). If you want to join the latter, please don't hesitate to ask me (@michael:michaelsasser.org), or use my matrix.org account @michaelsasser:matrix.org to /join yourself to the room. We have replaced it with a fresh room.

About the room in image 2: While going through the room history on matrix.org, I noticed many out-of-order events which were in order when I verified the situation before opening this issue. While scrolling through the room history, one of the "duplicated" messages vanished after a few minutes. Now, I can only find the message in its original place (where the event was created), surrounded by out-of-order messages.

It has a force_purge option for this, IIRC.

Oh, right, it has... This is embarrassing now :smile:. Either it was added after I implemented the functionality in my CLI tool, or I missed it twice in a row. Thanks for the hint!

johndball commented 1 year ago

I've been following this thread since it was opened. Both of my servers are experiencing the same issue, and I'm happy to share log data as needed. In a nutshell: after all messages are read, the app (desktop, web, mobile) still shows unread notifications. Additionally, the notification count is way off; for example, one new message shows up as five unread messages, and an active conversation with multiple messages shows counts in the thirties, forties, fifties, etc.

I do have retention enabled on both servers, with a server-specified policy and room policies.

retention:
  # The message retention policies feature is disabled by default. Uncomment the
  # following line to enable it.
  #
  enabled: true

Will continue to be a fly on the wall while y'all work through this thread.

MadLittleMods commented 1 year ago

Related to https://github.com/matrix-org/synapse/issues/11447

MichaelSasser commented 1 year ago

I've handed in another /rageshake when I saw the following after opening element (web):

explorer_XLo46wFcav

After sending the /rageshake, I noticed that this probably has nothing to do with this issue. The difference to this issue is:


Room: #python:matrix.org (!iuyQXswfjgxQMZGrfQ:matrix.org; public, non-e2ee room)

ArrayBolt3 commented 1 year ago

This problem seems to be getting worse as time goes by. I opened a few rooms hosted on a server affected by this issue (using the Element web app), and proceeded to get hit with approximately 30 notifications per chat message that I was receiving notifications for. It was also sending me notifications for quite a few chat messages even though those messages had been sent quite a while ago, resulting in probably in excess of 200 notifications having to be closed. (And for some reason Chrome made me close every single notification manually or wait for them to disappear in small batches by themselves, but that's a Chrome problem or a problem with my setup.)

MichaelSasser commented 1 year ago

Might be related to https://github.com/vector-im/element-web/issues/897 and https://github.com/vector-im/element-meta/issues/1491.

colemickens commented 11 months ago

This has gotten bad lately. I'm getting this very frequently when I'm mentioned (at least in one room in particular; I'm not sure if it's happening in all rooms yet).