irrdnet / irrd

IRRd version 4
http://irrd.readthedocs.io/en/stable/
Other
157 stars 52 forks source link

Possible issue in serial sequences on current RADB NRTM #877

Open dhutchin-ctl opened 12 months ago

dhutchin-ctl commented 12 months ago

Describe the bug It looks like serial numbers can sometimes be corrupted. Once they are corrupted, NRTM processing of further updates fails. The LEVEL3 instance of IRRdv4 is mirroring RADB. Consider the following entries from irrd.log (grep RADB):

2023-12-04 13:39:13,137 irrd[210576]: [irrd.mirroring.scheduler#DEBUG] Started new process RPSLMirrorImportUpdateRunner-RADB for mirror import/export for RADB 2023-12-04 13:39:13,175 irrd[1188550]: [irrd.mirroring.mirror_runners_import#DEBUG] Most recent mirrored serial for RADB: 255031, force_reload: False, nrtm enabled: True 2023-12-04 13:39:13,176 irrd[1188550]: [irrd.mirroring.mirror_runners_import#INFO] Retrieving NRTM updates for RADB from serial 255032 on nrtm.radb.net:43 2023-12-04 13:39:13,176 irrd[1188550]: [irrd.utils.whois_client#DEBUG] Running whois query -g RADB:3:255032-LAST on nrtm.radb.net port 43 2023-12-04 13:39:13,300 irrd[1188550]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for RADB: % Warning: there are no newer updates available

Above is the last entries where I got expected behavior. The "import_timer" seconds (900) later the following occurred.

2023-12-04 13:54:14,468 irrd[210576]: [irrd.mirroring.scheduler#DEBUG] Started new process RPSLMirrorImportUpdateRunner-RADB for mirror import/export for RADB 2023-12-04 13:54:14,506 irrd[1189565]: [irrd.mirroring.mirror_runners_import#DEBUG] Most recent mirrored serial for RADB: 255031, force_reload: False, nrtm enabled: True 2023-12-04 13:54:14,507 irrd[1189565]: [irrd.mirroring.mirror_runners_import#INFO] Retrieving NRTM updates for RADB from serial 255032 on nrtm.radb.net:43 2023-12-04 13:54:14,508 irrd[1189565]: [irrd.utils.whois_client#DEBUG] Running whois query -g RADB:3:255032-LAST on nrtm.radb.net port 43 2023-12-04 13:54:14,622 irrd[1189565]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for RADB: %% ERROR: Serials 255032 - 268299 do not exist

Starting at 2023-12-04 13:54:14, "255032-LAST" seems to be interpreted as "255032 - 268299" and produces an error about non-existent serials. The error persisted (for over 24 hours) until I performed "irrd_mirror_force_reload RADB". Some RADB updates were missed during this period. The updates were recovered when I did the force_reload. I suspect (without great evidence) that the error occurred in RADB's instance of IRRd. I have seen this same situation more than a year ago (so long before IRRdv4 4.4.2). I don't know if I am running the same version of IRRd now as I was the last time I saw this (but maybe).

To Reproduce nrtm_host is set to nrtm.radb.net Unknown how to reproduce the problem

Expected behaviour

IRRd version you are running I am running IRRdv4 4.2.6, I don't know what version RADB is running (suspect 4.4.2)

Additional context None

troy2914 commented 12 months ago

NTT is seeing the same thing on 4.2.8

troy2914 commented 12 months ago

And for the record we have had to irrd_mirror_force_reload RADB at least 3 times since their cutover.

dhutchin-ctl commented 12 months ago

I have had a total of 5 instances where we got into this mode since November of 2022. Below is the fist error message in each of the instances. Maybe these match up with Troy2914's instances:

2022-11-02 01:50:33,793 irrd[1845133]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for LACNIC: %% ERROR: Serials 103009 - 110626 do not exist 2023-11-13 18:02:33,554 irrd[3741814]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for RADB: %% ERROR: Serials 50 - 22293297 do not exist 2023-11-17 22:17:22,884 irrd[2275035]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for RADB: %% ERROR: Serials 47968 - 57319 do not exist 2023-11-25 03:48:12,089 irrd[1533810]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for RADB: %% ERROR: Serials 115384 - 123994 do not exist 2023-12-05 03:10:27,547 irrd[1687948]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for RADB: %% ERROR: Serials 255032 - 268299 do not exist

dhutchin-ctl commented 12 months ago

In my log file, I only see "Received NRTM response for ... Serials .... do not exist" occasionally. For instance, the RADB lines above. However, for queries to the LEVEL3 IRR, I see "encountered parsing error while parsing query "-g LEVEL3.... Serials .... do not exist" all the time. In all the cases where the query is TO the LEVEL3 IRR, the query range is and the reported non-existent serials is in the range of a small number to the number specified in the query. This is not the case with the RADB queries.

mxsasha commented 12 months ago

This is an interesting one. Given that you both experience issues, and some people raised issues about the IRRexplorer instance, a best guess is that RADB's NRTM streams are the common part here. I'll see if my own logs show anything on this. Also wondering if @schelcj has ideas, or whether the timings align with something? I do not know RADB's config, or whether they do anything unusual.

Current implementation

Let me run through how this is implemented so we all know precisely what messages mean and for my own overview. Some of these choices may not be the clearest, but were made for backwards compatibility.

On the client side, IRRD keeps a serial_newest_mirror which only relates to the mirror source. When running NRTM, it requests (serial_newest_mirror+1)-LAST, and logs the response. Received NRTM messages are processed by the NRTMStreamParser, which also updates serial_newest_seen to be the highest serial from the NRTM ADD/DEL operations in the response. This means that we expect NRTMv3 clients to request non-existing serials routinely, which are one beyond the actual range that the server has.

On the server side, responses are built by the NRTMGenerator, which takes a serial_start_requested and optional serial_end_requested. From the RPSLDatabaseStatus, we query serial_oldest_journal and serial_newest_journal to determine serial_start_available and serial_end_available. In our case, serial_end_requested would always be None, i.e. LAST. Then there's a few tests and responses that may apply:

  1. serial_start_requested < serial_start_available errors: Serials {serial_start_requested} - {serial_start_available} do not exist - you are querying older serials than are known. serial_end_available
  2. serial_start_requested == serial_end_available + = warns: there are no newer updates available - you are querying just beyond the range.
  3. serial_start_requested > serial_end_available errors: Serials {serial_end_available} - {serial_start_requested} do not exist - the serials you are querying are way newer than are known.

Log sample 1

2023-12-04 13:39:13,176 irrd[1188550]: [irrd.mirroring.mirror_runners_import#INFO] Retrieving NRTM updates for RADB from serial 255032 on nrtm.radb.net:43 2023-12-04 13:39:13,176 irrd[1188550]: [irrd.utils.whois_client#DEBUG] Running whois query -g RADB:3:255032-LAST on nrtm.radb.net port 43 2023-12-04 13:39:13,300 irrd[1188550]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for RADB: % Warning: there are no newer updates available

This is normal behaviour that suggests you are up to date.

Log sample 2

2023-12-04 13:54:14,507 irrd[1189565]: [irrd.mirroring.mirror_runners_import#INFO] Retrieving NRTM updates for RADB from serial 255032 on nrtm.radb.net:43 2023-12-04 13:54:14,508 irrd[1189565]: [irrd.utils.whois_client#DEBUG] Running whois query -g RADB:3:255032-LAST on nrtm.radb.net port 43 2023-12-04 13:54:14,622 irrd[1189565]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for RADB: %% ERROR: Serials 255032 - 268299 do not exist

This request and response are scenario 1: the remote journal starts at 268299, but you're asking for 255032. You are too far behind, it doesn't have the entries older than 268299. This is only recoverable with a full reload.

There are a few possible causes for this situation: your IRRD has been down for a while, the NRTMv3 server keeps only very short journals so the window is very narrow, or there has been a persistent issue before this which was preventing you from keeping up with NRTM. You'll have to dig more into your logs to figure out which, or perhaps RADB can tell us more about anything unusual on their end.

Other log samples

It is not possible to know what was happening there, as you need to know both the query and the response to figure out whether this is scenario 1 or 3.

teunvink commented 12 months ago

As one of the operators of IRRExplorer (which uses irrd as one of its backends) I can confirm that we've seen similar problems over the past months as well. We're currently running irrd 4.4.2.

dhutchin-ctl commented 12 months ago

Notice Log Sample 2 is exactly 15 minutes after Log Sample 1. There were no RADB updates on the client side between those two samples. So the first serial to request on the client side should not change. I don’t know where “268299” comes from.

I will make myself available for you when ever you want to see more directly into the LEVEL3 system. I know, that doesn’t give you access to the RADB system at the same time. And of course, we don’t know how to trigger the error. I’m writing a script to more closely monitor the log so I will know shortly after this condition pops its head again.

D

schelcj commented 12 months ago

We've seen the same internally and have been investigating as well. We saw the same serial constraint violation for the same serial as @dhutchin-ctl. Our nrtm.radb.net is fed via nrtm from the authoritative host and when this error hit any futher updates were blocked. A reseed resolved the issue but fairly certain we saw this once before after the v4 migration. We are also adding more monitoring and still investigating.

mxsasha commented 12 months ago

Notice Log Sample 2 is exactly 15 minutes after Log Sample 1. There were no RADB updates on the client side between those two samples. So the first serial to request on the client side should not change.

Ah right, that gives a bit more context. Using my earlier comment on NRTM responses and their precise meaning, we can narrow down.

We can assume the client keeps requesting 255032-LAST. At first, RADB replies:

2023-12-04 13:39:13,300 irrd[1188550]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for RADB: % Warning: there are no newer updates available

This is expected in a correctly running setup where you are up to date. This message means RADB has journal entries up to and including 255031, and you have those too in the mirror. All good.

15 minutes later, RADB replies:

2023-12-04 13:54:14,622 irrd[1189565]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for RADB: %% ERROR: Serials 255032 - 268299 do not exist

(The assumption that this is the same 255032-LAST query is consistent with the fact that the response mentions this serial. So we can be quite sure the same query was used both times.)

RADB now reports that its first serial is actually 268299 in the journal. That means it jumped 13267 serials in 15 minutes. 13K changes in 15 minutes is rare but not impossible, but what is strange here is that the journal did not just grow with at least 13K entries, the oldest 13K were also removed from the journal. It is impossible for clients to keep up in that case, or recover from that situation.

There are two scenarios I can think of:

troy2914 commented 12 months ago

if I am reading your two scenarios properly, those are both things to address on the RADB side?

schelcj commented 12 months ago

For the serials jumping we have a maintainer whose automated processes may have gone sideways and producing a lot of unnecessary updates. We are reaching out to them to address.

troy2914 commented 12 months ago

For the serials jumping we have a maintainer whose automated processes may have gone sideways and producing a lot of unnecessary updates. We are reaching out to them to address.

that sounds like a 4.4.3 request :) to add a limit/throttle.

schelcj commented 12 months ago

Back to the original issue; reviewing logs this morning shows that our nrtm.radb.net host retrieved, and applied, updates up to and shortly after 225031 then for unknown reasons tried to reapply that serial and ran into the unique key constraint preventing all further updates. We only had logging at INFO level so there isn't much more to go on at this time but have enabled DEBUG logging to review if this happens again.

2023-12-02 21:47:20,121 irrd[253050]: [irrd.mirroring.nrtm_operation#INFO] Completed NRTM operation RADB/255031/ADD/route6/2605:9CC0:397::/48AS8987
2023-12-02 21:47:20,122 irrd[253050]: [irrd.mirroring.nrtm_operation#INFO] Completed NRTM operation RADB/255032/ADD/route6/2605:9CC0:399::/48AS8987                                         
2023-12-02 21:47:20,123 irrd[253050]: [irrd.mirroring.nrtm_operation#INFO] Completed NRTM operation RADB/255033/ADD/route/130.137.120.0/24AS8987                                            
2023-12-02 21:47:20,135 irrd[253050]: [irrd.mirroring.mirror_runners_import#ERROR] An exception occurred while attempting a mirror update or initial import for RADB: (psycopg2.errors.Unique
DETAIL:  Key (serial_nrtm, source)=(255031, RADB) already exists.                                                                                                                           
mxsasha commented 12 months ago

Back to the original issue; reviewing logs this morning shows that our nrtm.radb.net host retrieved, and applied, updates up to and shortly after 225031 then for unknown reasons tried to reapply that serial and ran into the unique key constraint preventing all further updates. We only had logging at INFO level so there isn't much more to go on at this time but have enabled DEBUG logging to review if this happens again.

Ok, we can see what those logs produce, but the uniqueness in the database isn't something that should ever be hit really - it's there as a safeguard against data corruption. I also can't immediately think of a case where this can happen, but can not say more about this anyways without knowing which instances you are running, what their configs are, and whether there's any replication going on. Further logs will also help, e.g. I am wondering whether this instance has received 255031 over NRTMv3 before and is trying to insert it twice, or whether it came up with that serial on its own somehow (which it should not).

schelcj commented 12 months ago

Ok, we can see what those logs produce, but the uniqueness in the database isn't something that should ever be hit really - it's there as a safeguard against data corruption. I also can't immediately think of a case where this can happen, but can not say more about this anyways without knowing which instances you are running, what their configs are, and whether there's any replication going on.

Working getting further logs, there is no replication for the hosts in question.

Further logs will also help, e.g. I am wondering whether this instance has received 255031 over NRTMv3 before and is trying to insert it twice, or whether it came up with that serial on its own somehow (which it should not).

That appears to be what has happened. 255031 was received over nrtmv3, applied, subsequent updates were received and applied, then the update for 255031 was attempted again.

troy2914 commented 11 months ago

btw, this seems to have happened again on the 25th of December.

troy2914 commented 11 months ago

btw, this seems to have happened again on the 25th of December.

with serial number 530173

dhutchin-ctl commented 11 months ago

I think the problem started earlier than the 25th. Full disclosure... I do a irrd_mirror_force_reload on RADB once a week via cron. The failure coincides with that action. Coincidence? Hard to believe that. Here are some (maybe not all) pertinent entries from log file. Start at 12-23 16:52z, the time of the last appearingly normal NRTM update (we update every 15 minutes):

2023-12-23 16:52:08,351 irrd[2869178]: [irrd.mirroring.scheduler#DEBUG] Started new process RPSLMirrorImportUpdateRunner-RADB for mirror import/export for RADB 2023-12-23 16:52:08,413 irrd[1995035]: [irrd.mirroring.mirror_runners_import#DEBUG] Most recent mirrored serial for RADB: 530173, force_reload: False, nrtm enabled: True 2023-12-23 16:52:08,413 irrd[1995035]: [irrd.mirroring.mirror_runners_import#INFO] Retrieving NRTM updates for RADB from serial 530174 on nrtm.radb.net:43 2023-12-23 16:52:08,413 irrd[1995035]: [irrd.utils.whois_client#DEBUG] Running whois query -g RADB:3:530174-LAST on nrtm.radb.net port 43 2023-12-23 16:52:08,526 irrd[1995035]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for RADB: % Warning: there are no newer updates available

So at that time, my interpretation is that there are no updates newer than serial 530173. At 12-23 17:00z I ran irrd_mirror_force_reload: 2023-12-23 17:00:03,417 irrd[1995709]: [irrd.storage.database_handler#INFO] force_reload flag set for RADB, serial synchronisation based on current settings: False

At the next NRTM interval (17:07), I see that the current serial has jumped to 535978, but the import via FTP seems to go OK: 2023-12-23 17:07:09,721 irrd[2869178]: [irrd.mirroring.scheduler#DEBUG] Started new process RPSLMirrorImportUpdateRunner-RADB for mirror import/export for RADB 2023-12-23 17:07:09,762 irrd[1996872]: [irrd.mirroring.mirror_runners_import#DEBUG] Most recent mirrored serial for RADB: 530173, force_reload: True, nrtm enabled: True 2023-12-23 17:07:09,763 irrd[1996872]: [irrd.mirroring.mirror_runners_import#INFO] Running full RPSL import of RADB from ['ftp://ftp.radb.net/radb/dbase/radb.db.gz'], serial from ftp://ftp.radb.net/radb/dbase/RADB.CURRENTSERIAL 2023-12-23 17:07:10,130 irrd[1996872]: [irrd.mirroring.mirror_runners_import#INFO] Downloaded ftp://ftp.radb.net/radb/dbase/RADB.CURRENTSERIAL, contained 535978 2023-12-23 17:07:17,123 irrd[1996872]: [irrd.mirroring.parsers#DEBUG] Starting file import of RADB from /app/irrd/tmp/tmprj8ayrgz 2023-12-23 17:21:57,939 irrd[1996872]: [irrd.mirroring.parsers#INFO] File import for RADB: 1390207 objects read, 1390207 objects inserted, ignored 0 due to errors, ignored 0 due to object_class_filter, source /app/irrd/tmp/tmprj8ayrgz

Then at the next NRTM interval (which is only 13 seconds later) is when things start to go haywire (hmmm, where did that expression come from?): 2023-12-23 17:22:11,109 irrd[2869178]: [irrd.mirroring.scheduler#DEBUG] Started new process RPSLMirrorImportUpdateRunner-RADB for mirror import/export for RADB 2023-12-23 17:22:11,151 irrd[2001141]: [irrd.mirroring.mirror_runners_import#DEBUG] Most recent mirrored serial for RADB: 535978, force_reload: False, nrtm enabled: True 2023-12-23 17:22:11,151 irrd[2001141]: [irrd.mirroring.mirror_runners_import#INFO] Retrieving NRTM updates for RADB from serial 535979 on nrtm.radb.net:43 2023-12-23 17:22:11,151 irrd[2001141]: [irrd.utils.whois_client#DEBUG] Running whois query -g RADB:3:535979-LAST on nrtm.radb.net port 43 2023-12-23 17:22:11,261 irrd[2001141]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for RADB: %% ERROR: Serials 530173 - 535979 do not exist

And things stay like that until 12-25 12:56z when the only change is that the range of serials reported in the RADB response to our query (same query as above) is "535979 - 543646". And that continues every 15 minutes until 12-27 15:26 when I did another irrd_mirror_force_reload: **2023-12-27 15:15:26,375 irrd[2869178]: [irrd.mirroring.scheduler#DEBUG] Started new process RPSLMirrorImportUpdateRunner-RADB for mirror import/export for RADB 2023-12-27 15:15:26,414 irrd[1574475]: [irrd.mirroring.mirror_runners_import#DEBUG] Most recent mirrored serial for RADB: 535978, force_reload: False, nrtm enabled: True 2023-12-27 15:15:26,415 irrd[1574475]: [irrd.mirroring.mirror_runners_import#INFO] Retrieving NRTM updates for RADB from serial 535979 on nrtm.radb.net:43 2023-12-27 15:15:26,415 irrd[1574475]: [irrd.utils.whois_client#DEBUG] Running whois query -g RADB:3:535979-LAST on nrtm.radb.net port 43 2023-12-27 15:15:26,543 irrd[1574475]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for RADB: %% ERROR: Serials 535979 - 543646 do not exist

2023-12-27 15:26:40,040 irrd[1578473]: [irrd.storage.database_handler#INFO] force_reload flag set for RADB, serial synchronisation based on current settings: False

2023-12-27 15:30:27,663 irrd[2869178]: [irrd.mirroring.scheduler#DEBUG] Started new process RPSLMirrorImportUpdateRunner-RADB for mirror import/export for RADB 2023-12-27 15:30:27,704 irrd[1578781]: [irrd.mirroring.mirror_runners_import#DEBUG] Most recent mirrored serial for RADB: 535978, force_reload: True, nrtm enabled: True 2023-12-27 15:30:27,704 irrd[1578781]: [irrd.mirroring.mirror_runners_import#INFO] Running full RPSL import of RADB from ['ftp://ftp.radb.net/radb/dbase/radb.db.gz'], serial from ftp://ftp.radb.net/radb/dbase/RADB.CURRENTSERIAL 2023-12-27 15:30:28,082 irrd[1578781]: [irrd.mirroring.mirror_runners_import#INFO] Downloaded ftp://ftp.radb.net/radb/dbase/RADB.CURRENTSERIAL, contained 559866 2023-12-27 15:30:35,742 irrd[1578781]: [irrd.mirroring.parsers#DEBUG] Starting file import of RADB from /app/irrd/tmp/tmp2oew3wc6 2023-12-27 15:45:45,384 irrd[1578781]: [irrd.mirroring.parsers#INFO] File import for RADB: 1391073 objects read, 1391073 objects inserted, ignored 0 due to errors, ignored 0 due to object_class_filter, source /app/irrd/tmp/tmp2oew3wc6**

Now everything is looking reasonable: **2023-12-27 15:45:59,206 irrd[2869178]: [irrd.mirroring.scheduler#DEBUG] Started new process RPSLMirrorImportUpdateRunner-RADB for mirror import/export for RADB 2023-12-27 15:45:59,246 irrd[1580419]: [irrd.mirroring.mirror_runners_import#DEBUG] Most recent mirrored serial for RADB: 559866, force_reload: False, nrtm enabled: True 2023-12-27 15:45:59,247 irrd[1580419]: [irrd.mirroring.mirror_runners_import#INFO] Retrieving NRTM updates for RADB from serial 559867 on nrtm.radb.net:43 2023-12-27 15:45:59,247 irrd[1580419]: [irrd.utils.whois_client#DEBUG] Running whois query -g RADB:3:559867-LAST on nrtm.radb.net port 43 2023-12-27 15:45:59,814 irrd[1580419]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for RADB: %START Version: 3 RADB 559867-568497 2023-12-27 15:45:59,833 irrd[1580419]: [irrd.mirroring.parsers#DEBUG] Found valid start line for RADB, range 559867-568497

2023-12-27 16:01:00,553 irrd[2869178]: [irrd.mirroring.scheduler#DEBUG] Started new process RPSLMirrorImportUpdateRunner-RADB for mirror import/export for RADB 2023-12-27 16:01:00,592 irrd[1583183]: [irrd.mirroring.mirror_runners_import#DEBUG] Most recent mirrored serial for RADB: 568497, force_reload: False, nrtm enabled: True 2023-12-27 16:01:00,592 irrd[1583183]: [irrd.mirroring.mirror_runners_import#INFO] Retrieving NRTM updates for RADB from serial 568498 on nrtm.radb.net:43 2023-12-27 16:01:00,592 irrd[1583183]: [irrd.utils.whois_client#DEBUG] Running whois query -g RADB:3:568498-LAST on nrtm.radb.net port 43 2023-12-27 16:01:00,748 irrd[1583183]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for RADB: %START Version: 3 RADB 568498-568572 2023-12-27 16:01:00,748 irrd[1583183]: [irrd.mirroring.parsers#DEBUG] Found valid start line for RADB, range 568498-568572

2023-12-27 16:16:02,170 irrd[2869178]: [irrd.mirroring.scheduler#DEBUG] Started new process RPSLMirrorImportUpdateRunner-RADB for mirror import/export for RADB 2023-12-27 16:16:02,210 irrd[1584674]: [irrd.mirroring.mirror_runners_import#DEBUG] Most recent mirrored serial for RADB: 568572, force_reload: False, nrtm enabled: True 2023-12-27 16:16:02,211 irrd[1584674]: [irrd.mirroring.mirror_runners_import#INFO] Retrieving NRTM updates for RADB from serial 568573 on nrtm.radb.net:43 2023-12-27 16:16:02,211 irrd[1584674]: [irrd.utils.whois_client#DEBUG] Running whois query -g RADB:3:568573-LAST on nrtm.radb.net port 43 2023-12-27 16:16:02,387 irrd[1584674]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for RADB: %START Version: 3 RADB 568573-568667 2023-12-27 16:16:02,388 irrd[1584674]: [irrd.mirroring.parsers#DEBUG] Found valid start line for RADB, range 568573-568667**

Have I left anything out of this report that might be in our logs? Please let me know. RADB folks please chime in here.

troy2914 commented 11 months ago

we did not do a force reload until about 20utc on the 25th. And had our noc contact db-admin@radb.net

schelcj commented 11 months ago

We saw the same sql constraint violation but being the holidays I haven't dug into the logs yet. A reseed was required internally again.

schelcj commented 9 months ago

Reviewing a recent incident of this constraint violation and noticed that this exception is thrown by an ADD that is going through the delete code path. The delete path flushes the _rpsl_upsert_buffer first. Would this mean that the add was handled previously during another flush but the buffer was not cleared before this delete happened? These operations are for and DEL and then a ADD for the same prefix in a very short time window (within the same second sometimes).

2024-02-19 00:01:18,878 irrd[2309452]: [irrd.mirroring.nrtm_operation#INFO] Completed NRTM operation RADB/1176341/ADD/route6/2605:9CC0:397::/48AS8987
2024-02-19 00:01:18,879 irrd[2309452]: [irrd.mirroring.nrtm_operation#INFO] Completed NRTM operation RADB/1176342/ADD/route6/2605:9CC0:399::/48AS8987
2024-02-19 00:01:18,891 irrd[2309452]: [irrd.mirroring.mirror_runners_import#ERROR] An exception occurred while attempting a mirror update or initial import for RADB: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "rpsl_objects_history_serial_nrtm_source_unique"
DETAIL:  Key (serial_nrtm, source)=(1176341, RADB) already exists.

[SQL: INSERT INTO rpsl_database_journal (serial_global, rpsl_pk, source, origin, serial_nrtm, operation, object_class, object_text, timestamp) VALUES (nextval('rpsl_database_journal_serial_global_seq'), %(rpsl_pk)s, %(source)s, %(origin)s, %(serial_nrtm)s, %(operation)s, %(object_class)s, %(object_text)s, %(timestamp)s) RETURNING rpsl_database_journal.serial_nrtm]
[parameters: {'rpsl_pk': '2605:9CC0:397::/48AS8987', 'source': 'RADB', 'origin': 'mirror', 'serial_nrtm': 1176341, 'operation': 'add_or_update', 'object_class': 'route6', 'object_text': 'route6:         2605:9cc0:397::/48\norigin:         AS8987\ndescr:          Amazon EC2 Prefix\nmnt-by:         MAINT-AS16509\nsource:         radb\nchanged:        noc@amazon.com 20240218 #18:43:49\nlast-modified:  2024-02-18T18:43:50Z\n', 'timestamp': datetime.datetime(2024, 2, 19, 0, 1, 18, 889869, tzinfo=datetime.timezone.utc)}]
(Background on this error at: http://sqlalche.me/e/13/gkpj)
Traceback (most recent call last):
  File "/home/irrd/srv/releases/4.4.2/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
    self.dialect.do_execute(
  File "/home/irrd/srv/releases/4.4.2/venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "rpsl_objects_history_serial_nrtm_source_unique"
DETAIL:  Key (serial_nrtm, source)=(1176341, RADB) already exists.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/irrd/srv/releases/4.4.2/venv/lib/python3.10/site-packages/irrd/mirroring/mirror_runners_import.py", line 66, in run
    self.update_stream_runner.run(serial_newest_mirror, database_handler=self.database_handler)
  File "/home/irrd/srv/releases/4.4.2/venv/lib/python3.10/site-packages/irrd/mirroring/mirror_runners_import.py", line 442, in run
    operation.save(database_handler)
  File "/home/irrd/srv/releases/4.4.2/venv/lib/python3.10/site-packages/irrd/mirroring/nrtm_operation.py", line 96, in save
    database_handler.delete_rpsl_object(
  File "/home/irrd/srv/releases/4.4.2/venv/lib/python3.10/site-packages/irrd/storage/database_handler.py", line 577, in delete_rpsl_object
    self._flush_rpsl_object_writing_buffer()
  File "/home/irrd/srv/releases/4.4.2/venv/lib/python3.10/site-packages/irrd/storage/database_handler.py", line 739, in _flush_rpsl_object_writing_buffer
    self.status_tracker.record_operation(
  File "/home/irrd/srv/releases/4.4.2/venv/lib/python3.10/site-packages/irrd/storage/database_handler.py", line 1043, in record_operation
    insert_result = self.database_handler.execute_statement(stmt).fetchone()
  File "/home/irrd/srv/releases/4.4.2/venv/lib/python3.10/site-packages/irrd/storage/database_handler.py", line 239, in execute_statement
    return self._connection.execute(statement)
  File "/home/irrd/srv/releases/4.4.2/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1011, in execute
    return meth(self, multiparams, params)
  File "/home/irrd/srv/releases/4.4.2/venv/lib/python3.10/site-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/home/irrd/srv/releases/4.4.2/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1124, in _execute_clauseelement
    ret = self._execute_context(
  File "/home/irrd/srv/releases/4.4.2/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1316, in _execute_context
    self._handle_dbapi_exception(
  File "/home/irrd/srv/releases/4.4.2/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1510, in _handle_dbapi_exception
    util.raise_(
  File "/home/irrd/srv/releases/4.4.2/venv/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/home/irrd/srv/releases/4.4.2/venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
    self.dialect.do_execute(
  File "/home/irrd/srv/releases/4.4.2/venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 608, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "rpsl_objects_history_serial_nrtm_source_unique"
DETAIL:  Key (serial_nrtm, source)=(1176341, RADB) already exists.

Snippet of logs of operations on the same prefix. why they are doing this I don't know.

2024-02-19 07:04:37,850 irrd[2394162]: [irrd.mirroring.nrtm_operation#INFO] Completed NRTM operation RADB/1180490/ADD/route6/2605:9CC0:397::/48AS8987
2024-02-19 07:04:37,872 irrd[2394162]: [irrd.mirroring.nrtm_operation#INFO] Completed NRTM operation RADB/1180495/DEL/route6/2605:9CC0:397::/48AS8987
2024-02-19 07:04:37,874 irrd[2394162]: [irrd.mirroring.nrtm_operation#INFO] Completed NRTM operation RADB/1180498/ADD/route6/2605:9CC0:397::/48AS8987
2024-02-19 07:04:37,886 irrd[2394162]: [irrd.mirroring.nrtm_operation#INFO] Completed NRTM operation RADB/1180501/DEL/route6/2605:9CC0:397::/48AS8987
2024-02-19 07:06:38,470 irrd[2394379]: [irrd.mirroring.nrtm_operation#INFO] Completed NRTM operation RADB/1180507/ADD/route6/2605:9CC0:397::/48AS8987
2024-02-19 07:07:38,649 irrd[2394426]: [irrd.mirroring.nrtm_operation#INFO] Completed NRTM operation RADB/1180513/DEL/route6/2605:9CC0:397::/48AS8987
2024-02-19 07:08:39,302 irrd[2394464]: [irrd.mirroring.nrtm_operation#INFO] Completed NRTM operation RADB/1180743/ADD/route6/2605:9CC0:397::/48AS8987
2024-02-19 07:08:39,307 irrd[2394464]: [irrd.mirroring.nrtm_operation#INFO] Completed NRTM operation RADB/1180744/DEL/route6/2605:9CC0:397::/48AS8987
2024-02-19 07:09:39,113 irrd[2394507]: [irrd.mirroring.nrtm_operation#INFO] Completed NRTM operation RADB/1180745/ADD/route6/2605:9CC0:397::/48AS8987
2024-02-19 07:09:39,131 irrd[2394507]: [irrd.mirroring.nrtm_operation#INFO] Completed NRTM operation RADB/1180748/DEL/route6/2605:9CC0:397::/48AS8987
mxsasha commented 9 months ago

This is very useful data, and the issue is then probably somewhere in the handling of the upsert buffer or the insert/delete queries. I will dig into the code.

mxsasha commented 9 months ago

So, the duplicate key error here concerns the journal, where, as a safeguard, serial and source are unique. NRTMv3 exports would otherwise break. The serial_nrtm has three possible sources:

The serial actually inserted by the DB is then added to the know new serials, for use for the next journal entry created and some metadata. This is not thread/multiprocess safe, but there should only be one process running to mirror one source, as ensured by the mirroring scheduler.

Questions:

schelcj commented 9 months ago
  • What is the configuration for the RADB source on the instance that generated this error? I'm assuming it's an NRTMv3 client from another host?
        RADB:
            authoritative: false
            keep_journal: true
            import_serial_source: "file:///<path>/radb/dbase/RADB.CURRENTSERIAL"
            import_source: "file:///<path>/radb/dbase/radb.db.gz"
            import_timer: 60
            nrtm_host: "<auth_db_ip>"
            nrtm_port: "4343"
            nrtm_access_list: nrtm_public
  • What is the output of !JRADB on this instance? Particularly useful to know if it is running with synchronised serials.

rpki is disabled, no scope filters, and route object prefs disabled on this host.

$ echo '!JRADB' | nc localhost 4343
A526
{
    "RADB": {
        "source_type": "regular",
        "authoritative": false,
        "object_class_filter": null,
        "rpki_rov_filter": false,
        "scopefilter_enabled": false,
        "route_preference": null,
        "local_journal_kept": true,
        "serial_oldest_journal": 1246157,
        "serial_newest_journal": 1256489,
        "serial_last_export": null,
        "serial_newest_mirror": 1256489,
        "last_update": "2024-02-27T21:25:00.799334+00:00",
        "synchronised_serials": true
    }
}
C
  • What are the log entries of this instance for any entry mentioning 1176330-1176350?
  • If synchronised serials are enabled, what are the log entries of the NRTMv3 source for those serials, if any?

The log level was reset to INFO so the logs are not particularly interesting. I've reset the level to DEBUG to catch the next time. If you still want the logs I'll create a gist and send it.

  • Is this a single instance that logged this exact error, or do you have other mirror clients that produced the same error?

Have only seen this error on our host that serves nrtm.radb.net.

mirceaulinic commented 3 months ago

I'm on IRRd 4.4.3 and seeing something very similar - I hope this is the right issue to report this to.

There are a few sources (not just RADB) that sometimes are out of sync and I need to kick a manual irrd_mirror_force_reload in order to force an update. When I noticed this for the first time, they haven't been updated in many months. I'm noticing this with RADB, APNIC, NTTCOM, and sometimes LACNIC.

Status:

IRRD version 4.4.3

--------------------------------------------------------------------------
 source    total obj     rt obj    aut-num obj      serial    last export 
--------------------------------------------------------------------------               
 APNIC       2726850     744889          22412        None                             
 NTTCOM       372257     363477            568        None                
 RADB        1395174    1170400           9759        None  

Samples:

!JRADB
A520
{
    "RADB": {
        "source_type": "regular",
        "authoritative": false,
        "object_class_filter": null,
        "rpki_rov_filter": false,
        "scopefilter_enabled": false,
        "route_preference": null,
        "local_journal_kept": true,
        "serial_oldest_journal": null,
        "serial_newest_journal": null,
        "serial_last_export": null,
        "serial_newest_mirror": 2052890,
        "last_update": "2024-08-19T16:52:36.911378+00:00",
        "synchronised_serials": true
    }
}
C
!JNTTCOM
A522
{
    "NTTCOM": {
        "source_type": "regular",
        "authoritative": false,
        "object_class_filter": null,
        "rpki_rov_filter": false,
        "scopefilter_enabled": false,
        "route_preference": null,
        "local_journal_kept": true,
        "serial_oldest_journal": null,
        "serial_newest_journal": null,
        "serial_last_export": null,
        "serial_newest_mirror": 1988580,
        "last_update": "2024-08-19T16:26:12.047086+00:00",
        "synchronised_serials": true
    }
}

The last_update field practically shows the time when the refresh was executed after running irrd_mirror_force_reload, whereas other sources have a more recent timestamp, while serial_oldest_journal and serial_newest_journal have some actual numbers.

I'm not seeing any error, in fact I do see the NRTM updates:

2024-08-20 08:26:03,669 irrd[681]: [irrd.mirroring.mirror_runners_import#INFO] Retrieving NRTM updates for LACNIC from serial 428757 on irr.lacnic.net:43
2024-08-20 08:26:03,671 irrd[680]: [irrd.mirroring.mirror_runners_import#INFO] Retrieving NRTM updates for APNIC from serial 12493555 on whois.apnic.net:43
2024-08-20 08:26:03,672 irrd[682]: [irrd.mirroring.mirror_runners_import#INFO] Retrieving NRTM updates for NTTCOM from serial 1988577 on rr.ntt.net:43

But they just don't seem to trigger any update, I suspect it might be caused by the serial_oldest_journal and serial_newest_journal being nil in the database:

irrd=# select serial_oldest_journal,serial_newest_journal from database_status where source='APNIC';
 serial_oldest_journal | serial_newest_journal 
-----------------------+-----------------------
                       |                      
(1 row)
mxsasha commented 3 months ago

I'm on IRRd 4.4.3 and seeing something very similar - I hope this is the right issue to report this to.

As a client, your local serial_newest/oldest_journal are not too relevant, the important field is serial_newest_mirror which identifies the latest known serial from the mirror server, i.e. NRTM will next start from that serial + 1. Right now, NTT is on 1988846 (per !JNTTCOM on rr.ntt.net), so your numbers are in the right ballpark. If I query -g NTTCOM:3:1988577-LAST from rr.ntt.net, I get a reasonable number of updates.

You are the first person to report this with other mirror servers than RADB. Perhaps the issue is wider, perhaps there is a different issue involved. The first step is to set the log level to DEBUG, and to provide me all logs relating to the affected sources starting from the forced reload, when this reoccurs.

schelcj commented 3 months ago

We are still hitting up against duplicate key value violates unique constraint "rpsl_objects_history_serial_nrtm_source_unique" errors. Much less frequently now after resolving issues with a maintainer but we still see them. @mxsasha did you ever find anything on this error?