sul-dlss / FOLIO-Project-Stanford

Task management for Stanford’s analysis of FOLIO.
2 stars 0 forks source link

track down missing SRS records in MetaDB #603

Closed shelleydoljack closed 7 months ago

shelleydoljack commented 8 months ago

see slack thread: https://stanfordlib.slack.com/archives/C03MB1KHR89/p1705960705553329

shelleydoljack commented 8 months ago

@ahafele for the first instance HRID in the Book2.csv file shared in slack, I did this:

Now I'm not sure which record version to take. There are __current and state columns. Interestingly, there are many source records with ACTUAL state. If I take the value of the id column for one of them and query okapi I get some data: /source-storage/records/34790208-7c5a-49b0-8c5c-9af6e721d92d.

If I adjust the metadb query to external_hrid = 'in00000047302' and "__current", all of them are state = OLD. Changing the query again to external_hrid = 'in00000047302' and state = 'ACTUAL' and take id of the latest updated date and query okapi: /source-storage/records/0ba61f07-33ff-4315-b10e-d5857c439459 I get a record with state = OLD. I cannot seem to query that endpoint by externalIdsHolder.InstanceId== but I can do it by snapshotId (which we can get from metadb). snapshotId 3d586f6f-ad86-462c-8037-dbc8f257487d (the latest updated date in metadb table) returns 2 OLD SRS records. So anyways, I'm just back to "not sure which SRS records is the latest". And this is just for the first in that spreadsheet.

shelleydoljack commented 8 months ago

I could try to just use what metadb says is the latest updated record and somehow connect it to the instance record and see if that works for EAL. If so, we could just use that as a general rule to fix these. :shrug:

ahafele commented 8 months ago

I think that approach makes sense. Based on what you said these records don't fall into the records described here - https://folio-org.atlassian.net/wiki/x/uj8V Wondering if it is possible there are other SRS out there that have the instance hrid in the 999$i and not the external_hrid ?

shelleydoljack commented 8 months ago

I'm not sure. At least for this first one, instance Id b0dafd3b-5e31-47bc-8d29-591520331587 is not found when doing a GET to /source-storage/source-records?instanceId=b0dafd3b-5e31-47bc-8d29-591520331587 or a GET to /source-storage/records/b0dafd3b-5e31-47bc-8d29-591520331587/formatted?idType=INSTANCE but it is in the field:

"externalIdsHolder":{
  "instanceId": "b0dafd3b-5e31-47bc-8d29-591520331587",
  "instanceHrid": "in00000047302"
}

when you do a GET to /source-storage/records/{sourceRecordId gotten from records_lb table in metadb}.

shelleydoljack commented 8 months ago

I logged into the db and did a query on that table: select * from sul_mod_source_record_storage.records_lb where external_id = 'b0dafd3b-5e31-47bc-8d29-591520331587' order by updated_date DESC;

I get 7 rows as a result.

                  id                  |             snapshot_id              |              matched_id              | generation | record_type |             external_id              | state | leader_record_status | order | suppress_discovery |          created_by_user_id          |        created_date        |          updated_by_user_id          |         updated_date          | external_hrid 
--------------------------------------+--------------------------------------+--------------------------------------+------------+-------------+--------------------------------------+-------+----------------------+-------+--------------------+--------------------------------------+----------------------------+--------------------------------------+-------------------------------+---------------
 34790208-7c5a-49b0-8c5c-9af6e721d92d | 3d586f6f-ad86-462c-8037-dbc8f257487d | 02b500bc-8395-42df-967a-3a870a1b7215 |          2 | MARC_BIB    | b0dafd3b-5e31-47bc-8d29-591520331587 | OLD   | c                    |     0 | f                  | 3dc624cc-3bf1-4495-9964-41b0d59d5c4b | 2024-01-16 18:29:21.124-08 | 3dc624cc-3bf1-4495-9964-41b0d59d5c4b | 2024-01-21 06:40:30.009146-08 | in00000047302
 5b554e74-9db9-4cbe-90e3-e4811c5f253a | ad50bbe6-86ff-48d9-b2bc-323da9aeadd4 | 02b500bc-8395-42df-967a-3a870a1b7215 |          3 | MARC_BIB    | b0dafd3b-5e31-47bc-8d29-591520331587 | OLD   | c                    |     0 | f                  | 3dc624cc-3bf1-4495-9964-41b0d59d5c4b | 2024-01-16 18:31:04.931-08 | 3dc624cc-3bf1-4495-9964-41b0d59d5c4b | 2024-01-21 06:40:30.00888-08  | in00000047302
 0ba61f07-33ff-4315-b10e-d5857c439459 | 3d586f6f-ad86-462c-8037-dbc8f257487d | 02b500bc-8395-42df-967a-3a870a1b7215 |          6 | MARC_BIB    | b0dafd3b-5e31-47bc-8d29-591520331587 | OLD   | c                    |     0 | f                  | 3dc624cc-3bf1-4495-9964-41b0d59d5c4b | 2024-01-21 06:25:34.806-08 | 3dc624cc-3bf1-4495-9964-41b0d59d5c4b | 2024-01-21 06:40:30.007377-08 | in00000047302
 70c1bdb8-93df-40f0-81df-2246fc32afa4 | 691b5adc-6196-4aa6-bdfb-024e9bca16c4 | 02b500bc-8395-42df-967a-3a870a1b7215 |          1 | MARC_BIB    | b0dafd3b-5e31-47bc-8d29-591520331587 | OLD   | c                    |     0 | f                  | 3dc624cc-3bf1-4495-9964-41b0d59d5c4b | 2024-01-16 18:28:58.241-08 | 3dc624cc-3bf1-4495-9964-41b0d59d5c4b | 2024-01-21 06:40:30.006096-08 | in00000047302
 36f40463-2cbb-4603-9c0a-ba37225cc926 | 96a0d6f4-77e4-40d1-9ad8-74e7034379a7 | 02b500bc-8395-42df-967a-3a870a1b7215 |          4 | MARC_BIB    | b0dafd3b-5e31-47bc-8d29-591520331587 | OLD   | c                    |     0 | f                  | 3dc624cc-3bf1-4495-9964-41b0d59d5c4b | 2024-01-17 11:24:12.515-08 | 3dc624cc-3bf1-4495-9964-41b0d59d5c4b | 2024-01-21 06:40:30.00609-08  | in00000047302
 9f5e65f0-3770-428a-878b-bca2a435af4e | 860012f1-40d8-4726-b9b3-6fe1e033eb4c | 02b500bc-8395-42df-967a-3a870a1b7215 |          5 | MARC_BIB    | b0dafd3b-5e31-47bc-8d29-591520331587 | OLD   | c                    |     0 | f                  | 3dc624cc-3bf1-4495-9964-41b0d59d5c4b | 2024-01-17 12:38:32.597-08 | 3dc624cc-3bf1-4495-9964-41b0d59d5c4b | 2024-01-21 06:40:30.005938-08 | in00000047302
 02b500bc-8395-42df-967a-3a870a1b7215 | 8c2f3757-48cf-470a-b236-3160cea2aeaf | 02b500bc-8395-42df-967a-3a870a1b7215 |          0 | MARC_BIB    | b0dafd3b-5e31-47bc-8d29-591520331587 | OLD   | c                    |     0 | f                  | 3dc624cc-3bf1-4495-9964-41b0d59d5c4b | 2024-01-16 18:25:21.527-08 | 3dc624cc-3bf1-4495-9964-41b0d59d5c4b | 2024-01-21 06:40:12.006189-08 | in00000047302

(I'm not sure how to read the timestamps for the order by; it doesn't look right.)

shelleydoljack commented 8 months ago

Also, I'm not sure the state field means anything. For instance, just looking at some source records from raw_records_lb table and then seeing what the endpoint /source-storage/records shows for them, a lot have state "OLD" but AFAICT it is the current source record for a record (it's in the 999 $f for it in inventory view). :shrug:

shelleydoljack commented 8 months ago

I'm trying to POST the supposed latest SRS for hrid in00000047302 and I keep getting an error about bad request. @jermnelson is also looking into it and we haven't figured it out yet. @ahafele the json I'm posting is in this file in00000047302_json.txt

Is it okay for our records to have a 100 with subfield 1? It is a supposedly valid field but I wonder in the folio context...?

ahafele commented 8 months ago

100 $1 should be fine in FOLIO and I'm not seeing any discussion of there being a current problem with this subfield. I'm adding relevant slack convos to https://docs.google.com/document/d/1_O1_Hx44q7o9Qmm_HvF7DNugAcDzUMZ8GYIEuT2v0zI/edit. I haven't found anything concrete but some discussion of similar issues. Should we ask the community or Index Data about this?

shelleydoljack commented 7 months ago

I'm going to see if I can use the change-manager endpoint to update the record.

shelleydoljack commented 7 months ago

I tried a PUT to /change-manager/parsedRecords/02b500bc-8395-42df-967a-3a870a1b7215 and it returned 202 accepted but looking at the logs for mod-srs:

2024-02-09T11:42:27-07:00 18:42:27.457 [vert.x-worker-thread-13] [] ERROR uickMarcKafkaHandler Failed to handle QM_RECORD_UPDATED event
2024-02-09T11:42:27-07:00 javax.ws.rs.NotFoundException: Record with id '34790208-7c5a-49b0-8c5c-9af6e721d92d' was not found

This is the json I tried PUTting: in00000047302_parsedRecord_json.txt

But if I do select * from sul_mod_source_record_storage.marc_records_lb where id = '34790208-7c5a-49b0-8c5c-9af6e721d92d'; I get the record. Also, if I do a select * from sul_mod_source_record_storage.raw_records_lb where id = '34790208-7c5a-49b0-8c5c-9af6e721d92d'; I get the record. I guess I'll just start a thread about it in folio slack about what to do next.

shelleydoljack commented 7 months ago

What is also kind of curious is that records_lb has numbers 0-6 in the generation column for this record but the instance shows "_version" : "8". Should the records_lb have a row with generation number 7?

shelleydoljack commented 7 months ago

Ahh, ☝️ it's probably unrelated. I looked at a "normal" record and the instance's version is 4 but the "latest" generation in the source record is 1.

shelleydoljack commented 7 months ago

I tried running this query (from the wiki page) and got 0 rows, so it's not the case (at least for this one example) that there is an external_id column data missing.

SELECT records_lb.id AS record_id, records_lb.external_id, record_instance_ids.instance_id
FROM sul_mod_source_record_storage.records_lb
JOIN (
    SELECT id, ids_subfields ->> 'i' AS instance_id
    FROM (
        SELECT id, fields
        FROM sul_mod_source_record_storage.marc_records_lb, jsonb_array_elements(content -> 'fields') as fields
        WHERE fields ? '999'
    ) AS ids_fields, jsonb_array_elements(fields -> '999' -> 'subfields') AS ids_subfields
    WHERE ids_subfields ? 'i'
) AS record_instance_ids ON records_lb.id = record_instance_ids.id
WHERE external_id IS NULL;
shelleydoljack commented 7 months ago

I started a Slack thread in FOLIO slack. https://folio-project.slack.com/archives/C2BHMHCFJ/p1707505894186479

shelleydoljack commented 7 months ago

I modified the query from the wiki page and it looks like there are 7 SRS with instance ID's that match external_id from the records_lb table:

SELECT records_lb.id AS record_id, records_lb.external_id, record_instance_ids.instance_id
FROM sul_mod_source_record_storage.records_lb
JOIN (
    SELECT id, ids_subfields ->> 'i' AS instance_id
    FROM (
        SELECT id, fields
        FROM sul_mod_source_record_storage.marc_records_lb, jsonb_array_elements(content -> 'fields') as fields
        WHERE fields ? '999'
    ) AS ids_fields, jsonb_array_elements(fields -> '999' -> 'subfields') AS ids_subfields
    WHERE ids_subfields ? 'i'
) AS record_instance_ids ON records_lb.id = record_instance_ids.id
WHERE external_id = 'b0dafd3b-5e31-47bc-8d29-591520331587';
              record_id               |             external_id              |             instance_id              
--------------------------------------+--------------------------------------+--------------------------------------
 36f40463-2cbb-4603-9c0a-ba37225cc926 | b0dafd3b-5e31-47bc-8d29-591520331587 | b0dafd3b-5e31-47bc-8d29-591520331587
 02b500bc-8395-42df-967a-3a870a1b7215 | b0dafd3b-5e31-47bc-8d29-591520331587 | b0dafd3b-5e31-47bc-8d29-591520331587
 70c1bdb8-93df-40f0-81df-2246fc32afa4 | b0dafd3b-5e31-47bc-8d29-591520331587 | b0dafd3b-5e31-47bc-8d29-591520331587
 34790208-7c5a-49b0-8c5c-9af6e721d92d | b0dafd3b-5e31-47bc-8d29-591520331587 | b0dafd3b-5e31-47bc-8d29-591520331587
 5b554e74-9db9-4cbe-90e3-e4811c5f253a | b0dafd3b-5e31-47bc-8d29-591520331587 | b0dafd3b-5e31-47bc-8d29-591520331587
 9f5e65f0-3770-428a-878b-bca2a435af4e | b0dafd3b-5e31-47bc-8d29-591520331587 | b0dafd3b-5e31-47bc-8d29-591520331587
 0ba61f07-33ff-4315-b10e-d5857c439459 | b0dafd3b-5e31-47bc-8d29-591520331587 | b0dafd3b-5e31-47bc-8d29-591520331587

It seems the problem is that none of the rows in records_lb has a state=ACTUAL.

shelleydoljack commented 7 months ago

Updating state to "ACTUAL" for the highest generation number in records_lb seemed to fix the problem. I will try updating the rest from the Book2.csv file shared in Slack. What other records need fixing?


update sul_mod_source_record_storage.records_lb
set state = 'ACTUAL'
where id = '0ba61f07-33ff-4315-b10e-d5857c439459';
shelleydoljack commented 7 months ago

hrid's in00000047503 and in00000047623 already have state = ACTUAL and it seems they're not broken in FOLIO.

shelleydoljack commented 7 months ago
select id, generation, state, external_hrid 
from sul_mod_source_record_storage.records_lb
where external_hrid in ('in00000047658', 'in00000047663', 'in00000047666', 'in00000047872', 'in00000047952', 'in00000048015', 'in00000048022', 'in00000048030', 'in00000048039', 'in00000048159', 'in00000048167', 'in00000048595', 'in00000048442', 'in00000048422', 'in00000048623', 'in00000048635', 'in00000048641', 'in00000048737', 'in00000048751', 'in00000048755', 'in00000048756', 'in00000048763', 'in00000048767');

Fixed the 23 records from Book2.csv:

update sul_mod_source_record_storage.records_lb
set state = 'ACTUAL'
where id in ('5ff8f78f-a2fe-4b69-a35a-8d8715fc7dd0', 'd64c8dd8-78af-40ac-a0ec-e9771594a260', '059fe2fb-e987-4975-a34d-782bdfb9cd46', '7c8ef0ba-4d6a-4e76-900f-de76036f5302', 'edfe4c36-dcf2-42f5-995a-8a3b69a5248c', '398c9d9e-37f2-49df-90d2-9ccf356568f5', '74f85a19-b213-4c12-9e46-8438031f8313', '04b14171-0aa7-4f23-9377-4b8d6cf73921', '4a61681b-becc-41b8-940c-93f1d126a9dc', 'b3047c83-fb44-47ed-99e3-950f257d132a', '5966b64b-4951-4e5d-bff7-e6d798b49b0a', '2f1dd48b-9004-43e6-9f55-bff39a6753e5', '059d9d8b-4117-409b-8d71-11d052143ed0', '464f8372-23bb-4b70-a39c-9c3cc73f2858', '59348611-209f-4d20-8954-86cdf35e65be', '5e423f61-2fd0-414d-9fd0-94420c95e06a', 'f5246c6f-0b46-420b-8cae-40ebc8fe8b2e', 'ad883531-0876-40ea-be44-895e32b3c71c', '221a6d0e-b7c4-4840-b2d4-d6d9faece19d', '0830ceb6-7786-47e7-b451-ec7ae1bfa849', '1e4d9c25-ecb9-4ec9-a02a-2bdf88eafa7d', '4bf7845b-b60f-4827-b20d-ff35625710d1', '5fd1d974-e2c2-4076-8877-7bc44ad0fe2e');
ahafele commented 7 months ago

Putting couple additional records that need fixing here. Can create a new ticket if preferred. in00000048016 in00000047295 in00000047284

Actually I'm wondering if we should use the metadb query from Ryan to get a list of others that should be fixed in this way? If that makes sense, I can run that query for most recent list of instance hrids

shelleydoljack commented 7 months ago

I'm not sure which query from Ryan you are referring to.

I fixed these 3 you listed above.

ahafele commented 7 months ago

New examples will be ticketed. May add to list of jobs that run periodically. #611