sul-dlss / searchworks_traject_indexer

indexing MARC, MODS, and more for SearchWorks

draft data specification for record export from FOLIO #639

Closed thatbudakguy closed 2 years ago

thatbudakguy commented 2 years ago

background

with the move to FOLIO, we need to re-evaluate the record format currently produced by ExportSearchWorks via querying Symphony APIs. these records are picked up by traject and used to generate the solr documents that power SearchWorks. for an overview of how this happens, see MARC extract from Symphony (google doc).

part of the transition process will also be decommissioning ExportSearchWorks, so individual parts of its business logic might move to Airflow DAGs, into FOLIO workflows, or into searchworks-traject-indexer. understanding the type of records that we need to generate will help inform this process.

we also could cease using binary marc as an export format and move to JSON or another friendlier record type, since FOLIO exports are JSON and there's no particular reason traject needs to consume binary marc.

a rough overview of what we would (ideally) pass to traject, via @cbeer:

 for each bib record:
   get the bib record out of SRS,
   merge in all the holdings,
   merge in all the instance item data,
   merge in any reserves data,
   merge in a mapping of UUID => { short id, label, ?? }  for libraries, locations, item types, etc

this would result in a record that gives traject all it needs to construct a solr document capable of powering discovery in SW, and delivery/access via requests and course reserves.

much of this merge work currently happens in traject-land.
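the merge steps above can be sketched roughly as follows. this is only an illustration in Python, not the agreed design: the function name `build_export_record`, the top-level key names, and the idea that everything is already fetched into memory are all assumptions made for the example.

```python
from typing import Any


def build_export_record(
    source_record: dict[str, Any],
    holdings: list[dict[str, Any]],
    items: list[dict[str, Any]],
    reserves: list[dict[str, Any]],
    references: dict[str, dict[str, Any]],
) -> dict[str, Any]:
    """Merge everything traject needs for one bib record into a single dict.

    All key names in the returned dict are hypothetical.
    """
    holdings_ids = {h["id"] for h in holdings}
    return {
        "sourceRecord": source_record,
        "holdingsRecords": holdings,
        # Keep only the items attached to one of this bib's holdings.
        "items": [i for i in items if i.get("holdingsRecordId") in holdings_ids],
        "reserves": reserves,
        # UUID => {short id, label, ...} for libraries, locations, item types, etc.
        "references": references,
    }
```

the point of the single merged dict is that traject would not need to call back into FOLIO while building a solr document.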

questions

cbeer commented 2 years ago

should we move away from binary marc as an export format? if so, what should be the new format?

Yes. I think we should plan on working with JSON records. I think we'd be fine if we got the bib/holdings/instance data in a shape like:

{
  "sourceRecord": { },
  "holdingsRecords": [],
  "instances": []
}

I'm not sure I understand how holdings records map to MHLD fields or instances to 999s, but we might as well do that data transform in traject-land 🤷‍♂️

I'm not sure what to do about library, location, item type, etc. It might be nice if the relevant mappings were included in the record dump so we don't have to chase them down, even if it results in some slightly larger records.
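As a rough illustration of what shipping the mappings with the record could enable, here is a minimal Python sketch that swaps UUIDs for readable values. The helper name `resolve_ids`, the convention of stripping the `Id` suffix, and the shape of the lookup table are assumptions for this example only.

```python
def resolve_ids(record, lookup, suffix="Id"):
    """Return a copy of record with a sibling entry for each resolvable *Id field.

    lookup maps a UUID to whatever the export ships for it (hypothetical shape).
    """
    resolved = dict(record)
    for key, value in record.items():
        if key.endswith(suffix) and isinstance(value, str) and value in lookup:
            # e.g. "permanentLocationId" gains a sibling "permanentLocation"
            resolved[key[: -len(suffix)]] = lookup[value]
    return resolved
```

Fields whose UUID is not in the lookup are left alone, so a partial mapping would still produce a usable record.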

thatbudakguy commented 2 years ago

bib data: probably using SRS Streaming API, as in this example notebook, partly for performance reasons

holdings data: probably using holdings storage API, so long as we can later transform its UUIDs into more useful values

shelleydoljack commented 2 years ago

We are using the folio_migration repo to map our data to the FOLIO formats. Looking at the mapping files might bring some things to light.

Also, I think you mean "items" instead of "instances" in the above comments. FOLIO instances equate to title/instance-level description, like MARC bibliographic data.

{
  "sourceRecord": { },
  "holdingsRecords": [],
  "items": []
}

Would the field "sourceRecord" contain the MARC in JSON or would it look like what we get from the SRS streaming API? Would the holdingsRecords and items contain those objects as they are coming out of FOLIO or would they be processed in some way before handing off to traject?

cbeer commented 2 years ago

I guess holdingsRecords and items are redundant? https://s3.amazonaws.com/foliodocs/api/mod-inventory-storage/p/holdings-storage.html#holdings_storage_holdings__holdingsrecordid__get seems to include both

thatbudakguy commented 2 years ago

@shelleydoljack thanks for the terminology correction! i edited the top post for clarity.

cbeer commented 2 years ago

Would the field "sourceRecord" contain the MARC in JSON or would it look like what we get from the SRS streaming API?

I think the whole output of e.g. https://s3.amazonaws.com/foliodocs/api/mod-source-record-storage/p/source-record-storage-records.html#source_storage_records__id__get would be useful in some way. I'm not sure what the SRS streaming API provides.
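If the source record ends up carrying MARC-in-JSON (an assumption here, since the question above is still open), reading a value out of it might look like the sketch below. The function name `subfield_values` is made up, and the input is assumed to follow the usual MARC-in-JSON layout of a `fields` list whose entries are single-key objects keyed by tag.

```python
def subfield_values(marc_json, tag, code):
    """Collect every value of subfield `code` from fields with the given tag."""
    values = []
    for field in marc_json.get("fields", []):
        data = field.get(tag)
        # Control fields (00X) are plain strings and have no subfields.
        if not isinstance(data, dict):
            continue
        for subfield in data.get("subfields", []):
            if code in subfield:
                values.append(subfield[code])
    return values
```

Something like `subfield_values(marc, "245", "a")` would then return the title strings, if any.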

shelleydoljack commented 2 years ago

I guess holdingsRecords and items are redundant? https://s3.amazonaws.com/foliodocs/api/mod-inventory-storage/p/holdings-storage.html#holdings_storage_holdings__holdingsrecordid__get seems to include both

Oh yeah, it looks like item records would be included in the field "holdingsItems". Hmm, for some reason I'm not seeing any data in that field even though there should be.

On folio-test, /holdings-storage/holdings/5342092f-9f84-5942-b557-02a30b071751

{
    "id": "5342092f-9f84-5942-b557-02a30b071751",
    "_version": 1,
    "hrid": "ah5660140_1",
    "holdingsTypeId": "03c9c400-b9e3-4a07-ac0e-05ab470233ed",
    "formerIds": ["a5660140"],
    "instanceId": "2ec12a10-c73f-56aa-a46a-ae40278fa74e",
    "permanentLocationId": "9a0e5db7-2dda-4bdd-bca1-840f4d028b2b",
    "effectiveLocationId": "9a0e5db7-2dda-4bdd-bca1-840f4d028b2b",
    "electronicAccess": [],
    "callNumberTypeId": "95467209-6d7b-468b-94df-0f5d7ad2747d",
    "callNumber": "NK1510 .I52",
    "administrativeNotes": [],
    "notes": [],
    "holdingsStatements": [],
    "holdingsStatementsForIndexes": [],
    "holdingsStatementsForSupplements": [],
    "statisticalCodeIds": [],
    "holdingsItems": [],
    "bareHoldingsItems": [],
    "metadata": {
        "createdDate": "2022-06-21T21:49:37.736+00:00",
        "createdByUserId": "8cc3ab86-c943-4d53-8df7-b3dc64fb44ee",
        "updatedDate": "2022-06-21T21:49:37.736+00:00",
        "updatedByUserId": "8cc3ab86-c943-4d53-8df7-b3dc64fb44ee"
    },
    "sourceId": "f32d531e-df79-46b3-8932-cdd35f7a2264"
}

And its item record: /item-storage/items?query=holdingsRecordId==5342092f-9f84-5942-b557-02a30b071751

{
    "items": [
        {
            "id": "3303ac63-098a-5257-a13d-63c00ef335cf",
            "_version": 1,
            "hrid": "ai5660140_1",
            "holdingsRecordId": "5342092f-9f84-5942-b557-02a30b071751",
            "formerIds": [],
            "barcode": "5660140-1001",
            "effectiveShelvingOrder": "NK 41510 I52 BASE",
            "effectiveCallNumberComponents": {
                "callNumber": "NK1510 .I52",
                "typeId": "95467209-6d7b-468b-94df-0f5d7ad2747d"
            },
            "volume": "BASE",
            "yearCaption": [],
            "administrativeNotes": [],
            "notes": [
                {
                    "itemNoteTypeId": "079c6747-3297-48db-a96a-524c97b42916",
                    "note": "BASE CALL NUMBER. DO NOT REMOVE OR REUSE. (dlb 10/27/15); nocopy.aja6/16/2004; |aaja12/17/2004/(V.12:1(2004))/SO  /i:frl 22 June 2006",
                    "staffOnly": true
                }
            ],
            "circulationNotes": [],
            "status": {
                "name": "Available",
                "date": "2022-06-21T21:49:23.212+00:00"
            },
            "materialTypeId": "d934e614-215d-4667-b231-aed97887f289",
            "permanentLoanTypeId": "2b94c631-fca9-4892-a730-03ee529ffe27",
            "permanentLocationId": "9a0e5db7-2dda-4bdd-bca1-840f4d028b2b",
            "effectiveLocationId": "9a0e5db7-2dda-4bdd-bca1-840f4d028b2b",
            "electronicAccess": [],
            "statisticalCodeIds": [],
            "metadata": {
                "createdDate": "2022-06-21T21:50:04.430+00:00",
                "createdByUserId": "8cc3ab86-c943-4d53-8df7-b3dc64fb44ee",
                "updatedDate": "2022-06-21T21:50:04.430+00:00",
                "updatedByUserId": "8cc3ab86-c943-4d53-8df7-b3dc64fb44ee"
            }
        }
    ],
    "totalRecords": 1,
    "resultInfo": {
        "totalRecords": 1,
        "facets": [],
        "diagnostics": []
    }
}

shelleydoljack commented 2 years ago

After discussion with the PO and developers, here is the response regarding the fields "holdingsItems" and "bareHoldingsItems" from the /holdings-storage/holdings/{holdingsId} endpoint:

It’s worth keeping in mind that these aren’t real properties on the records. They aren’t intended for use by users of the API or the UI. They are configuration for how mod-graphql navigates between record types. They are an artefact of how we chose to annotate the APIs for GraphQL adoption.

I think we should query /item-storage/items?query=holdingsRecordId=={UUID} to get the associated item data.
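A minimal Python sketch of that lookup, for illustration only: the helper names are hypothetical, and the actual HTTP call, authentication, and any paging of large result sets are left out. It only builds the request path and pulls the `items` array out of a response body shaped like the one shown earlier in this thread.

```python
import json
from urllib.parse import quote


def items_query_path(holdings_id: str) -> str:
    """Build the item-storage path for all items attached to one holdings record."""
    cql = f"holdingsRecordId=={holdings_id}"
    # Percent-encode the CQL so the '==' survives as a query-string value.
    return "/item-storage/items?query=" + quote(cql, safe="")


def items_from_response(body: str) -> list:
    """Return the 'items' array from a response body, or an empty list."""
    return json.loads(body).get("items", [])
```

Keeping the path builder separate from the fetch makes it easy to reuse with whatever HTTP client the export job ends up using.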

thatbudakguy commented 2 years ago

closing this ticket since we're pretty confident in our JSON record structure in traject now and the remaining questions have their own tickets (linked in top comment).