srophe / syriaca-data

Repository for Syriaca.org TEI data, used by srophe-eXist-app.

Bibl records not included in API Download from Zotero #1191

Open wlpotter opened 2 months ago

wlpotter commented 2 months ago

@wsalesky I did a bit more digging after #1186 and think I've figured out which records are missing.

First, we have the same number of XML files and the exact same item keys as the records from the Zotero data dump that we got back in March/April. It does look like 1,797 of those items are notes, however. This means:

  1. I think we should revert the transform so it does filter out the items with "itemType": "note"
  2. We look to be missing about 1,500 records from Zotero

Next, I compared the item keys from an export from the Zotero desktop client and found that 1,546 item keys from Zotero do not appear in the TEI XML files or the API data dump. (Note that we did add new records starting last week, but I filtered those out based on the "dateAdded" field.)
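A minimal sketch of that comparison, assuming the desktop export has been loaded as a list of dicts with `key` and `dateAdded` fields and that `dateAdded` is an ISO 8601 timestamp. The function name, the record shape, the cutoff date, and every sample key other than `8VSKN4EE` are hypothetical.

```python
from datetime import datetime, timezone


def missing_item_keys(desktop_items, known_keys, added_before):
    """Return sorted keys from the desktop export that are not in known_keys.

    Items whose dateAdded is on or after added_before are skipped, so
    recently added records are not reported as missing.
    """
    missing = set()
    for item in desktop_items:
        # Assumption: timestamps look like "2024-03-01T00:00:00Z".
        added = datetime.fromisoformat(item["dateAdded"].replace("Z", "+00:00"))
        if added >= added_before:
            continue
        if item["key"] not in known_keys:
            missing.add(item["key"])
    return sorted(missing)


# Hypothetical sample data; only 8VSKN4EE comes from this thread.
desktop_items = [
    {"key": "8VSKN4EE", "dateAdded": "2024-03-01T00:00:00Z"},
    {"key": "AAAA1111", "dateAdded": "2024-03-02T00:00:00Z"},
    {"key": "BBBB2222", "dateAdded": "2024-08-01T00:00:00Z"},
]
known_keys = {"8VSKN4EE"}  # keys already present as TEI XML / in the API dump
cutoff = datetime(2024, 7, 1, tzinfo=timezone.utc)  # hypothetical cutoff

print(missing_item_keys(desktop_items, known_keys, cutoff))
```

Using a set for `known_keys` keeps each membership check constant-time, which matters little at this scale but keeps the comparison simple.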

I have a list of those item keys, so I could write a quick script to query the API for those specific item keys and download the JSON to run through the transform.
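A rough sketch of what such a script might look like, limited to building the request URLs. It assumes the Zotero Web API's `itemKey` parameter (a comma-separated list, capped at 50 keys per request) and the group ID 4861694 from the group library URL; the function names and the placeholder keys are hypothetical, and the download step itself is left out.

```python
from urllib.parse import urlencode

API_BASE = "https://api.zotero.org"
GROUP_ID = 4861694          # taken from the group library URL
MAX_KEYS_PER_REQUEST = 50   # Zotero Web API cap on itemKey values


def batched(keys, size):
    """Yield consecutive slices of keys with at most size entries each."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def item_request_urls(keys, group_id=GROUP_ID):
    """Build one items-endpoint URL per batch of item keys."""
    urls = []
    for batch in batched(keys, MAX_KEYS_PER_REQUEST):
        query = urlencode({
            "itemKey": ",".join(batch),
            "format": "json",
            # Set explicitly so a full batch can come back in one response.
            "limit": MAX_KEYS_PER_REQUEST,
        })
        urls.append(f"{API_BASE}/groups/{group_id}/items?{query}")
    return urls


# Placeholder keys standing in for the real list of 1,546 missing keys.
placeholder_keys = [f"K{i:07d}" for i in range(1546)]
urls = item_request_urls(placeholder_keys)
print(len(urls))
```

Each URL could then be fetched with any HTTP client and the response body written to disk; a non-public library would also need an API key on the request.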

wlpotter commented 1 month ago

@wsalesky I have downloaded the JSON of the missing Zotero records; they are currently in a separate repo: https://github.com/wlpotter/zotcsv/tree/main/data. There are 1,546 of them, chunked into files of 100 items each. Can we add these to the existing folder of data downloaded from Zotero and re-run the transform?

One more request: when we re-run the transform, can we update the bibl URI base to use "/cbss/" rather than "/bibl/"? For example, http://syriaca.org/bibl/8VSKN4EE (in Dev: https://dev.syriaca.org/bibl/8VSKN4EE) would become http://syriaca.org/cbss/8VSKN4EE.
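To illustrate the requested change, here is a small sketch of the URI rewrite. The transform itself is not shown in this thread, so this is not its implementation; the function name is hypothetical and the rule only touches URIs that start with the old bibl base.

```python
OLD_BASE = "http://syriaca.org/bibl/"
NEW_BASE = "http://syriaca.org/cbss/"


def to_cbss_uri(uri):
    """Swap the /bibl/ base for /cbss/; leave any other URI unchanged."""
    if uri.startswith(OLD_BASE):
        return NEW_BASE + uri[len(OLD_BASE):]
    return uri


print(to_cbss_uri("http://syriaca.org/bibl/8VSKN4EE"))
```

Matching on the full base rather than replacing every occurrence of "/bibl/" avoids rewriting URIs that merely contain that segment somewhere else in the path.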

wsalesky commented 4 weeks ago

Updates here: https://github.com/srophe/syriaca-data/pull/1196

wlpotter commented 3 weeks ago

The updates to the idno format look good. I am still a bit confused about the number of records, though. We have 29,331 TEI XML bibls now but should only have 27,545. Are we still transforming notes, or are they getting filtered out? I think we may still have some notes getting transformed, e.g. https://github.com/srophe/syriaca-data/blob/cbssDataUpdate8-8-24/data/bibl/tei/75RBT6SK.xml, which is a note (cf. https://www.zotero.org/groups/4861694/a_comprehensive_bibliography_on_syriac_studies/items/RUT3P26M/note/75RBT6SK/library).
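For reference, the gap between those counts is 29,331 - 27,545 = 1,786 records. One way to check for leftover notes before transforming is sketched below. It assumes each downloaded JSON item carries `itemType` either at the top level or under a `data` object; the function names and the sample items are hypothetical, and this is not the filter used in the actual transform.

```python
def is_note(item):
    """True if the item's itemType is "note".

    Assumption: itemType sits under "data" when that object is present,
    otherwise at the top level of the item.
    """
    data = item.get("data", item)
    return data.get("itemType") == "note"


def drop_notes(items):
    """Return the items that are not notes, preserving order."""
    return [item for item in items if not is_note(item)]


# Hypothetical sample records; the keys are the ones cited in this comment.
sample = [
    {"key": "RUT3P26M", "data": {"itemType": "book"}},
    {"key": "75RBT6SK", "data": {"itemType": "note"}},
]
kept = drop_notes(sample)
print([item["key"] for item in kept])

surplus = 29331 - 27545  # TEI files present minus expected
print(surplus)
```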

wsalesky commented 3 weeks ago

@wlpotter Okay. I will take another look; maybe there is another way to filter. Don't merge this; I will make a new branch with the updated data.