hathitrust / hathifiles

Generation of Hathifiles

GS-2379: Case of the missing source bib numbers #29

Closed aelkiss closed 1 year ago

aelkiss commented 1 year ago

I don't love having to rely on the contributor config submodule especially since it's pretty much guaranteed to break whenever we get new contributors if we don't update here. ~I think before there was essentially just an 'exception list' where the mapping wasn't the trivial one. I will probably add some logic to do that.~ Longer term, we should think about a better way of getting this information (maybe an API in Zephir, or in the worst case at least fetching these when we run instead of needing them baked into the image?)

The underlying issue with missing source bib nums has to do with the build process not including submodules in the image; see https://github.com/hathitrust/github_actions/pull/6

aelkiss commented 1 year ago

Zephir folks @bwcormack @cscollett @jsjiang -- do you recall @jsteverman talking about getting the contributor configs this way? Do you have any other thoughts about how to get the collection/campus code information for mapping sdrnum prefixes to contributor for use in hathifile generation?

There's also a static map here https://github.com/hathitrust/post_zephir_processing/blob/main/data/sdr_num_prefix_map.tsv but as far as I can tell it's only used to emit a warning if there's no mapping. It may previously have been used in hathifile generation.

coveralls commented 1 year ago

Coverage Status

Coverage increased (+1.1%) to 96.429% when pulling 9c1077e4c70f0a5127680d1d12eda22831dc0afe on GS-2379-missing-source-bib-num into 7b9480f97b9e2827e4244efc6130610e8c2e5034 on main.

cscollett commented 1 year ago

Sorry, I have no recollection of discussing this with Josh, nor can I find any messages about it. This may be derived from the shared table for configs/streams on HT Google Drive.

aelkiss commented 1 year ago

Related discussions:

https://hathitrust.slack.com/archives/C6SB35YCU/p1665681347510709 (talking about the functionality and mapping table in post_zephir_processing for sdrnum)

https://hathitrust.atlassian.net/browse/DEV-457 for removing the check for sdrnum to contributor map in post-zephir processing, now that it's no longer used for hathifile generation there

aelkiss commented 1 year ago

I also think it seems reasonable to use a default mapping if the sdrnum prefix isn't found in the config, which should mitigate the issue with needing to constantly update. @bwcormack It looks like new configs use the same thing for campus_code and collection (except collection is uppercased). Are there any cases where a new config would use something different for campus_code and collection?
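A minimal sketch of the default mapping proposed above, written in Python purely for illustration. The function name `collection_for` and the `CONFIGURED` table are hypothetical, and the sketch assumes the sdrnum prefix is the same string as the config's campus_code; the single table entry is an example reported later in this thread.

```python
# Hypothetical table of explicitly configured prefix -> collection pairs.
# In practice these would come from the Zephir contributor configs.
CONFIGURED = {
    "uiowa": "IaU",  # example from this thread where the codes do not match
}


def collection_for(prefix: str, configured: dict[str, str]) -> str:
    """Return the configured collection code for an sdrnum prefix.

    If the prefix has no explicit entry, fall back to the default
    proposed in the discussion: the prefix itself, uppercased.
    """
    return configured.get(prefix, prefix.upper())


# A prefix with an explicit entry uses the configured value.
explicit = collection_for("uiowa", CONFIGURED)

# A prefix with no entry falls back to the uppercased prefix.
fallback = collection_for("gu", CONFIGURED)
```

With a fallback like this, a new contributor whose codes follow the current convention would not need an entry at all; only the historical exceptions have to be listed.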

billdueber commented 1 year ago

Code looks great to me. Is there a reason it's still in draft?

bwcormack commented 1 year ago

Hi Aaron,

For the newer Zephir configs, we assign a campus_code that’s the same as the config’s name, minus any numbers. For example: gu-1.config has campus_code of gu. The collection_code in Zephir configs is assigned by HathiTrust. It does often match the campus_code and sometimes also the code for source in the Zephir config. (I keep repeating “Zephir config” to make clear which configuration file I’m talking about.) I try to use the MARC organization code for the config name and campus_code, unless it’s something unwieldy. What practice does HT follow for assigning the collection code?

Conventions for assigning these codes have, unsurprisingly, varied over time. There are some configs where the campus_code and collection do not match. Some examples:

- uiowa-2.config has campus_code uiowa and collection IaU.
- ia-duke.config has campus_code ia-duke and collection IDUKE.
- txsu-1.config has campus_code txsu and collection TXSMTSU.

Let me know if you need more examples or comments about this.
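The naming convention and the exceptions described above can be sketched as follows. This is an illustration in Python, not code from either repository: the helper names are hypothetical, and the trailing `-<digits>` rule is an assumption drawn from the `gu-1.config` example rather than a documented Zephir rule. The three tuples are the examples given in this comment.

```python
import re

# (config file name, campus_code, collection) as reported in this thread.
THREAD_EXAMPLES = [
    ("uiowa-2.config", "uiowa", "IaU"),
    ("ia-duke.config", "ia-duke", "IDUKE"),
    ("txsu-1.config", "txsu", "TXSMTSU"),
]


def campus_code_from_config(filename: str) -> str:
    """Derive a campus_code from a config file name.

    Assumed rule: drop the ".config" extension, then drop a trailing
    "-<digits>" suffix if one is present.
    """
    stem = re.sub(r"\.config$", "", filename)
    return re.sub(r"-\d+$", "", stem)


def needs_explicit_entry(campus_code: str, collection: str) -> bool:
    """True when the uppercased campus_code would not yield the collection."""
    return campus_code.upper() != collection


# Every example above deviates from the uppercase default, so each one
# would need an explicit entry in whatever mapping table is kept.
exceptions = [
    campus for _, campus, collection in THREAD_EXAMPLES
    if needs_explicit_entry(campus, collection)
]
```

A check like `needs_explicit_entry` could also be run over all existing configs to find which ones a default mapping would not cover.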

From: Aaron Elkiss @.> Sent: Thursday, November 10, 2022 8:38 AM To: hathitrust/hathifiles @.> Cc: Barbara Cormack @.>; Mention @.> Subject: Re: [hathitrust/hathifiles] GS-2379: Case of the missing source bib numbers (PR #29)


aelkiss commented 1 year ago

We use the MARC organization code for the collection code. I can't think of any recent examples that have not, although (as you say) there are historical examples where it doesn't match.

aelkiss commented 1 year ago

@billdueber Based on the discussion above I will add a default for the contributor to collection code map & then I think this will be ready.

aelkiss commented 1 year ago

@billdueber Added a commit based on the conversation above. I think this is ready to go now.

Relatedly, I also feel good about doing https://hathitrust.atlassian.net/browse/DEV-457 for removing the check for sdrnum to contributor map in post-zephir processing, since I understand why it was there to begin with and we're confident it's not necessary there.

aelkiss commented 1 year ago

~I just looked at an actual record and it looks like it has the local bib num in the HOL$1 (with the HTID in HOL$p). That would obviate the need to do any of this mapping on the hathifiles side.~

Unfortunately there's only one HOL field (presumably for the preferred record) so we do need to do the mapping.

aelkiss commented 1 year ago

@billdueber and I looked again, and this seems fine. Merging now; will deploy on Monday and attempt to regenerate the files for days with missing local bib numbers.