hbz / digitalisiertedrucke

Implements http://digitalisiertedrucke.de/

Missing collections #32

Closed. fsteeg closed this issue 7 years ago.

fsteeg commented 7 years ago

Missing collections:

But they have associated titles:

fsteeg commented 7 years ago

This is odd. The old website lists 146 collections [1] and we have 147 (the additional one is probably our own supercollection), so the counts do not fit with 9 missing collections. Maybe these are wrong collection IDs? In the old system, collections have numerical IDs like the resources (see the links in [1]), while we use the content of 992 a as the ID. That's a custom field, right? @dr0i, do you remember why you picked that?

The collections have a 024 field, which sounds like the thing to use for the ID [2]:

<marc:datafield ind1="8" ind2=" " tag="024">
    <marc:subfield code="a">oai:digitalisiertedrucke.de:46713</marc:subfield>
    <marc:subfield code="p">collection</marc:subfield>
</marc:datafield>

Would it make sense to use that number as the ID? It would yield URIs like http://digitalisiertedrucke/collections/46713 for collections, instead of the current http://digitalisiertedrucke/collections/feldzeitungen.ub.hd.de. I don't know whether that would solve the issue here, but it seems more correct to me. It might also fix some other collection-related issues we have.
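To make the proposal concrete, here is a minimal sketch of pulling the trailing number out of 024 a and turning it into a collection URI. It assumes the records use the standard MARCXML namespace behind the `marc:` prefix; the function names `numeric_id_from_024` and `collection_uri` are made up for illustration and are not part of our transformation.

```python
import xml.etree.ElementTree as ET

# Assumption: the marc: prefix is bound to the standard MARCXML namespace.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

SAMPLE = """<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:datafield ind1="8" ind2=" " tag="024">
    <marc:subfield code="a">oai:digitalisiertedrucke.de:46713</marc:subfield>
    <marc:subfield code="p">collection</marc:subfield>
  </marc:datafield>
</marc:record>"""


def numeric_id_from_024(record):
    """Return the trailing number of the first 024 $a value, or None."""
    subfield = record.find(
        "marc:datafield[@tag='024']/marc:subfield[@code='a']", MARC_NS
    )
    if subfield is None or not subfield.text:
        return None
    # "oai:digitalisiertedrucke.de:46713" -> "46713"
    candidate = subfield.text.strip().rsplit(":", 1)[-1]
    return candidate if candidate.isdigit() else None


def collection_uri(numeric_id, base="http://digitalisiertedrucke/collections/"):
    """Build a collection URI in the form proposed above (hypothetical helper)."""
    return base + numeric_id


record = ET.fromstring(SAMPLE)
print(collection_uri(numeric_id_from_024(record)))
```

The sketch only looks at the first 024 field of a record and returns None when the value does not end in a number, so callers can fall back to the current ID.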

[1] http://web.archive.org/web/20130526203147/http://www.digitalisiertedrucke.de/collection/Sammlungsbeschreibungen?ln=de
[2] https://www.loc.gov/marc/bibliographic/bd024.html

acka47 commented 7 years ago

+1 for using the ID from 024 for collections.

fsteeg commented 7 years ago

After some further discussion with @acka47 we figured that the 001 actually contains the ID.

I tested this with our current system and got a resource with the same ID as the collection (see http://beta.digitalisiertedrucke.de/resources/46713), but that seems to be due to some kind of error, since the 001 IDs are actually unique:

$ cat hbz_zvdd_resource_marc.xml | grep "tag=\"001\"" | wc -l
491271
$ cat hbz_zvdd_resource_marc.xml | grep "tag=\"001\"" | sort -u | wc -l
491271
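The grep check compares line counts, which works as long as every 001 field sits on its own line. A rough equivalent that counts elements instead, and also reports which IDs repeat, might look like the sketch below. It assumes 001 is a `controlfield` in the standard MARCXML namespace; `count_001_ids` and the three-record sample are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: 001 is a controlfield in the standard MARCXML namespace.
CONTROLFIELD = "{http://www.loc.gov/MARC21/slim}controlfield"

SAMPLE = b"""<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:record><marc:controlfield tag="001">46713</marc:controlfield></marc:record>
  <marc:record><marc:controlfield tag="001">46714</marc:controlfield></marc:record>
  <marc:record><marc:controlfield tag="001">46713</marc:controlfield></marc:record>
</marc:collection>"""


def count_001_ids(source):
    """Return (total, distinct, duplicates) for all 001 fields in a MARCXML stream."""
    counts = Counter()
    # iterparse streams the file, so a large dump is not parsed into one tree first.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == CONTROLFIELD and elem.get("tag") == "001":
            counts[(elem.text or "").strip()] += 1
        elem.clear()  # drop text and children we no longer need
    duplicates = {key: n for key, n in counts.items() if n > 1}
    return sum(counts.values()), len(counts), duplicates


total, distinct, duplicates = count_001_ids(io.BytesIO(SAMPLE))
print(total, distinct, duplicates)
```

For the real dump, the same function could be called with `open("hbz_zvdd_resource_marc.xml", "rb")`; equal totals and an empty duplicates dict would correspond to the two matching numbers above.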

With this approach, we can also reconstruct the old URLs like http://digitalisiertedrucke.de/record/46713 (I'll open a separate issue for that).

dr0i commented 7 years ago

sounds reasonable. +1

fsteeg commented 7 years ago

So here's the problem: in the 024 field shown in https://github.com/hbz/digitalisiertedrucke/issues/32#issuecomment-255364712, the ID in subfield a does not identify the collection named in subfield p; it is the ID of the title resource itself. See title MAB-MXL in [1].

So it seems the isPartOf relationship is only expressed in our data with the IDs we used before. I guess this was the reason @dr0i used that as the ID when he originally wrote the transformation.

We could either stick with the current IDs, or implement a mapping from these to the numerical ones.

This affects #45 (restoring old URLs).

What do you think @acka47?

[1] https://github.com/hbz/digitalisiertedrucke/wiki/Example-resources

acka47 commented 7 years ago

@dr0i and I just sat down to understand this. It seems that, towards the end of digitalisiertedrucke, it was decided not to add any more collection descriptions and to create collections only by linking records together. That's why 8-9 collections are missing. We already created rudimentary descriptions for these with https://github.com/lobid/lodmill/blob/master/lodmill-rd/transformations/zvdd/data/missing_collections.ttl. These should just be added to the data.

For two collections there is no description, though: collection:einblattdrucke_vd17.gbv.goe.de & collections:zvdd.hbz.k.de. The first is probably an error and we should use collection:einblattdrucke.vd17.oo.de instead (see also https://github.com/lobid/lodmill/blob/master/lodmill-rd/transformations/zvdd/statistic/mismatching_collection-IDs.textile). The second is the supercollection for all other collections. I think we can completely discard it in the UI.
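The two special cases could be handled with a small normalization step, sketched here under the assumption that collection IDs arrive as plain strings without the `collection:` prefix. `CORRECTIONS`, `HIDDEN` and `normalize_collection_ids` are hypothetical names, not existing code.

```python
# Known wrong ID -> replacement suggested in this thread.
CORRECTIONS = {"einblattdrucke_vd17.gbv.goe.de": "einblattdrucke.vd17.oo.de"}

# Supercollection of all other collections; not meant to be shown in the UI.
HIDDEN = {"zvdd.hbz.k.de"}


def normalize_collection_ids(ids):
    """Apply known corrections, drop hidden collections, keep first-seen order without duplicates."""
    result = []
    for raw in ids:
        fixed = CORRECTIONS.get(raw, raw)
        if fixed in HIDDEN or fixed in result:
            continue
        result.append(fixed)
    return result


print(normalize_collection_ids([
    "zvdd.hbz.k.de",
    "einblattdrucke_vd17.gbv.goe.de",
    "feldzeitungen.ub.hd.de",
]))
```

Keeping the corrections in one table would also make it easy to add further entries from the mismatching-IDs list if more turn up.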

Regarding the identifiers, we should keep using the literal ones for collections and look into using the /record/$ID pattern for single resources. That way, we would regain the old URLs for most resources.
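One possible way to honor the old URLs is to map /record/$ID onto the current /resources/$ID path. The sketch below only shows the path mapping; it assumes old record IDs are purely numeric, and `resource_path_for` is a hypothetical helper, not how the application routes requests today.

```python
import re
from urllib.parse import urlsplit

# Assumption: old record URLs end in a purely numeric ID.
OLD_RECORD_PATH = re.compile(r"^/record/(\d+)/?$")


def resource_path_for(old_url):
    """Map an old /record/$ID URL to a /resources/$ID path, or None if it does not match."""
    match = OLD_RECORD_PATH.match(urlsplit(old_url).path)
    return f"/resources/{match.group(1)}" if match else None


print(resource_path_for("http://digitalisiertedrucke.de/record/46713"))
```

Collection URLs with literal IDs do not match the pattern and yield None, so they would stay untouched.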

acka47 commented 7 years ago

As discussed with @fsteeg today, I will add the data from https://github.com/lobid/lodmill/blob/master/lodmill-rd/transformations/zvdd/data/missing_collections.ttl as JSON to https://github.com/hbz/digitalisiertedrucke/tree/master/conf.

I will also use the "32-fixEinblattudrucke" branch for this.

fsteeg commented 7 years ago

Deployed to staging:

I will implement the URL pattern described in https://github.com/hbz/digitalisiertedrucke/issues/32#issuecomment-256020938 as part of #45.

acka47 commented 7 years ago

+1