hbz / lobid-resources

Transformation, web frontend, and API for the hbz catalog as LOD
http://lobid.org/resources
Eclipse Public License 2.0

Collection from Union Catalogue missed in lobid #1052

Closed hagbeck closed 4 years ago

hagbeck commented 4 years ago

The union catalogue contains collections, e.g. for e-books, whose names do not start with "ZDB". These are missing in lobid.

In http://lobid.org/resources/HT020022241.json there is the collection name "cuvillier". Other collection names can be found in https://service-wiki.hbz-nrw.de/display/VDBE/Produktsigel+und+interne+Selektionskennzeichen

acka47 commented 4 years ago

Thanks for the request. Collection information is in 078 in the source data, from the example http://lobid.org/hbz01/HT020022241:

<datafield tag="078" ind1="e" ind2="1">
  <subfield code="a">cuvillier</subfield>
</datafield>

Currently, we are only transforming some dedicated collection information (NWBib, Edoweb, FRL, ZDB) to RDF. We will have to think about a way for adding generic collection information. Maybe with a bnode, so that the resulting JSON could look like this:

{
    "inCollection": [
        {
            "label": "cuvillier"
        }
    ]
}

However, before approaching this it would be nice to know how many different collections are named in 078 and what their tags are.
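
To make the proposed mapping concrete, here is a minimal Python sketch that reads the 078 $a subfields from a record fragment and wraps them in the blank-node style JSON shown above. It is only an illustration, not the actual lobid transformation: the `<record>` wrapper element, the absence of XML namespaces, and the function names `collection_labels` and `to_in_collection` are assumptions.

```python
import json
import xml.etree.ElementTree as ET

# Record fragment from the comment above, wrapped in a hypothetical
# <record> element so that it parses as a standalone document.
SAMPLE = """<record>
  <datafield tag="078" ind1="e" ind2="1">
    <subfield code="a">cuvillier</subfield>
  </datafield>
</record>"""


def collection_labels(record_xml):
    """Return the text of every 078 $a subfield in the record."""
    root = ET.fromstring(record_xml)
    labels = []
    for field in root.iter("datafield"):
        if field.get("tag") != "078":
            continue
        for sub in field.iter("subfield"):
            if sub.get("code") == "a" and sub.text:
                labels.append(sub.text.strip())
    return labels


def to_in_collection(labels):
    """Wrap the labels in the blank-node style JSON proposed above."""
    return {"inCollection": [{"label": label} for label in labels]}


result = to_in_collection(collection_labels(SAMPLE))
print(json.dumps(result, indent=4))
```

Real source data may declare a namespace on `datafield` and `subfield`, in which case the element names in the sketch would need to be qualified accordingly.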

dr0i commented 4 years ago

However, before approaching this it would be nice to know how many different collections are named in 078 and what their tags are.

We could do this with ES aggs, but this would mean explicitly allowing aggregations over labels (meaning: new index config, and performance impact). Doing this on the 36k smallTest reveals just:

{ "key" : "Zeitschriftendatenbank (ZDB)", "doc_count" : 163 },
{ "key" : "eResource package", "doc_count" : 129 },
{ "key" : "Nordrhein-Westfälische Bibliographie (NWBib)", "doc_count" : 123 },
{ "key" : "Edoweb Rheinland-Pfalz", "doc_count" : 16 },
{ "key" : "Elektronische Zeitschriftenbibliothek (EZB)", "doc_count" : 10 },
{ "key" : "Fachrepositorium Lebenswissenschaften", "doc_count" : 10 },
{ "key" : "Rheinland-Pfälzische Bibliographie", "doc_count" : 8 }

So you may still want to have a complete list for the whole index?

acka47 commented 4 years ago

However, before approaching this it would be nice to know how many different collections are named in 078 and what their tags are.

The list with all the collections is linked in the original issue comment: https://service-wiki.hbz-nrw.de/display/VDBE/Produktsigel+und+interne+Selektionskennzeichen

Here are the 36 IDs that are not an ISIL:

asmi
budri
caso
chiso
Cont
cuvillier
dawsonera
edso 
editlib
elgar
lyell
hade
hirzel
huguenots
iorm
kenso
learntechlib
Logos
mansi
misso
MPSO
mpig
NNg
obp
oso
pearson
philon
luther
smalib
minnso
uncso
vkal
vkv
vogel
wageningen
woodhead
wtm

Here is another proposal to model this in JSON-LD with a newly added identifier key that is mapped to dct:identifier. The advantage would be that we'd also have the label; a disadvantage is that we'd have to maintain an id->label map:

{
    "@context":{
        "@import":"http://lobid.org/resources/context.jsonld",
        "identifier":"http://purl.org/dc/terms/identifier"
    },
    "inCollection":[
        {
            "identifier":"cuvillier",
            "label":"Cuvillier-E-Books"
        }
    ]
}

dr0i commented 4 years ago

Hm, would it make sense to use something like this:

"inCollection":[
    {
        "id":"https://lobid.org/vocabs/cuvillier",
        "label":"Cuvillier-E-Books"
    }
]

With this approach we could utilize Etikett to set the label as we normally do. Also, we could provide some more data about the collections if we want. Disadvantage: we have to enhance the vocabs.

acka47 commented 4 years ago

+1 But I think we should then use lobid-resources URIs that – in this case – would not resolve:

"inCollection":[
    {
        "id":"https://lobid.org/resources/cuvillier",
        "label":"Cuvillier-E-Books"
    }
]

acka47 commented 4 years ago

Decision after offline discussion:

"inCollection":[
    {
        "id":"https://lobid.org/collections#cuvillier",
        "label":"Cuvillier-E-Books"
    }
]

For now, these will not resolve but we could add a file at https://lobid.org/collections in the future if needed.
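
A minimal Python sketch of the agreed shape: the collection ID from 078 is appended to the `https://lobid.org/collections#` base, and a label is attached when one is known. The `LABELS` dict is a hypothetical stand-in for whatever label lookup is used in practice, the function names are made up for illustration, and only the cuvillier label is taken from this thread.

```python
BASE = "https://lobid.org/collections#"

# Hypothetical excerpt of an id -> label map; only the cuvillier
# entry comes from the discussion above.
LABELS = {"cuvillier": "Cuvillier-E-Books"}


def in_collection_entry(collection_id, labels=LABELS):
    """Build one inCollection entry; omit the label if none is known."""
    entry = {"id": BASE + collection_id}
    label = labels.get(collection_id)
    if label is not None:
        entry["label"] = label
    return entry


def in_collection(collection_ids, labels=LABELS):
    """Build the inCollection array for all IDs found in a record."""
    return {"inCollection": [in_collection_entry(c, labels) for c in collection_ids]}


print(in_collection(["cuvillier", "no-label-yet"]))
```

Leaving the label out for unknown IDs is one possible choice; falling back to the raw ID as the label would be another.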

dr0i commented 4 years ago

There seem to be more than just 36 IDs, e.g. dilibri. After complete indexing we can obtain a list of them by querying the api.

dr0i commented 4 years ago

I'm a little confused about why you removed the dilibri label. If it's not an e-book collection at all (which seems possible, see the definition of dilibri), it should be given a better name than the "Dilibri E-Book" I gave it. Worse, if it's not a collection at all (which IMO it is), it should be filtered out completely in the morph. For now it is subsumed under inCollection without a proper label.

dr0i commented 4 years ago

In production. Getting this list of ids with counts:

https://lobid.org/collections#NLZ: 491824
https://lobid.org/collections#ldd: 132898
https://lobid.org/collections#vl-ulbd: 64608
https://lobid.org/collections#Springer: 60456
https://lobid.org/collections#vl-ulbms: 22128
https://lobid.org/collections#dilibri: 8249
https://lobid.org/collections#s2w-zbmed: 6139
https://lobid.org/collections#s2w-retropadubpb: 4042
https://lobid.org/collections#GBV-1-NEF: 3392
https://lobid.org/collections#s2w-ulbbonn: 3241
https://lobid.org/collections#s2w-hsspadubpb: 2604
https://lobid.org/collections#vd18: 1666
https://lobid.org/collections#vl-ddbk: 1439
https://lobid.org/collections#s2w-llbdetmold: 1371
https://lobid.org/collections#GBV-1-NEL: 1001
https://lobid.org/collections#s2w-ulbbonndfg: 947
https://lobid.org/collections#Lizenz2009: 913
https://lobid.org/collections#Lizenz2008: 739
https://lobid.org/collections#dawsonera: 669
https://lobid.org/collections#Lizenz2010: 566
https://lobid.org/collections#rez: 546
https://lobid.org/collections#lyell: 475
https://lobid.org/collections#Cont: 443
https://lobid.org/collections#taylor francis: 369
https://lobid.org/collections#Lizenz2011: 321
https://lobid.org/collections#Lizenz2014: 311
https://lobid.org/collections#Lizenz2016: 291
https://lobid.org/collections#Lizenz2012: 226
https://lobid.org/collections#Lizenz2013: 211
https://lobid.org/collections#wbv: 207
https://lobid.org/collections#BeltzLizenz2016: 203
https://lobid.org/collections#BeltzLizenz2017: 187
https://lobid.org/collections#BeltzLizenz2015: 168
https://lobid.org/collections#Lizenz2017: 156
https://lobid.org/collections#thieref: 152
https://lobid.org/collections#mansi: 149
https://lobid.org/collections#Lizenz2018: 141
https://lobid.org/collections#V&RELibraryLizenz2014: 135
https://lobid.org/collections#budri: 134
https://lobid.org/collections#fzo: 131
https://lobid.org/collections#luther: 128
https://lobid.org/collections#elgar: 126
https://lobid.org/collections#Lizenz2015: 123
https://lobid.org/collections#huguenots: 121
https://lobid.org/collections#bloomsbury2016: 118
https://lobid.org/collections#juris: 112
https://lobid.org/collections#bloomsbury2014: 103
https://lobid.org/collections#bloomsbury2015: 101
https://lobid.org/collections#BeltzLizenz2018: 98
https://lobid.org/collections#BeltzLizenz2019: 95
https://lobid.org/collections#BeltzLizenz2014: 92
https://lobid.org/collections#bloomsbury2017: 91
https://lobid.org/collections#bloomsbury2013: 80
https://lobid.org/collections#vogel: 80
https://lobid.org/collections#KohlhammerLizenz2014: 78
https://lobid.org/collections#BeltzLizenz2013: 77
https://lobid.org/collections#smalib: 74
https://lobid.org/collections#pearson: 73
https://lobid.org/collections#mpig: 72
https://lobid.org/collections#Lizenz2019: 69
https://lobid.org/collections#igi global: 68
https://lobid.org/collections#synthesis lectures: 56
https://lobid.org/collections#V&RELibraryLizenz2017: 53
https://lobid.org/collections#MohrSiebeckLizenz2018: 51
https://lobid.org/collections#V&RELibraryLizenz2016: 50
https://lobid.org/collections#WallsteinLizenz2019: 48
https://lobid.org/collections#KohlhammerLizenz2016: 44
https://lobid.org/collections#oso: 39
https://lobid.org/collections#vkal: 39
https://lobid.org/collections#BeltzLizenz2012: 38
https://lobid.org/collections#wageningen: 38
https://lobid.org/collections#KohlhammerLizenz2018: 36
https://lobid.org/collections#WBGLizenz2017: 35
https://lobid.org/collections#KohlhammerLizenz2019: 32
https://lobid.org/collections#learntechlib: 31
https://lobid.org/collections#KohlhammerLizenz2013: 30
https://lobid.org/collections#melanchthon: 30
https://lobid.org/collections#V&RELibraryLizenz2018: 29
https://lobid.org/collections#beofamilien: 28
https://lobid.org/collections#vkv: 26
https://lobid.org/collections#beozivil: 24
https://lobid.org/collections#WallsteinLizenz2018: 23
https://lobid.org/collections#woodhead: 20
https://lobid.org/collections#KohlhammerLizenz2017: 16
https://lobid.org/collections#Lizenz2007: 14
https://lobid.org/collections#WBGLizenz2019: 14
https://lobid.org/collections#chiso: 13
https://lobid.org/collections#cuvillier: 13
https://lobid.org/collections#WBGLizenz2016: 10
https://lobid.org/collections#WBGLizenz2018: 9
https://lobid.org/collections#MPSO: 8
https://lobid.org/collections#hade: 8
https://lobid.org/collections#MohrSiebeckLizenz2013-2015: 7
https://lobid.org/collections#MohrSiebeckLizenz2019: 5
https://lobid.org/collections#caso: 5
https://lobid.org/collections#obp: 5
https://lobid.org/collections#Logos: 4
https://lobid.org/collections#Lizenz2005: 3
https://lobid.org/collections#beosteuer: 3
https://lobid.org/collections#iorm: 3
https://lobid.org/collections#uncso: 3
https://lobid.org/collections#TTP-MCE: 2
https://lobid.org/collections#asmi: 2
https://lobid.org/collections#beoarbeit: 2
https://lobid.org/collections#edso: 2
https://lobid.org/collections#minnso: 2
https://lobid.org/collections#misso: 2
https://lobid.org/collections#mso: 2
https://lobid.org/collections#s2w-hsspadmindest: 2
https://lobid.org/collections#KohlhammerLizenz2015: 1
https://lobid.org/collections#Lizenz2001: 1
https://lobid.org/collections#Lizenz2004: 1
https://lobid.org/collections#Lizenz2006: 1
https://lobid.org/collections#NNg: 1
https://lobid.org/collections#kenso: 1
https://lobid.org/collections#palgraveoa: 1

by doing:

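# Terms aggregation over inCollection.id; "size": 0 suppresses the hits so only buckets are returned.
# The Content-Type header is required by Elasticsearch 6 and later.
curl -XGET -H 'Content-Type: application/json' 'http://weywot3.hbz-nrw.de:9200/resources/_search?q=inCollection.id:*collections*&pretty=true' -d '
{
  "size": 0,
  "aggs": {
    "aggs1": {
      "terms": {
        "field": "inCollection.id",
        "size": 11350
      }
    }
  }
}
' | paste - - | grep -v "{" | grep -v "}" | sed 's#.*"key" : "##g' | sed 's#\(.*\)",.*"doc_count" \(.*\)#\1\2#g' | grep collections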
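
The paste/grep/sed pipeline above depends on the exact pretty-printed layout of the response. As an alternative, here is a minimal Python sketch that parses the JSON instead. The sample is hand-written in the shape of a terms aggregation response, with counts copied from the list above; a real response contains further fields, and the function name `bucket_lines` is an assumption.

```python
import json

# Hand-written sample in the shape of a terms aggregation response;
# the three buckets are copied from the list above.
SAMPLE_RESPONSE = """
{
  "aggregations": {
    "aggs1": {
      "buckets": [
        {"key": "https://lobid.org/collections#NLZ", "doc_count": 491824},
        {"key": "https://lobid.org/collections#ldd", "doc_count": 132898},
        {"key": "https://lobid.org/collections#cuvillier", "doc_count": 13}
      ]
    }
  }
}
"""


def bucket_lines(response_text, agg_name="aggs1", needle="collections"):
    """Format each bucket whose key contains `needle` as 'key: count'."""
    response = json.loads(response_text)
    buckets = response["aggregations"][agg_name]["buckets"]
    return [f'{b["key"]}: {b["doc_count"]}' for b in buckets if needle in b["key"]]


lines = bucket_lines(SAMPLE_RESPONSE)
print("\n".join(lines))
```

To use it with live data, the body returned by the curl call could be passed to `bucket_lines` in place of the sample.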

acka47 commented 4 years ago

Thanks @dr0i, I will look into adding some more labels to the labels.json.

acka47 commented 4 years ago

I don't like the URI with a space in it: "https://lobid.org/collections#synthesis lectures" Could you just remove spaces during the transformation, @dr0i?

Also https://lobid.org/collections#taylor francis.

acka47 commented 4 years ago

There are also some with & in them, e.g. https://lobid.org/collections#V&RELibraryLizenz2014. It is probably no problem after the hash, but we might remove those as well...
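
To compare the options, here is a minimal Python sketch of two ways to build the URI from a raw collection ID: removing spaces, as suggested above, or percent-encoding the ID. Neither is confirmed as what the transformation does; the function names are hypothetical.

```python
from urllib.parse import quote

BASE = "https://lobid.org/collections#"


def strip_spaces(collection_id):
    """Option 1: drop spaces from the ID, keep everything else."""
    return BASE + collection_id.replace(" ", "")


def percent_encode(collection_id):
    """Option 2: percent-encode every reserved character in the ID."""
    return BASE + quote(collection_id, safe="")


for cid in ["synthesis lectures", "taylor francis", "V&RELibraryLizenz2014"]:
    print(strip_spaces(cid), percent_encode(cid))
```

Stripping spaces keeps the URIs readable but changes the ID, so two IDs that differ only in spacing would collide. Percent-encoding is reversible but yields fragments like `taylor%20francis`.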

dr0i commented 4 years ago

Is this really necessary? Using hash URIs is good because, as you've noted, these URLs don't cause any problems (at least not in indexing, displaying, querying, or (not) resolving). Also, we would have a "natural" way of retrieving/building these URLs. Maybe we should concentrate on real use cases. Are there any?

hagbeck commented 4 years ago

It seems to be OK. I think nobody can verify the counts, so we will see in practice whether it's complete. Our examples are OK.

Many thanks!

acka47 commented 4 years ago

Thanks for the feedback, @hagbeck . We will close this issue as soon as https://github.com/hbz/lobid-resources/pull/1062 is deployed.

acka47 commented 4 years ago

Closing, as #1062 is deployed to production.