scientist-softserv / britishlibrary

(Kew) keyword controlled vocabularies #123

Open orangewolf opened 1 year ago

orangewolf commented 1 year ago

3) From the viewpoint of a library cataloguer, a controlled vocabulary for keywords would be lovely, even if we have to add a selection of them in addition to those created by the journals. At the moment it is a bit of a mess, with keywords +/- capital letters, +/- plurals, and +/- spaces before and after the keyword, just to name a few.

Waiting for AHRC-funded report on subject indexing/controlled keywords due 15/7/22 before scoping this ticket more thoroughly.

grahamjevon commented 1 year ago

When SoftServ asked for more details about this development in last week's sprint meeting (21/03/2023), we posited that this ticket was either intended to focus on:

  1. A way to standardise the free text "keywords" field, which has no control (e.g. three different works might use the following 3 different keywords: "newspaper", "newspapers", "Newspapers")
  2. Integration with established vocabs (e.g. Library of Congress, FAST headings).
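As a rough illustration of option 1, the sketch below normalises case and whitespace and groups raw keywords by their normalised form so that variants can be reviewed together. The function names are hypothetical and this is not existing application code. Note that it deliberately does not merge singular and plural forms ("newspaper" vs "newspapers"), which would need a stemming rule or a manually curated mapping.

```python
import re


def normalise_keyword(raw: str) -> str:
    """Trim, collapse internal whitespace, and case-fold a free-text keyword."""
    collapsed = re.sub(r"\s+", " ", raw.strip())
    return collapsed.casefold()


def group_variants(keywords):
    """Group raw keywords by normalised form so variants can be reviewed together."""
    groups = {}
    for keyword in keywords:
        groups.setdefault(normalise_keyword(keyword), []).append(keyword)
    return groups


# Case and whitespace variants collapse into one group; the singular form stays separate.
variants = group_variants(["newspaper", "newspapers", "Newspapers", " Newspapers "])
```

Grouping rather than silently rewriting keeps the original values available, so a cataloguer could decide which form should win.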

The July report on subject indexing/controlled values leans mostly towards 2. The report highlights that different tenants use different vocabs, so the first question would be: which vocab(s) should we integrate with? Can we integrate with multiple vocabs? The phrase "even if we have to add a selection of them" in the comment above suggests that integration with multiple vocabs was expected.

grahamjevon commented 1 year ago

Elaboration on 2. Integration with established vocabs (e.g. Library of Congress, FAST headings).

Integrated vocab requirements:

Suggestions for UI

Suggestions for BX importer

Using FAST as an example

In this screenshot of the FAST vocabulary, we would be interested in the term "Newspapers" and the corresponding ID "fst01037111"

Image

Connection to Search Facets

I see two options:

A. Facet per vocabulary (e.g. FAST, LOC, AAT, etc.)

Image

B. Facet per vocab term type (e.g. Subjects, People, Places). This would therefore group terms from different vocabs into a single facet.

Image

I would err towards option B.

UI and BX Question

Similar to the facet options, we have options for metadata fields:

X. Separate fields per vocabulary (e.g. Fast, Library of Congress, Getty AAT, Getty TGN, Geonames)

Image

Image

Y. Vocab terms grouped by term type (People, Places, Subjects). In this example, the place name field would be connected to both the Getty TGN and Geonames vocab lists. If, for example, I selected "Canada", it would pull in the Getty TGN ID and the Geonames ID for Canada.
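One possible shape for option Y is sketched below: a single place label maps to one reference per authority. The class name, the lookup table, and the authority keys are all hypothetical, and the IDs are placeholders rather than real Getty TGN or Geonames identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthorityRef:
    authority: str  # hypothetical key, e.g. "getty_tgn" or "geonames"
    term_id: str    # identifier within that authority


# Hypothetical lookup table. The IDs are placeholders, not real identifiers.
PLACE_TERMS = {
    "Canada": [
        AuthorityRef("getty_tgn", "TGN-ID-PLACEHOLDER"),
        AuthorityRef("geonames", "GEONAMES-ID-PLACEHOLDER"),
    ],
}


def refs_for_place(label):
    """Return every authority reference recorded for a place label."""
    return PLACE_TERMS.get(label, [])


canada_refs = refs_for_place("Canada")
```

In practice the table would be filled from live authority lookups rather than hard-coded, but the stored shape (one label, several authority/ID pairs) would be similar.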

Image

Issues to consider

ShanaLMoore commented 1 year ago

Temporary implementation idea: wire up Questioning Authority for the fields, and use a placeholder YAML file until the client updates us with real values.

slack: https://assaydepot.slack.com/archives/G030UPFBT2S/p1679951751344789

funder name makes an api call: https://github.com/search?q=repo%3Ascientist-softserv%2Fbritishlibrary%20def%20funders&type=code

institution uses set values: https://github.com/scientist-softserv/britishlibrary/pull/340/files

However it seems like the client will want something dynamic requiring API calls, so it'll prob be smart to hold off for now until we settle on clarifications and scope.

EDIT: According to Jill, this work may not be approved for AHRC 2 after all. ref: https://assaydepot.slack.com/archives/C0313NK2LJ0/p1680043474358629

kirkkwang commented 1 year ago

Hi @grahamjevon we have a few questions for this ticket:

In Bulkrax, if you want the ID by itself to populate the term and authority, we would only be able to support one source. If multiple authorities are required then we'd need the id and the authority (eg fst01423814, searchFAST). Do you anticipate needing multiple authorities?

Do the categories and authorities (places, subjects, people; LoC, FAST, etc) need to vary by tenant? Are these likely to change (eg: add more categories? or add more authorities?)

If we're pulling from multiple authorities, do the terms need to be unique? Can you add "Newspaper" from more than one authority? (LoC vs FAST). If terms are not unique, it seems we would want to display the term plus authority/id to differentiate.
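To make the last question concrete, here is a minimal sketch of one way to label terms so that the authority and ID are appended only when the same label appears more than once. The function name and tuple layout are assumptions for illustration. Only "fst01037111" comes from this thread; the other IDs are placeholders.

```python
from collections import Counter


def display_labels(terms):
    """terms is a list of (label, authority, term_id) tuples.

    The authority and ID are appended only when a label occurs more than once.
    """
    counts = Counter(label for label, _, _ in terms)
    labels = []
    for label, authority, term_id in terms:
        if counts[label] > 1:
            labels.append(f"{label} ({authority}: {term_id})")
        else:
            labels.append(label)
    return labels


terms = [
    ("Newspapers", "FAST", "fst01037111"),
    ("Newspapers", "LoC", "LOC-ID-PLACEHOLDER"),   # placeholder ID
    ("Periodicals", "FAST", "FAST-ID-PLACEHOLDER"),  # placeholder ID
]
labels = display_labels(terms)
```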

grahamjevon commented 1 year ago

Hi @kirkkwang, I notice that @jillpe has queried whether this is part of the current batch of AHRC funded work. Whatever the outcome of that query, it seems worth me answering these queries as this will probably proceed at some point. The first thing to say is that I think all options are open for this ticket. We don't yet have a fixed scope in mind.

In Bulkrax, if you want the ID by itself to populate the term and authority, we would only be able to support one source. If multiple authorities are required then we'd need the id and the authority (eg fst01423814, searchFAST). Do you anticipate needing multiple authorities?

I've gone into this in a bit more detail below. I think we would want more than one authority. But perhaps we might only have one authority per category (e.g. one subject authority, one place authority). But this might get more complicated if any of the tenants want to use a specialist authority. I don't think specifying the ID and the authority would be a problem. But perhaps this could be eased by having a separate column in BX and a separate field in the UI for each authority (e.g. a FAST column/field and a Getty TGN column/field).
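A minimal sketch of the column-per-authority idea, assuming an import CSV with one column per authority and a "|" delimiter for multiple IDs in a cell. The column names, the delimiter, and the helper are all hypothetical and do not reflect how Bulkrax is configured in this repository.

```python
import csv
import io

# Hypothetical mapping of CSV column name to authority label.
AUTHORITY_COLUMNS = {"fast": "FAST", "getty_tgn": "Getty TGN"}


def authority_ids(row, delimiter="|"):
    """Collect (authority, term_id) pairs from the per-authority columns of one row."""
    pairs = []
    for column, authority in AUTHORITY_COLUMNS.items():
        cell = row.get(column) or ""
        for term_id in cell.split(delimiter):
            term_id = term_id.strip()
            if term_id:  # skip empty cells and stray delimiters
                pairs.append((authority, term_id))
    return pairs


# Sample row: two FAST IDs and an empty Getty TGN cell.
sample = "title,fast,getty_tgn\nExample work,fst01037111|fst01423814,\n"
rows = list(csv.DictReader(io.StringIO(sample)))
pairs = authority_ids(rows[0])
```

Because the authority is implied by the column, the person preparing the CSV only has to supply IDs, which is the easing described above.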

Do the categories and authorities (places, subjects, people; LoC, FAST, etc) need to vary by tenant? Are these likely to change (eg: add more categories? or add more authorities?)

Good question. The July report certainly suggests that different tenants use (or could use) different authorities. But there is also some overlap. I think that FAST and the four Getty databases will have the broadest reach across multiple tenants. Although I used LoC as an example, I do not expect us to want to integrate with LoC. I notice that the tenant account settings already include a geonames username field. So I wonder if some level of integration with geonames already exists? Ultimately, I think we will want to reach out to the tenants to see which authorities they would value most.

While different tenants may utilise different authorities, it might be okay for all tenants to have access to all integrated authorities. They can just ignore the ones they choose not to use. That would seem like the easiest approach. But perhaps there are things I have not considered.

The most obvious area I can think of where a degree of tailoring per tenant may be required is in the context of the search facets. If a tenant does not use a particular authority/category, they won't want an empty search facet.
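One way to avoid empty facets without per-tenant configuration is to show a facet only when at least one of the tenant's works has a term in that category. The sketch below assumes works are plain dictionaries keyed by category. The function name and the category keys are hypothetical, and a real implementation would more likely rely on the search index's facet counts.

```python
def visible_facets(works, categories=("subjects", "people", "places")):
    """Return {category: {term: count}} for categories with at least one term."""
    facets = {}
    for category in categories:
        counts = {}
        for work in works:
            for term in work.get(category, []):
                counts[term] = counts.get(term, 0) + 1
        if counts:  # categories this tenant never uses are left out
            facets[category] = counts
    return facets


tenant_works = [
    {"subjects": ["Newspapers"]},
    {"subjects": ["Newspapers"], "places": []},
]
facets = visible_facets(tenant_works)
```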

While it is always possible that we may want to add more categories or authorities in the future, I think the intention would be to establish something stable, with additional authorities or categories only sought if a tenant makes a major change to their practices or if we onboard a new partner that requires a different authority/category.

If we're pulling from multiple authorities, do the terms need to be unique? Can you add "Newspaper" from more than one authority? (LoC vs FAST). If terms are not unique, it seems we would want to display the term plus authority/id to differentiate.

This is possibly the area that requires most thought on our part. But unique is always best. For that reason, I am inclined to suspect that we will want to avoid pulling from multiple authorities with crossover (e.g. we will choose to integrate with either geonames or Getty TGN for place authorities, but we probably won't integrate with both). For subjects we would probably prioritise FAST. However, where this gets complicated is when we enter the world of specialist authorities (e.g. The International Plant Names Index - which Kew might want to use).

I think the next step for us would be to speak to the other tenants and find out what they want, need, or would use if it were available. Those answers will start to give us some clarity.

grahamjevon commented 1 year ago

@kirkkwang An additional thought on this issue, but relating to the free text keywords (as something distinct from the authorities discussed above). I think it is important that keywords remain free text, to give users complete freedom to add any keyword they like. But I wonder if it would be possible for the free text field to autofill so that users can be guided to select from an existing keyword. For example, if I start typing "dig..." the autofill will start to show a dropdown menu of existing keywords such as "digital scholarship" and "digital humanities". I could then select one of these, safe in the knowledge that I am selecting an existing keyword and therefore making a genuine link between multiple works. Or I could choose to continue writing something new, such as "digital cameras".
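A minimal sketch of the autofill behaviour described above, assuming the existing keywords are available as a simple list. The function name and the `limit` parameter are hypothetical. Matching is a case-insensitive prefix match, and the user remains free to ignore the suggestions and type something new.

```python
def suggest_keywords(prefix, existing, limit=10):
    """Return existing keywords that start with the typed prefix, ignoring case."""
    needle = prefix.strip().casefold()
    if not needle:
        return []
    matches = sorted(
        {keyword for keyword in existing if keyword.casefold().startswith(needle)},
        key=str.casefold,
    )
    return matches[:limit]


existing = ["digital scholarship", "digital humanities", "newspapers"]
suggestions = suggest_keywords("dig", existing)
```

Typing "digital cameras" in full would return no suggestions, so the field would simply accept it as a new keyword.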

kirkkwang commented 1 year ago

Thanks for the answer @grahamjevon, that was helpful. I'll hold until further notice for now, but this is definitely going to be a pretty big lift.