Canonical Tags: Subjects to become 1st class objects in metamodel

tfmorris commented 4 years ago

Subjects are currently treated as strings, with light normalization to coalesce similar strings, which limits our flexibility to do things like support aliases, multiple languages, metadata such as descriptions, links to Wikidata, etc.

Proposal & Constraints

Subjects should be first class objects with a set of attributes including:

key
preferred label (one per language)
description (one per language)
aliases (multiple per language)
external identifier(s) - Wikidata to start, perhaps others like FAST
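
For illustration only, the attribute list above could be sketched as a Python dataclass. The class name, field names, key format, and the `label_for` helper are all assumptions made for this sketch, not part of any existing Open Library schema.

```python
from dataclasses import dataclass, field


@dataclass
class Subject:
    """Hypothetical first-class subject record; all names are illustrative."""

    key: str
    # One preferred label and one description per language code.
    labels: dict[str, str] = field(default_factory=dict)
    descriptions: dict[str, str] = field(default_factory=dict)
    # Any number of aliases per language code.
    aliases: dict[str, list[str]] = field(default_factory=dict)
    # External identifiers keyed by scheme, e.g. "wikidata" or "fast".
    identifiers: dict[str, str] = field(default_factory=dict)

    def label_for(self, lang: str, fallback: str = "en") -> str:
        """Return the label in `lang`, else in `fallback`, else the bare key."""
        return self.labels.get(lang) or self.labels.get(fallback) or self.key


history = Subject(
    key="/subjects/OL1S",  # made-up key format
    labels={"en": "History", "fr": "Histoire"},
    aliases={"en": ["Hist."]},
    identifiers={"wikidata": "Q000"},  # placeholder, not a real QID
)
```

With this shape, `history.label_for("fr")` gives the French label, and an unsupported language falls back to English.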

Component Updates

Change importer to look up using subject labels and aliases and return the subject key to be stored.
Change subject display on works, etc pages to use preferred label in the user's preferred language
Change subject page to include description and aliases as well as preferred label. Allow editing of these elements.
Add multilingual label, alias, & description editing (ie for languages other than the current UI language)
Add subject merge (for the inevitable duplicates which will occur)
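
A minimal sketch of the importer lookup described in the first item, assuming an in-memory index and a simple normalization (case folding, accent stripping, whitespace collapsing). `SubjectIndex`, `normalize`, and the key format are hypothetical names; the real importer and its normalization rules may differ.

```python
import unicodedata
from typing import Optional


def normalize(text: str) -> str:
    """Case-fold, strip combining accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_marks.casefold().split())


class SubjectIndex:
    """Maps every known label and alias to a subject key."""

    def __init__(self) -> None:
        self._by_name: dict[str, str] = {}

    def add(self, key: str, names: list[str]) -> None:
        """Register all labels and aliases of one subject."""
        for name in names:
            self._by_name[normalize(name)] = key

    def lookup(self, raw: str) -> Optional[str]:
        """Return the subject key for an imported string, or None if unknown."""
        return self._by_name.get(normalize(raw))


index = SubjectIndex()
index.add("/subjects/OL1S", ["History", "Histoire"])
```

The importer would store the returned key on the work instead of the raw string; what to do on a miss (create a new subject, queue for review) is left open here.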

Additional context

Traditionally library cataloging standards have used pre-coordinated subjects like "U.S. History -- World War II -- 1945" (made up, perhaps invalid, example) which we split apart into constituent elements during import, similar to FAST. The working assumption is that we'll continue to do that, but just making the assumption explicit here.
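
As a rough sketch of that splitting step, assuming the heading arrives in its display form with "--" separators (actual MARC records carry the parts in separate subfields, so a real importer would not need to parse a string). The function name is hypothetical.

```python
def split_precoordinated(heading: str) -> list[str]:
    """Split a pre-coordinated heading into its constituent elements."""
    parts = [part.strip() for part in heading.split("--")]
    # Drop empty pieces left by stray or doubled separators.
    return [part for part in parts if part]


elements = split_precoordinated("U.S. History -- World War II -- 1945")
```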

Stakeholders

⚠️ EDIT by @mekarpeles: Supplanted by #7904

xayhewalo commented 4 years ago

@hornc I added your personal label as I thought it was relevant.

LeadSongDog commented 4 years ago

@tfmorris Just spitballing here, but as a transitional step, will we not need to have support for both the existing free-form and whatever structured form is chosen?

tfmorris commented 4 years ago

will we not need to have support for both the existing free-form and whatever structured form is chosen?

No. My expectation is that we'll convert everything to structured form at once, but with perhaps imperfect resolution/merging of duplicates which will improve over time. ie we might have two different subjects with labels of "History" and "Histoire" but over time they'll get consolidated together into a single object (with redirects for the former merged subjects).
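
One way to picture the consolidation with redirects, as a hedged sketch: `SubjectStore` and its dict layout are invented for this example, and conflict handling is reduced to "the surviving subject's label wins".

```python
class SubjectStore:
    """Toy in-memory store illustrating merges that leave redirects behind."""

    def __init__(self) -> None:
        self.subjects: dict[str, dict] = {}
        self.redirects: dict[str, str] = {}

    def resolve(self, key: str) -> str:
        """Follow redirects until reaching a live subject key."""
        while key in self.redirects:
            key = self.redirects[key]
        return key

    def merge(self, duplicate: str, survivor: str) -> str:
        """Fold `duplicate` into `survivor` and redirect the old key."""
        duplicate, survivor = self.resolve(duplicate), self.resolve(survivor)
        if duplicate == survivor:
            return survivor
        old = self.subjects.pop(duplicate)
        target = self.subjects[survivor]
        for lang, label in old["labels"].items():
            # Keep the survivor's label when both define the same language.
            target["labels"].setdefault(lang, label)
        self.redirects[duplicate] = survivor
        return survivor


store = SubjectStore()
store.subjects["/subjects/OL1S"] = {"labels": {"en": "History"}}
store.subjects["/subjects/OL2S"] = {"labels": {"fr": "Histoire"}}
store.merge("/subjects/OL2S", "/subjects/OL1S")
```

Works that still reference the merged key keep resolving, because `resolve` follows the redirect to the surviving subject.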

LeadSongDog commented 4 years ago

So then, what happens in the many cases where there's no structured form clearly equivalent to the old free-form? Do we have a catch-all?

tfmorris commented 4 years ago

There's no such case. See "everything" in my previous reply.

mekarpeles commented 1 year ago

In 2023 our intention is to start the Canonical Tags project whose project kickoff & proposal is outlined here: https://docs.google.com/document/d/1zrZAXgk2GEZRWb0D8tsrgaPzX4KdXHVt1s6ZQ4wUHLI/edit#

Details

At first, works will continue using the existing subjects field to reference a list of strings of the form subject_name or type:value.
We will create a new infogami type called "Tags" (OL123T) which at their core have an internationalized name (e.g. {"eng": "Fantasy"}) and a type (e.g. subject, award, genre, content_warning, etc). For now, we will "progressively enhance" our existing subject system by creating a few of these new Tags and wiring them up against existing subject strings.
On /subject pages, the subject_name will be used to fetch a Tag from infobase (if such a tag exists). This rich Tag will provide internationalization, related tags, and additional metadata which can be curated by librarians and used to also enrich the /subject page UI.
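
A minimal sketch of that lookup, assuming the `type:value` prefixes are `place`, `time`, and `person` and that a plain dict stands in for the infobase query. `KNOWN_TYPES`, `parse_subject`, and `find_tag` are hypothetical names, not existing Open Library functions.

```python
from typing import Optional

# Assumed prefixes for typed subject strings; anything else is a plain subject.
KNOWN_TYPES = {"place", "time", "person"}


def parse_subject(raw: str) -> tuple[str, str]:
    """Split 'type:value' into (type, value); default the type to 'subject'."""
    prefix, sep, value = raw.partition(":")
    if sep and prefix in KNOWN_TYPES:
        return prefix, value
    return "subject", raw


def find_tag(raw: str, tags: dict[tuple[str, str], dict]) -> Optional[dict]:
    """Return the Tag document matching a work's subject string, if any."""
    tag_type, name = parse_subject(raw)
    return tags.get((tag_type, name.strip().lower()))


# Stand-in for infobase: Tag documents keyed by (type, lowercased name).
tags = {("subject", "fantasy"): {"key": "/tag/OL1T", "name": {"eng": "Fantasy"}}}
```

A subject page would call `find_tag` with its subject_name and fall back to today's string-only rendering when it returns `None`.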

I'm aware @tfmorris would prefer us moving directly to a system where, e.g., a work references a list of Tags (as opposed to strings). My interest is in mitigating risk and turning this into a small, integrable piece (similar to what @LeadSongDog describes) which can be rapidly prototyped and tested. Once we have confidence in the approach and the problem we're solving, we can invest more deeply in reconfiguring solr, updating all the works in infogami, and every piece of template + backend code which touches subjects, and all the additional clean up work which will eventually be required to make this switch. This is the direction I prefer, if I am going to be accountable for us successfully staffing + hitting milestones for this effort. We ultimately want the same outcome (switching from lists of strings to lists of 1st class Tag references).

Related issues:

#7486 & #3233 -- enable bulk subject assignment & cleanup @mheiman
#1896 @cdrini
#65 -- Fix duplicated subjects @cdrini
[ ] Create new 1st class Tag type; e.g. OL123T (in prod + local)
[ ] Formalize a Tag schema which supports i18n to add to our schemata
[ ] Create mapper to resolve a Work.subject string → Tag document (if one exists)
[ ] Enhance a subject page to fetch/use auxiliary Tag data (pilot K-12 collection)
[ ] A human UI for editing Tags @jimchamp
[ ] Use subjects/tags to support #7416 @jimchamp
[ ] Eventually, migrate #5779 community review tags to canonical tags

tfmorris commented 1 year ago

I'm aware @tfmorris would prefer us moving directly to a system

Since this is the first sign of progress I've seen and I wasn't aware that the design was happening in the back rooms, I can't really say. I've added the Google doc with the design/plan to my list to review.

I will note, however, that a search for the terms MARC, BIBFRAME, FAST, LOD, Linked Data, Linked Open Data all turned up zero hits, so I'm a little concerned about interoperability with the Real World.

mekarpeles commented 1 year ago

Plan looks something like:

Before building anything sophisticated, I think a few things would be helpful:

[ ] Create new infogami Tag type OL…T (in prod + local dev environment) (essentially a json doc)
- may require some work from me, @cdrini, and @jimchamp (to make sure this type exists on dev instances)
[ ] Creating an experimental Tag document instance (e.g. for a collection, as a prototype) we can test
[ ] Seeing if we can synthesize a collection based on this collection Tag -- e.g. if someone goes to a /collections/<foo> and the page doesn't exist, then the controller will render a collections page based on the data in <foo> Tag.
- As per October focus (helping researchers) we may want to pilot an enhanced K-12 collection
- [ ] Create mapper to resolve a Work.subject string → Tag document (if one exists)
[ ] Trying to build a simple extension on the ILE so we can associate works with this tag by adding its Tag.name to a work's subjects list.
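
The collection fallback in the third item could look roughly like the sketch below. It assumes hand-built pages and Tag documents are both available as dicts keyed by name, and that a collection Tag carries `title` and `queries` fields; the function name and return shape are invented for illustration.

```python
from typing import Optional


def render_collection(name: str, pages: dict, tags: dict) -> Optional[dict]:
    """Return page data for /collections/<name>, or None for a 404."""
    if name in pages:
        # A hand-built page exists, so it takes precedence.
        return pages[name]
    tag = tags.get(name)
    if tag is not None and tag.get("tag_type") == "collection":
        # No page exists: synthesize one from the Tag's data.
        return {
            "title": tag["title"],
            "carousels": tag.get("queries", []),
            "synthesized": True,
        }
    return None


pages = {"classics": {"title": {"en": "Classics"}, "synthesized": False}}
tags = {
    "k-12": {
        "tag_type": "collection",
        "title": {"en": "K-12"},
        "queries": [{"title": {"en": "Recent release"}, "query": "..."}],
    }
}
```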

T.B.D.

[ ] Formalize schema for Tags

tfmorris commented 1 year ago

Subjects, Collections, Awards, and Censorship Warnings (e.g. NSFW) are all VERY different things. Attempting to smoosh them into a single schema creates unnecessary complexity and makes them more difficult to query.

Collections are simple sets (unordered) or lists (ordered) which are manually curated or dynamically created using search criteria.
Awards typically have a sponsoring organization and a date. They may be given to a contributor (author, illustrator, etc) or work.
Censorship categories depend on the geographical & political regime as well as the age, gender, religion, etc of the viewer. They are probably very difficult to model.
Subjects are very well developed and known in the library community. They are explicitly assigned by librarians in both MARC and BIBFRAME cataloging formats. Modern catalog records even include URLs for subjects which act as strong identifiers. Subjects are organized into taxonomic hierarchies giving them structure.

There's a vast trove of professionally curated subject assignments in the MARC library records which is currently greatly underutilized (e.g. no import of FAST URLs). The proposal makes no mention of how the MARC importer will be affected or how this interacts with BIBFRAME data which libraries are already trialing in production.

I also see no mention of how the existing subjects will be deduplicated and matched with the new Thing.Tag.Subject subjects. I'm worried that perhaps this is seen as an entirely manual process, which isn't scalable at all.

I expect this is a fait accompli which isn't open for community input, so I'll stop there.

mekarpeles commented 1 year ago

@tfmorris why do you presume fait accompli? We've had no less than 3 community calls on this topic + we're open to discussion here as well.

You're right, there are lots of data sources we can use to get data. One thing blocking importing is having a place to put data.

Open Library currently has works, editions, authors, lists, and several other types. These are all APIs to maintain. Today we have a system that works ~well for subjects in that:

one can edit any work's subjects and it gets indexed in solr
subjects can be any string
subject pages are created dynamically based on subject membership

There are also deficiencies:

a subject is just a string, lots of dupes
subject pages encode very little data other than a string (and works/authors which subscribe to this string). This leads librarians to hand-code /collections pages.
Limited ability to support multiple types (currently just subjects -- which are misused -- and places, times, people)

I feel you're right that collections, subjects, moderation, and awards all have different schema. They could all be constructed as independent entities with their own functions, APIs, solr integrations, import pipelines, edit + display UIs. This seems like it could be difficult to maintain when really what's important is that each of these types has mutually consistent schema. I'm imagining they all "inherit" from type Tag and share a schema according to their sub type (e.g. subject, collection, moderation, award, etc). An important aspect to me, from an engineering & implementation perspective, is that they share the same infogami API and we're not creating more types than we need and creating more exposure than we can cover.
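
To make the "inherit from type Tag" idea concrete, here is one possible sketch using dataclasses. The class names, the extra award fields, and the `tag_from_doc` helper are assumptions for illustration; no schema has been formalized in this thread.

```python
from dataclasses import dataclass


@dataclass
class Tag:
    """Shared base: every tag has a key and an internationalized name."""

    key: str
    name: dict[str, str]
    tag_type = "tag"  # plain class attribute, not a dataclass field


@dataclass
class SubjectTag(Tag):
    tag_type = "subject"


@dataclass
class AwardTag(Tag):
    # Extra fields that only make sense for awards.
    sponsor: str
    year: int
    tag_type = "award"


TAG_CLASSES = {cls.tag_type: cls for cls in (SubjectTag, AwardTag)}


def tag_from_doc(doc: dict) -> Tag:
    """Build the right Tag subclass from a stored document's tag_type."""
    fields = dict(doc)
    cls = TAG_CLASSES[fields.pop("tag_type")]
    return cls(**fields)


award = tag_from_doc({
    "key": "/tag/OL2T",
    "tag_type": "award",
    "name": {"en": "Example Prize"},  # made-up award
    "sponsor": "Example Society",
    "year": 2000,
})
```

All subtypes go through the same storage path, while a document missing a subtype's required fields fails at construction time.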

You make a point that forcing tags to use the same schema increases complexity and reduces the ability to query. Wouldn't having tags all in one bucket decrease the complexity of the engineering even if it increases the complexity of tags itself? I agree it does transfer complexity from the system to the patron -- and given our resource constraints, to me this is an advantage in this specific situation.

With respect to querying, I agree that the combo of tags with types may make it more difficult to query in infogami, but I imagine the primary use case (i.e. how subjects are used today) is querying via solr and there are any number of optimizations we can make to aggregate a work's tags and usefully bucket them within solr. I intend for infogami to be the storage mechanism and keeping tags as simple, interoperable, and as extensible as possible architecturally I believe is to our benefit.
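
As one hedged example of such bucketing, a work's tags could be grouped by type into separate multi-valued fields before indexing. The `<type>_key` field naming is invented here and is not the existing solr schema.

```python
from collections import defaultdict


def bucket_tags(tags: list[dict]) -> dict[str, list[str]]:
    """Group a work's Tag keys into one multi-valued field per tag type."""
    buckets: dict[str, list[str]] = defaultdict(list)
    for tag in tags:
        buckets[f"{tag['tag_type']}_key"].append(tag["key"])
    return dict(buckets)


solr_fields = bucket_tags([
    {"key": "/tag/OL1T", "tag_type": "subject"},
    {"key": "/tag/OL2T", "tag_type": "award"},
    {"key": "/tag/OL33T", "tag_type": "subject"},
])
```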

mekarpeles commented 1 year ago

Example generic Tag document could look something like:

{
  key: "/tag/OL1T",
  tag_type: "subject",
  tag_name: "fantasy",
  queries: [{
    "title": {"en": "Recent release"},
    "query": "...&sort=newest",
  }],
  exclusion: "title: ...",
  title: {
    "en": "Fantasy",
    "fr": "Fantaisie"
  },
  description: {
    "en": "..."
  },
  header_img: "https://media.istockphoto.com/id/1070683626/photo/magical-old-book-with-sparkles.jpg",
  children: ["/tag/OL2T", "/tag/OL33T"],
  neighbors: [],
  ... // additional schema related to tag_type
}
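
To show how the `children` field might be used, the sketch below walks a tag's hierarchy depth-first with a guard against cycles. The function name and the dict of documents are assumptions; only the `children` field comes from the example above.

```python
def descendants(key: str, docs: dict[str, dict]) -> list[str]:
    """Return all tags below `key`, depth-first, skipping repeats and cycles."""
    seen: set[str] = set()
    order: list[str] = []
    stack = list(reversed(docs.get(key, {}).get("children", [])))
    while stack:
        child = stack.pop()
        if child == key or child in seen:
            continue  # already visited, or a cycle back to the root
        seen.add(child)
        order.append(child)
        stack.extend(reversed(docs.get(child, {}).get("children", [])))
    return order


docs = {
    "/tag/OL1T": {"children": ["/tag/OL2T", "/tag/OL33T"]},
    "/tag/OL2T": {"children": ["/tag/OL4T"]},
    "/tag/OL4T": {"children": ["/tag/OL1T"]},  # deliberate cycle
}
```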

mekarpeles commented 1 year ago

I'll remain the lead for this issue but am going to mark @JaydenTeoh as the assignee as they've been making great progress (keep up the great work!)

mekarpeles commented 1 year ago

Related: #65

mekarpeles commented 1 year ago

Supplanted by #7904

tfmorris commented 1 year ago

In the case of duplicates most projects keep the oldest issue so that provenance, discussion, and age are preserved. This project seems to continually replace old, perfectly valid issues with new ones. Why is that? Does someone's bonus depend on how long tickets have been open?

It looks like my assumption of "fait accompli" was accurate. No response to questions about how this relates to BIBFRAME, FAST, LCSH, Wikidata, MARC, or anything else in the real world. No response to comments in design document that was linked above. No attention being paid to centuries of library cataloging practice or the directions that the library community (the only source of high quality metadata for OL) is going.

For the record, my request for Subjects to become first class objects in OpenLibrary was not "completed" as the issue status seems to indicate, but instead roundly rejected. "Tags" may become useful some day and it may even be possible to build Subjects as first class objects on top of them, but there is no plan or path which shows how (or even if) that is going to happen. I'm very disappointed.

jimchamp commented 1 year ago

The insults will surely help your case, Tom.

mekarpeles commented 1 year ago

@tfmorris, we simply have 2 issues that were similar, one was closed, one remains open, and both are linked together. You're right that there could have been a better way of preserving provenance -- but hey, we could be happy that someone cares enough to go through 700+ issues and try their best to dedupe at all. Furthermore, a section was explicitly added to the planning doc in response to updates we're planning for October: https://docs.google.com/document/d/1zrZAXgk2GEZRWb0D8tsrgaPzX4KdXHVt1s6ZQ4wUHLI/edit#heading=h.o9utr3tyh8k. We have also made progress on several elements of this plan through GSoC this year. As well as anyone who has been involved with the project for several years, you know we're a small team doing the best we can and sometimes an older issue gets closed instead of a newer one and it's not done out of malice but rather an attempt to get things organized so we can make forward progress towards efforts that I know you care about.

In response to your questions about BIBFRAME, FAST, LCSH, Wikidata, MARC -- we continue to import classifications from these sources and also intend to add links to sources within Tag documents (e.g. "this Tag is a classification from LCCN and here is its number").

We simply haven't gotten to that step of fully defining what is included in a Tag document because there are a lot of opportunities for integrating pieces we have confidence will be required. Much of the schema we imagine may be impacted by #7833 as we are interviewing 15 learners and educators to understand what types of affordances they may want beyond what we have in our current subject pages.

We've had dozens of calls on this topic, spanning staff, design team, engineering, and librarians, and at least 10 different people have been involved in weighing in, as have you (to the best of my ability) over github. This is one of many issues and it would be nice if it were appreciated that we're doing the best we're able, and also that it is hard to fully respond to the feedback folks have over github, which is why we do weekly community calls which I've tried hard to include you in.

Yes, I'm not perfect and will continue to make mistakes. I understand how frustrating advising and contributing to Open Library must feel under these constraints. All I can do is continue to be open to collaborating to the best of my ability and it would be nice if we could achieve that with good will and the compassion of two contributors who both care deeply about the project and doing right by our patrons.
