Closed tfmorris closed 1 year ago
@hornc I added your personal label as I thought it was relevant.
@tfmorris Just spitballing here, but as a transitional step, will we not need to have support for both the existing free-form and whatever structured form is chosen?
will we not need to have support for both the existing free-form and whatever structured form is chosen?
No. My expectation is that we'll convert everything to structured form at once, perhaps with imperfect resolution/merging of duplicates at first, which will improve over time. I.e. we might have two different subjects with labels of "History" and "Histoire", but over time they'll get consolidated into a single object (with redirects from the former, merged subjects).
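To make the "merge later, leave redirects" idea concrete, here is a minimal sketch. The `Subject` class, the `/subjects/OL…S` keys, and the in-memory `store` are all hypothetical stand-ins for illustration, not Open Library's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Subject:
    key: str                                              # hypothetical identifier
    labels: Dict[str, str] = field(default_factory=dict)  # language code -> label
    redirect_to: Optional[str] = None                     # set once merged away


def merge_subjects(store: Dict[str, Subject], keep: str, dupe: str) -> None:
    """Fold `dupe` into `keep` and leave a redirect behind."""
    target, source = store[keep], store[dupe]
    for lang, label in source.labels.items():
        # Existing labels on the surviving record win over the duplicate's.
        target.labels.setdefault(lang, label)
    source.labels = {}
    source.redirect_to = keep


def resolve(store: Dict[str, Subject], key: str) -> Subject:
    """Follow redirects until a live (non-redirected) record is reached."""
    seen = set()
    while store[key].redirect_to is not None:
        if key in seen:
            raise ValueError(f"redirect loop at {key}")
        seen.add(key)
        key = store[key].redirect_to
    return store[key]


store = {
    "/subjects/OL1S": Subject("/subjects/OL1S", {"en": "History"}),
    "/subjects/OL2S": Subject("/subjects/OL2S", {"fr": "Histoire"}),
}
merge_subjects(store, keep="/subjects/OL1S", dupe="/subjects/OL2S")
merged = resolve(store, "/subjects/OL2S")
```

Anything that still points at the old key keeps working, because `resolve` lands on the surviving record.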
So then, what happens in the many cases where there's no structured form clearly equivalent to the old free-form? Do we have a catch-all?
There's no such case. See "everything" in my previous reply.
In 2023 our intention is to start the Canonical Tags project whose project kickoff & proposal is outlined here: https://docs.google.com/document/d/1zrZAXgk2GEZRWb0D8tsrgaPzX4KdXHVt1s6ZQ4wUHLI/edit#
[…] subjects field to reference a list of strings of the form subject_name or type:value.
[…] OL123T) which at their core have an internationalized name (e.g. {"eng": "Fantasy"}) and a type (e.g. subject, award, genre, content_warning, etc). For now, we will "progressively enhance" our existing subject system by creating a few of these new Tags and wiring them up against existing subject strings.
[…] /subject pages, the subject_name will be used to fetch a Tag from infobase (if such a tag exists). This rich Tag will provide internationalization, related tags, and additional metadata which can be curated by librarians and also used to enrich the /subject page UI.
I'm aware @tfmorris would prefer us moving directly to a system where, e.g., a work references a list of Tags (as opposed to strings). My interest is in mitigating risk and turning this into a small, integratable piece (similar to what @LeadSongDog describes) which can be rapidly prototyped and tested. Once we have confidence in the approach and the problem we're solving, we can invest more deeply in reconfiguring solr, updating all the works in infogami, updating every piece of template + backend code which touches subjects, and doing all the additional clean up work which will eventually be required to make this switch. This is the direction I prefer, if I am going to be accountable for us successfully staffing + hitting milestones for this effort. We ultimately want the same outcome (switching from lists of strings to lists of 1st class Tag references).
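As a rough illustration of the "progressive enhancement" lookup described above: a subject page asks for a Tag by subject_name and falls back to the plain string when none exists. `TAGS_BY_NAME`, `find_tag`, and `subject_page_context` are hypothetical names, and the dict stands in for infobase; this is a sketch under those assumptions, not the real controller code.

```python
from typing import Optional

# Hypothetical stand-in for infobase: subject_name -> Tag document.
TAGS_BY_NAME = {
    "fantasy": {
        "key": "OL123T",
        "type": "subject",
        "name": {"eng": "Fantasy", "fre": "Fantaisie"},
    },
}


def find_tag(subject_name: str) -> Optional[dict]:
    """Return the Tag for a subject string, or None if no Tag exists yet."""
    return TAGS_BY_NAME.get(subject_name.strip().lower())


def subject_page_context(subject_name: str, lang: str = "eng") -> dict:
    """Build page data, enriched by a Tag only when one is available."""
    tag = find_tag(subject_name)
    if tag is None:
        # No Tag yet: the page behaves exactly as it does today.
        return {"title": subject_name, "enriched": False}
    title = tag["name"].get(lang, subject_name)
    return {"title": title, "enriched": True, "tag_key": tag["key"]}
```

Under this shape, works keep their string subjects untouched, and each Tag that gets created only adds to what the page can show.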
[…] OL123T (in prod + local)
I'm aware @tfmorris would prefer us moving directly to a system
Since this is the first sign of progress I've seen and I wasn't aware that the design was happening in the back rooms, I can't really say. I've added the Google doc with the design/plan to my list to review.
I will note, however, that a search for the terms MARC, BIBFRAME, FAST, LOD, Linked Data, Linked Open Data all turned up zero hits, so I'm a little concerned about interoperability with the Real World.
Before building anything sophisticated, I think a few things would be helpful:
[…] OL…T (in prod + local dev environment) (essentially a json doc)
[…] Tag document instance (e.g. for a collection, as a prototype) we can test
[…] Tag -- e.g. if someone goes to a /collections/<foo> and the page doesn't exist, then the controller will render a collections page based on the data in <foo> Tag.
[…] Work.subject string → Tag document (if one exists)
[…] Tag.name to a work's subjects list.
Subjects, Collections, Awards, and Censorship Warnings (e.g. NSFW) are all VERY different things. Attempting to smoosh them into a single schema creates unnecessary complexity and makes them more difficult to query.
There's a vast trove of professionally curated subject assignments in the MARC library records which is currently greatly underutilized (e.g. no import of FAST URLs). The proposal makes no mention of how the MARC importer will be affected or how this interacts with BIBFRAME data which libraries are already trialing in production.
I also see no mention of how the existing subjects will be deduplicated and matched with the new Thing.Tag.Subject subjects. I'm worried that perhaps this is seen as an entirely manual process, which isn't scalable at all.
I expect this is a fait accompli which isn't open for community input, so I'll stop there.
@tfmorris why do you presume fait accompli? We've had no less than 3 community calls on this topic + we're open to discussion here as well.
You're right, there are lots of data sources we can use to get data. One thing blocking importing is having a place to put data.
Open Library currently has works, editions, authors, lists, and several other types. These are all APIs to maintain. Today we have a system that works ~well for subjects in that:
There are also deficiencies:
[…] /collections pages.
I feel you're right that collections, subjects, moderation, and awards all have different schema. They could all be constructed as independent entities with their own functions, APIs, solr integrations, import pipelines, and edit + display UIs. This seems like it could be difficult to maintain when really what's important is that each of these types has a mutually consistent schema. I'm imagining they all "inherit" from type Tag and share a schema according to their sub type (e.g. subject, collection, moderation, award, etc). An important aspect to me, from an engineering & implementation perspective, is that they share the same infogami API, so that we're not creating more types than we need or more exposure than we can cover.
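One way to picture "one Tag type, with schema varying by sub type" is a shared set of common fields plus a small per-type allowance, checked by a single validator. The field names in `EXTRA_FIELDS_BY_TYPE` are purely illustrative assumptions; the real per-type schema has not been defined in this thread.

```python
from typing import List

# Fields every Tag shares, regardless of sub type.
COMMON_FIELDS = {"key", "tag_type", "name"}

# Hypothetical per-type extras, for illustration only.
EXTRA_FIELDS_BY_TYPE = {
    "subject": {"children"},
    "collection": {"queries"},
    "award": {"year"},
    "moderation": {"severity"},
}


def validate_tag(doc: dict) -> List[str]:
    """Return a list of problems; an empty list means the Tag is well formed."""
    errors = [f"missing field: {name}" for name in sorted(COMMON_FIELDS - set(doc))]
    tag_type = doc.get("tag_type")
    if tag_type is not None and tag_type not in EXTRA_FIELDS_BY_TYPE:
        errors.append(f"unknown tag_type: {tag_type}")
        return errors
    allowed = COMMON_FIELDS | EXTRA_FIELDS_BY_TYPE.get(tag_type, set())
    errors += [f"unexpected field: {name}" for name in sorted(set(doc) - allowed)]
    return errors


ok = validate_tag(
    {"key": "/tag/OL1T", "tag_type": "subject", "name": {"en": "Fantasy"}, "children": []}
)
bad = validate_tag(
    {"key": "/tag/OL2T", "tag_type": "award", "name": {"en": "Hugo"}, "queries": []}
)
```

The trade-off raised in the thread shows up directly here: there is one API surface and one validator, but every consumer has to branch on `tag_type`.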
You make a point that forcing tags to use the same schema increases complexity and reduces the ability to query. Wouldn't having tags all in one bucket decrease the complexity of the engineering even if it increases the complexity of tags itself? I agree it does transfer complexity from the system to the patron -- and given our resource constraints, to me this is an advantage in this specific situation.
With respect to querying, I agree that the combo of tags with types may make it more difficult to query in infogami, but I imagine the primary use case (i.e. how subjects are used today) is querying via solr, and there are any number of optimizations we can make to aggregate a work's tags and usefully bucket them within solr. I intend for infogami to be the storage mechanism, and I believe keeping tags as simple, interoperable, and extensible as possible architecturally is to our benefit.
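For example, bucketing a work's tags by type at index time might look roughly like the sketch below. The `<type>_tag_key` field naming is an assumption made for illustration and does not reflect the existing solr schema.

```python
from collections import defaultdict


def bucket_tags_for_solr(tags):
    """Group a work's Tag keys into one multi-valued field per tag_type."""
    buckets = defaultdict(list)
    for tag in tags:
        # Field naming ("<type>_tag_key") is hypothetical, not an existing schema.
        buckets[f"{tag['tag_type']}_tag_key"].append(tag["key"])
    return {name: sorted(keys) for name, keys in buckets.items()}


work_tags = [
    {"key": "/tag/OL2T", "tag_type": "subject"},
    {"key": "/tag/OL1T", "tag_type": "subject"},
    {"key": "/tag/OL9T", "tag_type": "award"},
]
solr_fields = bucket_tags_for_solr(work_tags)
```

With fields shaped like this, a search can filter on one tag type without touching the others, even though all Tags live in one bucket in infogami.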
An example generic Tag document could look something like:
{
  key: "/tag/OL1T",
  tag_type: "subject",
  tag_name: "fantasy",
  queries: [{
    "title": {"en": "Recent release"},
    "query": "...&sort=newest",
  }],
  exclusion: "title: ...",
  title: {
    "en": "Fantasy",
    "fr": "Fantaisie"
  },
  description: {
    "en": "..."
  },
  header_img: "https://media.istockphoto.com/id/1070683626/photo/magical-old-book-with-sparkles.jpg",
  children: ["/tag/OL2T", "/tag/OL33T"],
  neighbors: [],
  ... // additional schema related to tag_type
}
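Reading an internationalized field such as `title` from a document like this could be sketched as follows. The `localized` helper and its fall-back-to-English behaviour are assumptions for illustration; the thread does not specify how language fallback should work.

```python
from typing import Optional


def localized(value: dict, lang: str, fallback: str = "en") -> Optional[str]:
    """Pick the label for `lang`, else the fallback language, else None."""
    if lang in value:
        return value[lang]
    return value.get(fallback)


title = {"en": "Fantasy", "fr": "Fantaisie"}
french = localized(title, "fr")
german = localized(title, "de")  # no German label, so this falls back to English
```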
I'll remain the lead for this issue but am going to mark @JaydenTeoh as the assignee as they've been making great progress (keep up the great work!)
Related: #65
Supplanted by #7904
In the case of duplicates, most projects keep the oldest issue so that provenance, discussion, and age are preserved. This project seems to continually replace old, perfectly valid issues with new ones. Why is that? Does someone's bonus depend on how long tickets have been open?
It looks like my assumption of "fait accompli" was accurate. No response to questions about how this relates to BIBFRAME, FAST, LCSH, Wikidata, MARC, or anything else in the real world. No response to comments in design document that was linked above. No attention being paid to centuries of library cataloging practice or the directions that the library community (the only source of high quality metadata for OL) is going.
For the record, my request for Subjects to become first class objects in OpenLibrary was not "completed" as the issue status seems to indicate, but instead roundly rejected. "Tags" may become useful some day and it may even be possible to build Subjects as first class objects on top of them, but there is no plan or path which shows how (or even if) that is going to happen. I'm very disappointed.
The insults will surely help your case, Tom.
@tfmorris, we simply have 2 issues that were similar, one was closed, one remains open, and both are linked together. You're right that there could have been a better way of preserving provenance -- but hey, we could be happy that someone is caring enough to at least go through 700+ issues and trying their best to dedupe at all. Furthermore, a section was explicitly added to the planning doc in response to updates we're planning for October: https://docs.google.com/document/d/1zrZAXgk2GEZRWb0D8tsrgaPzX4KdXHVt1s6ZQ4wUHLI/edit#heading=h.o9utr3tyh8k. Furthermore, we have made progress on several elements of this plan through GSoC this year. As well as anyone who has been involved with the project for several years, you know we're a small team doing the best we can and sometimes an older issue gets closed instead of a newer one and it's not done out of malice but rather an attempt to get things organized so we can make forward progress towards efforts that I know you care about.
In response to your questions about BIBFRAME, FAST, LCSH, Wikidata, MARC -- we continue to import classifications from these sources and also intend to add links to sources within Tag documents (e.g. "this Tag is a classification from LCCN and here is its number").
We simply haven't gotten to the step of fully defining what is included in a Tag document because there are a lot of opportunities for integrating pieces we have confidence will be required. Much of the schema we imagine may be impacted by #7833, as we are interviewing 15 learners and educators to understand what types of affordances they may want beyond what we have in our current subject pages.
We've had dozens of calls on this topic, spanning staff, design team, engineering, and librarians, and at least 10 different people have been involved in weighing in, as have you (to the best of my ability) over github. This is one of many issues and it would be nice if it were appreciated that we're doing the best we're able, and also that it is hard to fully respond to the feedback folks have over github, which is why we do weekly community calls which I've tried hard to include you in.
Yes, I'm not perfect and will continue to make mistakes. I understand how frustrating advising and contributing to Open Library must feel under these constraints. All I can do is continue to be open to collaborating to the best of my ability and it would be nice if we could achieve that with good will and the compassion of two contributors who both care deeply about the project and doing right by our patrons.
Subjects are currently treated as strings, with light normalization to coalesce similar strings, which limits our flexibility to do things like support aliases, multiple languages, metadata such as descriptions, links to Wikidata, etc.
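For readers unfamiliar with what "light normalization to coalesce similar strings" can mean, here is a generic sketch. It is not Open Library's actual normalizer; `normalize_subject` and `coalesce` are hypothetical names and the specific rules (Unicode NFKC, whitespace collapsing, trailing punctuation, case folding) are assumptions.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_subject(raw: str) -> str:
    """Reduce a free-form subject string to a comparison key."""
    text = unicodedata.normalize("NFKC", raw)
    text = re.sub(r"\s+", " ", text).strip()
    # Caveat: this also trims the final dot of abbreviations such as "U.S."
    text = text.rstrip(".,;: ")
    return text.casefold()


def coalesce(subjects):
    """Group raw strings that share the same normalized key."""
    groups = defaultdict(list)
    for raw in subjects:
        groups[normalize_subject(raw)].append(raw)
    return dict(groups)


groups = coalesce(["Science Fiction", "  science  fiction. ", "History"])
```

String keys like these can group spelling variants, but they cannot carry aliases, translations, or descriptions, which is the limitation this issue is about.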
Proposal & Constraints
Subjects should be first class objects with a set of attributes including:
Component Updates
Additional context
Traditionally library cataloging standards have used pre-coordinated subjects like "U.S. History -- World War II -- 1945" (made up, perhaps invalid, example) which we split apart into constituent elements during import, similar to FAST. The working assumption is that we'll continue to do that, but just making the assumption explicit here.
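A minimal sketch of that splitting step, assuming the heading arrives as a single string joined with `--` as in the example above. `split_precoordinated` is a hypothetical helper, not the importer's actual code.

```python
from typing import List


def split_precoordinated(heading: str, separator: str = "--") -> List[str]:
    """Split a pre-coordinated heading into its constituent elements."""
    return [part.strip() for part in heading.split(separator) if part.strip()]


elements = split_precoordinated("U.S. History -- World War II -- 1945")
```

Each resulting element would then be matched to, or created as, its own subject.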
Stakeholders
⚠️ EDIT by @mekarpeles: Supplanted by #7904