internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Method to convert LCCs to LCC class names #3396

Open cdrini opened 4 years ago

cdrini commented 4 years ago

LCCs can be displayed as a ~path down the classification tree. These provide useful information which we want to display to the user. In order to do that, we need to be able to decode the LCC into classes. (This issue split off from #3290)

Describe the problem that you'd like solved

Want to be able to programmatically get the data on the right:

Sample LCC from a real book: F1047 .C95
Expected result:
[
("History of the Americas", (F)),
("British American (including Canada)", (F1001, F1145.2)),
("British America", (F1001, F1145.2)),
("Canada", (F1001, F1145.2)),
("Maritime Provinces", (F1035.8)),
("Prince Edward Island", (F1046, F1049.7)),
]

Sample LCC from a real book: NC760 .B2813 2004
Expected result:
[
("Visual Arts", (N)),
("Drawing. Design. Illustration", (NC)),
("Special subjects", (NC760, NC825)),
]

Sample LCC from a real book: QH81 .C3525 1996
Expected result:
[
("Science", (Q)),
("Natural History - Biology", (QH)),
("Natural History (General)", (QH1, QH278.5)),
]

Sample LCC from a real book: RF290 .E73 2009
Expected result:
[
("Medicine", (R)),
("Otorhinolaryngology", (RF)),
("Otology. Diseases of the ear", (RF110, RF320)),
]

Sample LCC from a real book: NB699.N4 B4 1969b
Expected result:
[
("Visual Arts", (N)),
("Sculpture", (NB)),
("History", (NB60, NB1115)),
]

See https://github.com/internetarchive/openlibrary/issues/3290 for more examples; note that the table there is missing the first LCC class.
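To make the expected results above concrete, here is a minimal Python sketch of one way such a lookup could work. It is not the method being developed for Open Library. The names `Node`, `OUTLINE`, `split_code`, `covers`, `find_path` and `lcc_to_class_names` are all hypothetical, the outline is a tiny hand-copied subset of the examples above, and the nesting of the entries is an assumption inferred from the order of the expected paths. A real version would load the full outline from a data file.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Leading class letters, optionally followed by the class number (e.g. "F1047").
LCC_RE = re.compile(r"^\s*([A-Z]{1,3})\s*(\d+(?:\.\d+)?)?")


@dataclass
class Node:
    name: str
    codes: Tuple[str, ...]  # ("F",), ("F1035.8",) or ("F1046", "F1049.7")
    children: List["Node"] = field(default_factory=list)


def split_code(code: str) -> Tuple[str, Optional[float]]:
    """Split 'F1046' into ('F', 1046.0); a bare class like 'F' gives ('F', None)."""
    match = LCC_RE.match(code)
    if not match:
        raise ValueError(f"Not an LCC: {code!r}")
    letters, number = match.groups()
    return letters, float(number) if number else None


def covers(node: Node, letters: str, number: Optional[float]) -> bool:
    """True if the node's letters/range contain the given class letters and number."""
    node_letters, low = split_code(node.codes[0])
    if low is None:  # bare class such as "N": matches "N", "NB", "NC", ...
        return letters.startswith(node_letters)
    if letters != node_letters or number is None:
        return False
    high = split_code(node.codes[-1])[1]  # same as low for single-code nodes
    return low <= number <= high


def find_path(nodes: List[Node], letters: str, number: Optional[float]) -> list:
    """Depth-first search; a heading is kept if it or one of its descendants matches."""
    for node in nodes:
        below = find_path(node.children, letters, number)
        if below or covers(node, letters, number):
            return [(node.name, node.codes)] + below
    return []


# Illustrative subset only; nesting is assumed from the expected results.
OUTLINE = [
    Node("History of the Americas", ("F",), [
        Node("British American (including Canada)", ("F1001", "F1145.2"), [
            Node("British America", ("F1001", "F1145.2"), [
                Node("Canada", ("F1001", "F1145.2"), [
                    Node("Maritime Provinces", ("F1035.8",), [
                        Node("Prince Edward Island", ("F1046", "F1049.7")),
                    ]),
                ]),
            ]),
        ]),
    ]),
    Node("Visual Arts", ("N",), [
        Node("Sculpture", ("NB",), [Node("History", ("NB60", "NB1115"))]),
        Node("Drawing. Design. Illustration", ("NC",), [
            Node("Special subjects", ("NC760", "NC825")),
        ]),
    ]),
]


def lcc_to_class_names(lcc: str) -> list:
    letters, number = split_code(lcc)
    return find_path(OUTLINE, letters, number)


for name, codes in lcc_to_class_names("NB699.N4 B4 1969b"):
    print(name, codes)
```

Headings such as "Maritime Provinces" (F1035.8) do not themselves contain F1047, which is why the sketch walks a tree and keeps any heading that has a matching descendant, rather than filtering a flat list of ranges.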

Proposal & Constraints

Notes:

Additional context

Stakeholders

@cclauss @BrittanyBunk

BrittanyBunk commented 4 years ago

@cdrini There are two outlines, the LCCO and the schedule outlines. @cclauss was using the schedule outlines: https://www.loc.gov/aba/cataloging/classification/. Should we use the LCCO if @cclauss's work is based on the schedules?

BrittanyBunk commented 4 years ago

Although incomplete, the LCCO is much easier to work with, because the schedules have subclasses where the indentation goes both forward and backward, and idk how to visualize or program that in a way that makes sense to viewers and coders. The LCCO only indents forward, so the classes always come after each other (not both before and after each other).

An example would be when it looks like this in the schedules:
------subclass 1
subclass 2
------subclass 3

Like how can that be represented easily? It can't. However, the LCCO can, because it looks like this:
subclass 1
----subclass 2
-------subclass 3

That's easy to represent. The only issue with the LCCO is that it's not the complete list of classes and subclasses. The schedules are the complete one.

That's my current dilemma, where something needs to be sacrificed: 1) completeness, 2) accuracy in representation.

It's up to you and @cclauss which you choose. I think the schedules are the best choice, since they're complete and official: we could always find a way to represent the info, but we can't easily get what we're missing.
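As a rough illustration of why forward-only indentation is easy to handle, the Python sketch below turns dash-indented lines into a root-to-leaf path for each heading by keeping a stack of ancestors. The dash-prefixed input is only the notation used in this comment, not the actual LCCO or schedule file format, and the function name `outline_paths` is hypothetical.

```python
from typing import List


def outline_paths(lines: List[str]) -> List[List[str]]:
    """Return, for each line, the headings from the root down to that line."""
    stack = []  # (indent, name) pairs for the current chain of ancestors
    paths = []
    for line in lines:
        name = line.lstrip("-")
        indent = len(line) - len(name)  # number of leading dashes
        # Drop every ancestor that is indented at least as far as this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        stack.append((indent, name.strip()))
        paths.append([heading for _, heading in stack])
    return paths


lcco_style = ["subclass 1", "----subclass 2", "-------subclass 3"]
for path in outline_paths(lcco_style):
    print(" > ".join(path))
```

Fed the schedule-style ordering instead (`["------subclass 1", "subclass 2", "------subclass 3"]`), the same function leaves "subclass 1" without a parent, because an indented line that comes before its less-indented neighbour has no ancestor on the stack.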

cdrini commented 4 years ago

I believe @cclauss is using the dumps from https://github.com/thisismattmiller/lcc-pdf-to-json. Using those seems best, because we can get something working and experiment with it to see how it "feels" :+1: Whatever we choose is not set in stone. We can always adjust it to handle more complexity if we find we need to :)

A complex system that works is invariably found to have evolved from a simple system that worked. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work. You have to start over, beginning with a working simple system. - John Gall

BrittanyBunk commented 4 years ago

@cdrini agreed. Let's go with what's already being used before taking on more :) That said, what's next?

cdrini commented 4 years ago

The next step: once @cclauss has a method he thinks is ready, he or I can add it to the UI and put it on dev.openlibrary.org for testing :) Does that seem correct, @cclauss?

cclauss commented 1 year ago

@cdrini Is this issue still useful?