internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Add LCC and Dewey decimal numbers to solr in April solr reindex #3290

Closed cdrini closed 4 years ago

cdrini commented 4 years ago

As discussed in the community call this past Tuesday, we would like to try implementing some sort of beta interface that lets users explore the LoC classification (or maybe Dewey Decimal) in openlibrary. See https://www.loc.gov/catdir/cpso/lcco/ . The first step would be to store the data in solr (where it currently isn't; e.g. http://server.openjournal.foundation:8984/solr/select/?q=key%3A%2Fworks%2FOL3773057W&version=2.2&start=0&rows=10&indent=on vs https://openlibrary.org/books/OL2543776M/Course_Design ).

Describe the problem that you'd like solved

Proposal & Constraints

Additional context

Stakeholders

@cclauss @finnless @tfmorris

cdrini commented 4 years ago

Sample data as stored in OL:

Key dewey_decimal_class Key lc_classifications
OL1000071M 303.6/9 OL1025841M HB1951 .R64 1995
OL1000602M 820.9/9287/0966 OL1025966M DP402.C8 O46 1995
OL1000884M 863 OL1026156M CS879 .R3 1995
OL1001241M 811/.54 OL1026211M NC248.S22 A4 1992
OL1001366M 658.8 OL102629M TJ563 .P66 1998
OL1001472M 330.94/051 OL1026596M PQ3919.2.M2866 C83 1994
OL1001537M 635.9/67 OL1026624M NA2500 .H64 1995
OL1001681M 651.8/536 OL1026668M PN517 .L38 1994
OL1002024M [Fic] OL1026747M MLCM 95/14118 (P)
OL1002068M 333.79/14 OL102706M QA331.3 .M39 1998
OL1002147M [Fic] OL1027106M PT8951.12.R5 M56 1980
OL1002396M 291.1/3 OL1027418M MLCS 96/04520 (P)
OL1002411M 574.19/2/076 OL1027454M HQ755.8 .T63 1995
OL100249M 342.73/087 OL1028019M MLCS 97/02275 (T)
OL1002756M 741.5/973 OL1028055M PZ70.C9 F657 1995
OL1003092M [E] OL1028253M HC241 .G683 1995
OL1003197M 813/.54 OL1028626M MLCS 95/08574 (U)
OL1004484M 398.24/528617 OL1028701M HC371 .M45 nr. 122
OL1005872M 003 OL102878M MLCS 2002/05802 (P)
OL1005937M 635/.0207 OL1029016M IN PROCESS
OL1006217M 782.1/4/0973 OL102935M KLA940 .K65 1990
OL1006312M 617.5/85 OL1029463M KHA878 .G37 1996
OL1007188M 005.13/3 OL1029540M KHH3003 .Q57 1995
OL1007427M 823/.8 OL1030429M TX819.A1 T733 1991
OL1007504M 418 OL1030465M PQ7298.12.A40 S26 1987
OL1007548M 747.7/97 OL1030780M HM216 .G44 1993
OL1007742M [E] OL1030894M SD409 .A38 1990
OL1008589M 332/.09172/6 OL1031493M J451 .N4 1990z
OL1008660M 355/.0330536 OL1031615M TR850 .F88 1993
OL1008703M 569/.67 OL1031659M MLCS 93/14492 (P)
OL1008934M 297/.092/2 OL1031710M KK2222 .L36 1993
OL1008978M 155.9/37 OL1031822M G525 .M486 1991
OL1009410M [E] OL1032690M HM261 .H47 1993
OL1009515M 570/.9 OL1032795M PQ8098.23.O516 L38 1988
OL1009559M 953.63 OL1032953M PL191 .I94 1992
OL1009793M 940.2/742/092 OL1033073M LF3194.C65 A657 1992
OL1009887M 306.4/84 OL103482M PG5438.V25 J47 1999
OL1009928M 808 OL1035916M HG1615 .M32 1993
OL1009964M 730/.92 OL1036001M KF27 .A3 1992h
OL1010856M 629.132/51 OL103608M PT1937.A1 G35 1999
OL1011037M 304.2 OL1036126M MLCS 98/02371 (H)
OL1011172M 613.7/042 OL1036553M MLCM 93/05262 (D)
OL1011381M 158/.3 OL1036719M KF3613.4 .C34
OL1011666M 133.5/946 OL1036755M DR1313.3 .U54 1993
OL1011741M 363.739/463/09496 OL1037020M DS557.8.M9 B55 1992b
OL1012371M 512.9/42 OL1037176M DR82 .G46 1993
OL1012502M 289.9/4/0922 OL1037305M PT2678.E3393 S36 1993
OL1013206M 363.2/2/02373 OL1037349M HN530.2.A85 I86 1992
OL1013754M 863 OL1037631M TK5105.5 .O653 1993
OL1013897M 937/.06 OL1038111M AM79.5.B26 B34 1993
BrittanyBunk commented 4 years ago

Great starting point!

For (my personal) reference:

I notice some of these have [Fic] or [E]. Since those values are less reliable (e.g. there are multiple [E]s), I'm assuming the LoC classification would be the better ID to start with (i.e. more specific), especially since those books do have one. I'm assuming each edition will end up with both in the end, so I won't worry.
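
To make the [Fic] / [E] point concrete, here is a minimal sketch of how the `dewey_decimal_class` values in the sample could be split into numeric and non-numeric cases. It assumes the slashes are segmentation marks (as they usually are in MARC records) and can simply be dropped. The function name `normalize_ddc` is hypothetical and is not part of the Open Library codebase.

```python
import re
from typing import Optional

# Three digits, optionally followed by a decimal part, e.g. "813.54" or "003".
_DDC_RE = re.compile(r"^\d{3}(?:\.\d+)?$")


def normalize_ddc(raw: str) -> Optional[str]:
    """Return a plain Dewey number, or None for labels such as "[Fic]".

    Assumption: "/" only marks segmentation points and carries no meaning
    for lookup, so it is removed before validation.
    """
    candidate = raw.strip().replace("/", "")
    return candidate if _DDC_RE.match(candidate) else None


if __name__ == "__main__":
    for value in ["813/.54", "363.739/463/09496", "[Fic]", "[E]", "003"]:
        print(value, "->", normalize_ddc(value))
```

Values that come back as `None` would need separate handling (or could be skipped) before indexing.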

tfmorris commented 4 years ago

This seems super low priority. I assume that the MANY other Solr bug fixes and feature requests will be waaayyy ahead of this.

BrittanyBunk commented 4 years ago

@tfmorris Here's my assumption of it: this first step is small, but what comes next is important when genres/sub-genres could be attached to editions - this'll clean up and help with the subject pages.

cclauss commented 4 years ago

@finnless suggested that I look at https://github.com/thisismattmiller/lcc-pdf-to-json which made it easy for me to create a lc_classifier function:

def lcc_to_subject(lcc: str) -> str:
    """
    >>> lcc_to_subject("ZA3201")
    'Information superhighway'
    """

The output could be Information resources (General): Information superhighway instead.
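
The function body is not shown above, so the following is only a sketch of one way such a lookup might work, not the code from the linked project. It assumes the classification data has been flattened into (letters, start, end, label) rows; the two `ZA` rows are illustrative and should be checked against the LoC outline. The names `LccRange`, `RANGES`, and `lcc_to_subjects` are hypothetical.

```python
import re
from typing import List, NamedTuple


class LccRange(NamedTuple):
    letters: str
    start: float
    end: float
    label: str


# Illustrative rows only (assumption); a real table would hold thousands.
RANGES = [
    LccRange("ZA", 3038, 5190, "Information resources (General)"),
    LccRange("ZA", 3201, 3250, "Information superhighway"),
]

# Class letters followed by the class number, e.g. "ZA3201" or "QA 331.3".
_LCC_RE = re.compile(r"^\s*([A-Z]{1,3})\s*(\d+(?:\.\d+)?)")


def lcc_to_subjects(lcc: str) -> List[str]:
    """Return every matching label, ordered from broadest to narrowest."""
    match = _LCC_RE.match(lcc.upper())
    if not match:
        return []
    letters, number = match.group(1), float(match.group(2))
    hits = [r for r in RANGES if r.letters == letters and r.start <= number <= r.end]
    # Wider ranges sort first so the path reads broad -> narrow.
    hits.sort(key=lambda r: r.end - r.start, reverse=True)
    return [r.label for r in hits]


if __name__ == "__main__":
    print(": ".join(lcc_to_subjects("ZA3201")))
```

Returning the whole path rather than only the narrowest label makes both output styles mentioned above available: take the last element for the short form, or join the list for the longer one.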

cclauss commented 4 years ago
lc_classification clean lc_subjects
HB1951 .R64 1995 HB1951 ['Economic theory. Demography', 'Demography. Population. Vital events']
DP402.C8 O46 1995 DP402 ['History of Spain', 'Local history and description', 'Other cities, towns, etc., A-Z']
CS879 .R3 1995 CS879 ['Genealogy', 'By region or country']
NC248.S22 A4 1992 NC248 ['Drawing. Design. Illustration', 'History of drawing']
TJ563 .P66 1998 TJ563 ['Mechanical engineering and machinery', 'Steam engineering']
PQ3919.2.M2866 C83 1994 PQ3919 ['French literature', 'Provincial, local, colonial, etc.']
NA2500 .H64 1995 NA2500 ['Architecture', 'General works']
PN517 .L38 1994 PN517 ['Literature (General)', 'Literary history', 'Collections']
MLCM 95/14118 (P) MLCM95 []
QA331.3 .M39 1998 QA331 ['Mathematics', 'Analysis']
PT8951.12.R5 M56 1980 PT8951 ['Norwegian literature', 'Individual authors or works', '1961-2000']
MLCS 96/04520 (P) MLCS96 []
HQ755.8 .T63 1995 HQ755 ['The Family. Marriage. Women', 'The family. Marriage. Home', 'Eugenics']
MLCS 97/02275 (T) MLCS97 []
PZ70.C9 F657 1995 PZ70 ['Fiction and juvenile belles lettres', 'Juvenile belles lettres']
HC241 .G683 1995 HC241 ['Economic history and conditions', 'By region or country']
MLCS 95/08574 (U) MLCS95 []
HC371 .M45 nr. 122 HC371 ['Economic history and conditions', 'By region or country']
MLCS 2002/05802 (P) MLCS2002 []
IN PROCESS INPROCESS []
KLA940 .K65 1990 KLA940 ['Russia, Soviet Union']
KHA878 .G37 1996 KHA878 ['Argentina']
KHH3003 .Q57 1995 KHH3003 ['Colombia']
TX819.A1 T733 1991 TX819 ['Home economics', 'Cooking']
PQ7298.12.A40 S26 1987 PQ7298 ['Spanish literature', 'Provincial, local, colonial, etc.', 'Spanish America']
HM216 .G44 1993 HM216 ['Sociology', 'These are obsolete numbers no longer used']
SD409 .A38 1990 SD409 ['Forestry', 'Sylviculture']
J451 .N4 1990z J451 ['General legislative and executive papers', 'Other regions and countries']
TR850 .F88 1993 TR850 ['Photography', 'Cinematography. Motion pictures']
MLCS 93/14492 (P) MLCS93 []
KK2222 .L36 1993 KK2222 ['Law of Germany', 'Commercial law', 'Commercial transactions', 'Banking. Stock exchange']
G525 .M486 1991 G525 ['Geography (General)', 'Adventures, shipwrecks, buried treasure, etc.']
HM261 .H47 1993 HM261 ['Sociology', 'These are obsolete numbers no longer used']
PQ8098.23.O516 L38 1988 PQ8098 ['Spanish literature', 'Provincial, local, colonial, etc.', 'Spanish America']
PL191 .I94 1992 PL191 ['Languages of Eastern Asia, Africa, Oceania', 'Ural-Altaic languages', 'Turkic languages']
LF3194.C65 A657 1992 LF3194 ['Individual institutions', 'Germany']
PG5438.V25 J47 1999 PG5438 ['Slavic. Baltic. Albanian', 'Slavic', 'Slovak']
HG1615 .M32 1993 HG1615 ['Finance', 'Banking']
KF27 .A3 1992h KF27 ['Law of the United States (Federal)', 'Congressional documents']
PT1937.A1 G35 1999 PT1937 ['German literature', 'Individual authors or works', '1700-ca. 1860/70', 'Goethe', 'Works']
MLCS 98/02371 (H) MLCS98 []
MLCM 93/05262 (D) MLCM93 []
KF3613.4 .C34 KF3613 ['Law of the United States (Federal)', 'Social legislation', 'Social insurance']
DR1313.3 .U54 1993 DR1313 ['History of Balkan Peninsula', 'Yugoslavia', 'History', 'By period', '1918-', 'Yugoslav War, 1991-1995']
DS557.8.M9 B55 1992b DS557 ['History of Asia', 'Southeast Asia', 'French Indochina', 'Vietnam. Annam', 'Vietnamese Conflict']
DR82 .G46 1993 DR82 ['History of Balkan Peninsula', 'Bulgaria', 'History', 'By period', 'Turkish rule, 1396-1878']
PT2678.E3393 S36 1993 PT2678 ['German literature', 'Individual authors or works', '1961-2000']
HN530.2.A85 I86 1992 HN530 ['Social history and conditions. Social problems.', 'By region or country']
TK5105.5 .O653 1993 TK5105 ['Electrical engineering. Electronics. Nuclear', 'Telecommunication']
AM79.5.B26 B34 1993 AM79 ['Museums. Collectors and collecting', 'By country']
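
For reference, the "clean" column above can be reproduced with a very small normalizer. This is a sketch inferred from the sample rows only, not the code that generated the table; `clean_lcc` is a hypothetical name.

```python
import re

# Leading class letters plus the integer part of the class number.
_CLASS_RE = re.compile(r"[A-Z]+\d*")


def clean_lcc(raw: str) -> str:
    """Collapse whitespace, then keep the leading letters and digits.

    Decimals, cutter numbers, and years are dropped, so
    "PQ3919.2.M2866 C83 1994" becomes "PQ3919".
    """
    compact = "".join(raw.split()).upper()
    match = _CLASS_RE.match(compact)
    return match.group(0) if match else ""


if __name__ == "__main__":
    for raw in ["HB1951 .R64 1995", "MLCM 95/14118 (P)", "IN PROCESS"]:
        print(raw, "->", clean_lcc(raw))
```

Note that values such as `MLCS 96/04520 (P)` and `IN PROCESS` still produce a string, which is why they appear in the table with an empty subject list; they may be worth filtering out before any lookup.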
cdrini commented 4 years ago

@cclauss That looks awesome! Hmmm, we should display these on pages where we have an LCC; maybe something like this?

image

Then once these are in solr, we can make each level clickable, leading to search page 😍 . But I don't think this needs to be blocked by that happening. They still add value + improve SEO even if they're just text!

cdrini commented 4 years ago

@tfmorris After all the work that went into #1067 , which largely blocked most modification to solr until completion (and even now it's still stuck in PR -_-), I wanted to work on something small, scoped, and impactful that takes advantage of/showcases our new super power (full reindexing!).

BrittanyBunk commented 4 years ago

@cdrini I spoke with @cclauss and came up with another way of thinking about it. We could do both my idea and your format, that's fine; I just want to describe mine and how it would look alongside yours. Since not every LC classification led to a corresponding class in @cclauss's example, here's my process for making sure every one has a corresponding class: use only the first 2 letters to generate genres, where the 1st letter is the genre and the 2nd letter is the sub-genre. The output would be generated from the entire list provided by @cclauss: https://www.questionpoint.org/crs/html/help/en/ask/ask_map_lcctoddc.html . Here's what it would look like at the end (ignore the poor formatting): image

cclauss commented 4 years ago

@BrittanyBunk and I slacked on this 12 hours ago and she proposed the same two-letter thing. The letters A, D, and J threw me because that table does not provide single-letter meanings, but she provided them to me. So I will propose a new PR that shows us how to get the first three classifications so we can see how it looks, and then we can choose whether to use just the two letters or letters plus numbers.

BrittanyBunk commented 4 years ago

@cclauss ok. I see why you're running into issues. The site you showed me is incomplete (as it's used for Dewey Decimal conversions, and Dewey Decimal is not as robust as the LoC). The official one to use is complete. This equivalent should be the complete version to use (I'd just download it to a doc in case it gets changed), although it might need to be double checked.

cclauss commented 4 years ago

AC --> General Works: Collections. Series. Collected works
AE --> General Works: Encyclopedias

Long keys: [DAW, DJK, KBM, KBP, KBR, KBS, KBT, KBU, KD/KDK, KDZ, KJ-KKZ, KL-KWX, KU/KUQ]

That parses to 230 records:

{
  "A": "General Works",
  "AC": "Collections. Series. Collected works",
  "AE": "Encyclopedias",
  "AG": "Dictionaries and other general reference works",
  "AI": "Indexes",
  "AM": "Museums. Collectors and collecting",
  "AN": "Newspapers",
  "AP": "Periodicals",
  "AS": "Academies and learned societies",
  "AY": "Yearbooks. Almanacs. Directories",
  "AZ": "History of scholarship and learning. The humanities",
  "B": "Philosophy, Psychology, Religion",
  "BC": "Logic",
  "BD": "Speculative philosophy",
  "BF": "Psychology",
  "BH": "Aesthetics",
  "BJ": "Ethics",
  "BL": "Religions. Mythology. Rationalism",
  "BM": "Judaism",
  "BP": "Islam. Bahaism. Theosophy, etc.",
  "BQ": "Buddhism",
  "BR": "Christianity",
  "BS": "The Bible",
  "BT": "Doctrinal theology",
  "BV": "Practical Theology",
  "BX": "Christian Denominations",
  "C": "Auxiliary Sciences of History",
  "CB": "History of Civilization",
  "CC": "Archaeology",
  "CD": "Diplomatics. Archives. Seals",
  "CE": "Technical Chronology. Calendar",
  "CJ": "Numismatics",
  "CN": "Inscriptions. Epigraphy",
  "CR": "Heraldry",
  "CS": "Genealogy",
  "CT": "Biography",
  "D": "History, General and Old World",
  "DA": "Great Britain",
  "DAW": "Central Europe",
  "DB": "Czechoslovakia",
  "DC": "Monaco",
  "DD": "Germany",
  "DE": "Greco-Roman World",
  "DF": "Greece",
  "DG": "Malta",
  "DH": "Benelux Countries",
  "DJ": "Netherlands (Holland)",
  "DJK": "Eastern Europe (General)",
  "DK": "Poland",
  "DL": "Northern Europe. Scandinavia",
  "DP": "Portugal",
  "DQ": "Switzerland",
  "DR": "Balkan Peninsula",
  "DS": "Asia",
  "DT": "Africa",
  "DU": "Oceania (South Seas)",
  "DX": "Romanies",
  "E": "History of America",
  "F": "Local History of the United States and British, Dutch, French, and Latin America",
  "G": "Geography. Anthropology. Recreation",
  "GA": "Mathematical geography. Cartography",
  "GB": "Physical geography",
  "GC": "Oceanography",
  "GE": "Environmental Sciences",
  "GF": "Human ecology. Anthropogeography",
  "GN": "Anthropology",
  "GR": "Folklore",
  "GT": "Manners and customs (General)",
  "GV": "Recreation. Leisure",
  "H": "Social sciences",
  "HA": "Statistics",
  "HB": "Economic theory. Demography",
  "HC": "Economic history and conditions",
  "HD": "Industries. Land use. Labor",
  "HE": "Transportation and communications",
  "HF": "Commerce",
  "HG": "Finance",
  "HJ": "Public finance",
  "HM": "Sociology (General)",
  "HN": "Social history and conditions. Social problems. Social reform",
  "HQ": "The family. Marriage, Women and Sexuality",
  "HS": "Societies: secret, benevolent, etc.",
  "HT": "Communities. Classes. Races",
  "HV": "Social pathology. Social and public welfare. Criminology",
  "HX": "Socialism. Communism. Anarchism",
  "J": "Political science",
  "JA": "Political science (General)",
  "JC": "Political theory",
  "JF": "Political institutions and public administration",
  "JJ": "Political institutions and public administration (North America)",
  "JK": "Political institutions and public administration (United States)",
  "JL": "Political institutions and public administration (Canada, Latin America, etc.)",
  "JN": "Political institutions and public administration (Europe)",
  "JQ": "Political institutions and public administration (Asia, Africa, Australia, Pacific Area, etc.)",
  "JS": "Local government. Municipal government",
  "JV": "Colonies and colonization. Emigration and immigration. International migration",
  "JX": "International law, see JZ and KZ (obsolete)",
  "JZ": "International relations",
  "K": "Law",
  "KB": "Religious law in general. Comparative religious law. Jurisprudence",
  "KBM": "Jewish law",
  "KBP": "Islamic law",
  "KBR": "History of canon law",
  "KBS": "Canon law of Eastern churches",
  "KBT": "Canon law of Eastern Rite Churches in Communion with the Holy See of Rome",
  "KBU": "Law of the Roman Catholic Church. The Holy See",
  "KD/KDK": "United Kingdom and Ireland",
  "KDZ": "America. North America",
  "KE": "Canada",
  "KF": "United States",
  "KG": "West Indies. Caribbean area",
  "KH": "South America",
  "KJ-KKZ": "Europe",
  "KL-KWX": "Asia and Eurasia, Africa, Pacific Area, and Antarctica",
  "KU/KUQ": "Law of Australia and New Zealand",
  "KZ": "Law of nations",
  "L": "Education",
  "LA": "History of education",
  "LB": "Theory and practice of education",
  "LC": "Special aspects of education",
  "LD": "United States",
  "LE": "America (except United States)",
  "LF": "Europe",
  "LG": "Asia, Africa, Indian Ocean islands, Australia, New Zealand, Pacific islands",
  "LH": "College and school magazines and papers",
  "LJ": "Student fraternities and societies, United States",
  "LT": "Textbooks",
  "M": "Music",
  "ML": "Literature on music",
  "MT": "Instruction and study",
  "N": "Fine Arts",
  "NA": "Architecture",
  "NB": "Sculpture",
  "NC": "Drawing. Design. Illustration",
  "ND": "Painting",
  "NE": "Print media",
  "NK": "Decorative arts",
  "NX": "Arts in general",
  "P": "Language and Literature",
  "PA": "Greek language and literature. Latin language and literature",
  "PB": "Modern languages. Celtic languages and literature",
  "PC": "Romanic languages",
  "PD": "Germanic languages. Scandinavian languages",
  "PE": "English language",
  "PF": "West Germanic languages",
  "PG": "Slavic languages and literatures. Baltic languages. Albanian language",
  "PH": "Uralic languages. Basque language",
  "PJ": "Oriental languages and literatures",
  "PK": "Indo-Iranian languages and literatures",
  "PL": "Languages and literatures of Eastern Asia, Africa, Oceania",
  "PM": "Hyperborean, Native American, and artificial languages",
  "PN": "Literature (General)",
  "PQ": "Portuguese literature",
  "PR": "English literature",
  "PS": "American literature",
  "PT": "Swedish literature",
  "PZ": "Fiction and juvenile belles lettres",
  "Q": "Science",
  "QA": "Mathematics",
  "QB": "Astronomy",
  "QC": "Physics",
  "QD": "Chemistry",
  "QE": "Geology",
  "QH": "Biology",
  "QK": "Botany",
  "QL": "Zoology",
  "QM": "Human anatomy",
  "QP": "Physiology",
  "QR": "Microbiology",
  "R": "Medicine",
  "RA": "Public aspects of medicine",
  "RB": "Pathology",
  "RC": "Internal medicine",
  "RD": "Surgery",
  "RE": "Ophthalmology",
  "RF": "Otorhinolaryngology",
  "RG": "Gynecology and Obstetrics",
  "RJ": "Pediatrics",
  "RK": "Dentistry",
  "RL": "Dermatology",
  "RM": "Therapeutics. Pharmacology",
  "RS": "Pharmacy and materia medica",
  "RT": "Nursing",
  "RV": "Botanic, Thomsonian, and Eclectic medicine",
  "RX": "Homeopathy",
  "RZ": "Other systems of medicine",
  "S": "Agriculture",
  "SB": "Horticulture. Plant propagation. Plant breeding",
  "SD": "Forestry. Arboriculture. Silviculture",
  "SF": "Animal husbandry. Animal science",
  "SH": "Aquaculture. Fisheries. Angling",
  "SK": "Hunting",
  "T": "Technology",
  "TA": "Engineering Civil engineering (General).",
  "TC": "Hydraulic engineering. Ocean engineering",
  "TD": "Environmental technology. Sanitary engineering",
  "TE": "Highway engineering. Roads and pavements",
  "TF": "Railroad engineering and operation",
  "TG": "Bridges",
  "TH": "Building construction",
  "TJ": "Mechanical engineering and machinery",
  "TK": "Electrical engineering. Electronics. Nuclear engineering",
  "TL": "Motor vehicles. Aeronautics. Astronautics",
  "TN": "Mining engineering. Metallurgy",
  "TP": "Chemical technology",
  "TR": "Photography",
  "TS": "Manufacturing engineering. Mass production",
  "TT": "Handicrafts. Arts and crafts",
  "TX": "Home economics",
  "U": "Military Science",
  "UA": "Armies: Organization, distribution, military situation",
  "UB": "Military administration",
  "UC": "Military maintenance and transportation",
  "UD": "Infantry",
  "UE": "Cavalry. Armor",
  "UF": "Artillery",
  "UG": "Military engineering. Air forces",
  "UH": "Other military services",
  "V": "Naval Science",
  "VA": "Navies: Organization, distribution, naval situation",
  "VB": "Naval administration",
  "VC": "Naval maintenance",
  "VD": "Naval seamen",
  "VE": "Marines",
  "VF": "Naval ordnance",
  "VG": "Minor services of navies",
  "VK": "Navigation. Merchant marine",
  "VM": "Naval architecture. Shipbuilding. Marine engineering",
  "Z": "Bibliography. Library Science. Information resources",
  "ZA": "Information resources/materials"
}
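
A mapping like the one above could be applied to a call number roughly as follows. This is a sketch under stated assumptions, not the PR's implementation: `LCC_CLASSES` is a small excerpt of the records above, `class_path` is a hypothetical name, and range-style keys such as `KJ-KKZ` are not handled.

```python
import re
from typing import Dict, List

# Excerpt of the parsed records above; the full mapping has 230 entries.
LCC_CLASSES: Dict[str, str] = {
    "D": "History, General and Old World",
    "DA": "Great Britain",
    "DAW": "Central Europe",
    "K": "Law",
    "KB": "Religious law in general. Comparative religious law. Jurisprudence",
    "KBM": "Jewish law",
    "Q": "Science",
    "QA": "Mathematics",
}


def class_path(lcc: str) -> List[str]:
    """Return [class, subclass] labels for the leading letters of an LCC.

    The subclass is the longest known key, so "DAW" resolves to
    "Central Europe" rather than to the shorter key "DA".
    """
    letters = re.match(r"[A-Z]*", lcc.strip().upper()).group(0)
    if not letters:
        return []
    path = []
    top = LCC_CLASSES.get(letters[0])
    if top:
        path.append(top)
    # Try the longest prefix first, stopping before the single letter.
    for size in range(len(letters), 1, -1):
        label = LCC_CLASSES.get(letters[:size])
        if label:
            path.append(label)
            break
    return path


if __name__ == "__main__":
    print(" --> ".join(class_path("QA331.3 .M39 1998")))
```

One caveat: shelving codes such as `MLCS` or `MLCM` start with `ML`, so with the full mapping they would be labelled as music unless they are excluded first.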
BrittanyBunk commented 4 years ago

Cool! So now that we have that, we could use this for the DDC too! I tried to create an excel with LCC -> DDC and vice versa, but didn't get far enough. Maybe it could be coded, but here's the start: https://drive.google.com/file/d/1Yu-srlXD_FcUUTRV9lwseXrR7qEsNQ9a/view?usp=sharing

cdrini commented 4 years ago

UI-wise, let's start with just classes for now; I think having subjects, classes, genre, and sub-genre might be a little too much/confusing. I think the classes might be better since it also displays full granularity (so folks can dive in at any point they wish).

cclauss commented 4 years ago

Agreed. Let’s also get LC classes working smoothly & consistently before also doing DDC. The .pdf you added is great but highlights the complexity of getting it right.

cdrini commented 4 years ago

Baby steps :) To quote one of my new favourite laws (thanks @LeadSongDog !)

Gall's Law: A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system.

BrittanyBunk commented 4 years ago

@cdrini Just so we're all on the same page, you mean what's on https://www.loc.gov/catdir/cpso/lcco/ only right? Like https://openlibrary.org/books/OL103608M/Johann_Wolfgang_Goethe_Faust-Dichtungen. would show "Language and Literature" or what's in the image you posted?

cdrini commented 4 years ago

I mean the classes section displayed here: https://github.com/internetarchive/openlibrary/issues/3290#issuecomment-608143138

BrittanyBunk commented 4 years ago

@cdrini @cclauss So something like this then (I completed the full path) (*Ignore the fonts - just a mockup): image Outline is just to use the name on the LoC site - but I'll research the correct term. @seabelis would you know by any chance?

cdrini commented 4 years ago

Ahhh, I see what you mean. Yes, exactly :+1:

BrittanyBunk commented 4 years ago

I'm thinking LCC titles, because the 'outline' is the name of the entire listing and we're just showing the individual titles of the LCC letters and numbers, but will defer to expert opinion if there is a name for this.

seabelis commented 4 years ago

> Outline is just to use the name on the LoC site - but I'll research the correct term. @seabelis would you know by any chance?

Outline is the title of that specific page because the contents of the page is an outline of the LC Classification system.

Let's make the heading what it is: "Library of Congress Classification" or "LC Classification." It would be nice to include the call number itself, similar to what is displayed on the LC catalog record as "browse by shelf order", with the added functionality of the class names themselves as links. See https://catalog.loc.gov/vwebv/holdingsInfo?searchId=33155&recCount=25&recPointer=0&bibId=21468183, about halfway down the record.

BrittanyBunk commented 4 years ago

@seabelis thanks for helping me out.

seabelis commented 4 years ago

You are showing this at the edition level. I understood this was going to be used at the work-level.

BrittanyBunk commented 4 years ago

I didn't see it being there unless we're calling it a 'genre'. However, since it's a classification based on a classification number of a book, this seems to be a better place. I will double check on this right now.

seabelis commented 4 years ago

> I didn't see it being there unless we're calling it a 'genre'.

Why? This is not a genre. This is a classification.

BrittanyBunk commented 4 years ago

Exactly! That's why it goes underneath the classification tab, as each edition has a different call number. Even though most are in the same category as each other, this is just in case they aren't: https://openlibrary.org/books/OL24172776M/Hamlet and https://openlibrary.org/books/OL24614660M/Der_erste_deutsche_B%C3%BChnen-Hamlet - different editions, different call numbers. They're the same category, so I'll keep looking - as I want to be sure of where to place it - as you're right - it's unclear.

BrittanyBunk commented 4 years ago

I can't find one right now, so I wouldn't know where would be better - the works page or under Classifications.

cdrini commented 4 years ago

Yes, it is a little awkward since the LCC is stored on the edition (and can vary!). But I think we should display it on the work near the subjects section, because I think that will make it easier to find for non-librarian users. I chose the word "classes" instead of "Classifications" or "LCC" also hoping that might be easier for novice users. When DDC are eventually added, they can also appear in the "classes" section.

I think we need to wait for some ui demos to be implemented to see how these look and feel before we can make a final decision :)

BrittanyBunk commented 4 years ago

@cdrini Having it on the works page should be fine, as the LCC should be the same for all the editions - they all should be of the same topic.

I didn't use 'classes' as it's a combination of classes and subclasses (so I got confused), but I see what you mean. I'll wait until those are finished then before proceeding further.

tfmorris commented 4 years ago

@cdrini There are 48 open Solr bugs, some over a decade old. Do none of them meet your criteria?

If you want help choosing, I'll suggest #178, which is small, self-contained, HUGELY impactful, and just over a decade old, having been first reported March 13, 2010. By simply changing the definition of a single field, the author's name, users will now be able to find this record with 71 works for René-Aubert Vertot rather than this orphan with a single work when they search for Rene Vertot.

If you search for Renee Shann, you won't find ANY of the 107 works that OpenLibrary has cataloged.

If a librarian like @seabelis wanted to merge the 9 different Rene Char records, they'd need to search twice and then stitch the results together by hand.

In the face of all this, and the myriad other Solr issues, we're going to invent an entirely new, never before requested, issue to waste time on?

That is doing our patrons a HUGE disservice.

BrittanyBunk commented 4 years ago

@tfmorris I don't like getting involved in other people's discussions, but some things are important to say. This github issue that @cdrini's working on is something that's been going on for a while and requires a lot of people's help and right now's the moment that the resources are here. Also, doing this helps with future developments. It's an infrastructure that will make books easier to find - that includes the #178 you mentioned. Drini mentioned in the community meeting that reindexing the solr project is going to fix the inability to search by non-English characters, so it seems a little misguided. I would read up on the community meeting notes, especially 3-31-20 - which shows what I mean.

BrittanyBunk commented 4 years ago

Sorry @cdrini for continuing after you said you wanted to focus on getting the indexing right, but since the labeling was discussed in the meeting, I just wanted to give another input on this: I realize that 'LCC titles' might be more appropriate than 'Library of Congress Classification'. The reason is that the call number is already called that on the OL, the LCCO page says that it's letters (and I'm assuming numbers) and 'titles' of an LC classification (first sentence), and the LoC calls the call number the LC classification (although they are inconsistent on some pages). @seabelis @cclauss

cdrini commented 4 years ago

@BrittanyBunk Thanks Brittany! Reindexing is one of the blockers for allowing us to work on #178; it unfortunately doesn't impact the issue itself; that must've been a typo in the notes.

@tfmorris Yep; I'm aware of that issue. I believe updating to solr 8 (#3317) is more important (which is why I've also taken that up in my milestone for this month). Trying to fix #178 before #3317 would require investing time into installing solr 3.6 specific plugins / config, all of which would have to get redone once we do #3317. We've had this discussion before; one of my first PRs on openlibrary was a fix to #178, #599 ; So I'm fully aware of how important that issue is. We decided that although using ASCII Folding (which is what I did in #599 ) was an improvement, it wasn't that great for non-English languages, and that ICUFolding Filter (as you've done in your solr PR) was most correct ( See https://github.com/internetarchive/openlibrary/pull/599#issuecomment-345490579 ). This filter requires us to add plugins to solr (which I even did on a branch off #599). But I think adding plugins to solr in 3.6 would be a waste of time, since my guess would be the plugin flow has changed.

I worked on and completed the first issue that was blocking #178 (re-indexable solr), and am planning on the second which is ~blocking #178 (#3317). I am also working on this current issue because it addresses issues brought up in one of the community calls, because it made a lot of people (myself included) excited, because it provides infrastructure which will allow for a whole host of features that will improve the user experience, because it allows us to take advantage of a carefully curated, librarian-standard data field, because it further tests the solr re-index flow, and because it showcases the importance of a re-indexable solr to people who might not realize how important it is. I apologize if that wasn't clearly communicated, but I wish you would be a little less quick to jump to accusations. We have been and are aligned on a lot of the same overall goals, and I am making progress towards them.

cdrini commented 4 years ago

I'm happy to accept #599 as a temporary patch until we get ICUFolding with solr 8; does that seem reasonable? I can include that in the May 1 solr reindex.
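
As a rough illustration of what accent folding does for the searches discussed above (Rene vs. René), here is a Python sketch using Unicode decomposition. It only approximates the idea: Solr's ASCII folding and ICU folding filters cover many more cases, and `fold_accents` is a hypothetical helper, not code from #599.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip combining marks and case so "René" and "rene" compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


if __name__ == "__main__":
    indexed = "René-Aubert Vertot"
    query = "Rene-Aubert Vertot"
    # Folding both sides the same way makes the accented and plain forms match.
    print(fold_accents(indexed) == fold_accents(query))
```

Applying the same folding at index time and at query time is what lets a search typed without accents find the accented record.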

BrittanyBunk commented 4 years ago

@cdrini Maybe the notes can be fixed? Also, how come you're trying to work on #718, which is already set up and closed? I would like to set up an enthusiast and beta tester's (EBT) wiki, but it's difficult with the current situation lol. Guess I'll wait until the Solr reindex, data dump, and series are finished before starting. If everything's set up, newcomers can move on to the next steps with the tools they need and not worry about anything unnecessary.

It's really awesome to see everything come together, step-by-step. I would share my vision of the EBT page, but I didn't set it up yet. If anyone wants me to, let me know.

cdrini commented 4 years ago

Ahhh, I meant #178 :P Fixed. I'll fix the notes :+1:

BrittanyBunk commented 4 years ago

@cdrini Wow! Clear. I agree - it does make sense to move forward in order to address what's in the back to keep up. That's how I've done it in my life, so it works - I think you got a great plan and look forward to the changes :) Thanks for helping with the corrections.

cdrini commented 4 years ago

I created a new issue for getting the LCC class names from the LCC, since this current issue is going to be closed soon :) See #3396

cdrini commented 4 years ago

Closed by the PRs mentioned here. Unfortunately it's still living on dev.openlibrary.org , but here's a nice little demo: