Add LCC and Dewey decimal numbers to solr in April solr reindex

cdrini commented 4 years ago

As discussed in the community call this past Tuesday, we would like to try implementing some sort of beta interface that lets users explore the LoC classification (or maybe Dewey Decimal) in Open Library. See https://www.loc.gov/catdir/cpso/lcco/ . The first step would be to store the data in solr, where it currently isn't; e.g. compare http://server.openjournal.foundation:8984/solr/select/?q=key%3A%2Fworks%2FOL3773057W&version=2.2&start=0&rows=10&indent=on vs https://openlibrary.org/books/OL2543776M/Course_Design .

Describe the problem that you'd like solved

A way to perform range queries over LoC data; e.g.: loc:[BC1 TO BC199]; dewey_decimal:[070 TO 079]

Proposal & Constraints

Note that these range queries compare values lexicographically (as strings), not numerically.
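To make the lexicographical caveat concrete, here is a small Python sketch. It only mimics string range comparison; it is not how solr is configured in Open Library. The helper `lcc_sort_key` and its zero-padding width are hypothetical, one possible way to make string order agree with numeric order.

```python
import re


def in_lexicographic_range(value: str, low: str, high: str) -> bool:
    """Mimic an inclusive string range query such as loc:[BC1 TO BC199]."""
    return low <= value <= high


def lcc_sort_key(lcc: str, width: int = 4) -> str:
    """Hypothetical helper: zero-pad the class number so that
    string order matches numeric order (decimals are ignored here)."""
    m = re.match(r"([A-Z]+)\s*(\d+)", lcc.upper())
    if m is None:
        raise ValueError(f"not an LCC class: {lcc!r}")
    letters, number = m.groups()
    return f"{letters}{int(number):0{width}d}"


# Raw strings: BC2 is wrongly excluded and BC1000 is wrongly included.
print(in_lexicographic_range("BC2", "BC1", "BC199"))     # False
print(in_lexicographic_range("BC1000", "BC1", "BC199"))  # True

# Padded keys: both cases behave as a reader would expect.
low, high = lcc_sort_key("BC1"), lcc_sort_key("BC199")
print(in_lexicographic_range(lcc_sort_key("BC2"), low, high))     # True
print(in_lexicographic_range(lcc_sort_key("BC1000"), low, high))  # False
```

The Dewey example in the request (070 TO 079) already works as a string range because those values are zero-padded to three digits.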

Additional context

The data is stored on editions as lc_classifications and dewey_decimal_class.

Stakeholders

@cclauss @finnless @tfmorris

cdrini commented 4 years ago

Sample data as stored in OL:

Key	dewey_decimal_class	Key	lc_classifications
OL1000071M	303.6/9	OL1025841M	HB1951 .R64 1995
OL1000602M	820.9/9287/0966	OL1025966M	DP402.C8 O46 1995
OL1000884M	863	OL1026156M	CS879 .R3 1995
OL1001241M	811/.54	OL1026211M	NC248.S22 A4 1992
OL1001366M	658.8	OL102629M	TJ563 .P66 1998
OL1001472M	330.94/051	OL1026596M	PQ3919.2.M2866 C83 1994
OL1001537M	635.9/67	OL1026624M	NA2500 .H64 1995
OL1001681M	651.8/536	OL1026668M	PN517 .L38 1994
OL1002024M	[Fic]	OL1026747M	MLCM 95/14118 (P)
OL1002068M	333.79/14	OL102706M	QA331.3 .M39 1998
OL1002147M	[Fic]	OL1027106M	PT8951.12.R5 M56 1980
OL1002396M	291.1/3	OL1027418M	MLCS 96/04520 (P)
OL1002411M	574.19/2/076	OL1027454M	HQ755.8 .T63 1995
OL100249M	342.73/087	OL1028019M	MLCS 97/02275 (T)
OL1002756M	741.5/973	OL1028055M	PZ70.C9 F657 1995
OL1003092M	[E]	OL1028253M	HC241 .G683 1995
OL1003197M	813/.54	OL1028626M	MLCS 95/08574 (U)
OL1004484M	398.24/528617	OL1028701M	HC371 .M45 nr. 122
OL1005872M	003	OL102878M	MLCS 2002/05802 (P)
OL1005937M	635/.0207	OL1029016M	IN PROCESS
OL1006217M	782.1/4/0973	OL102935M	KLA940 .K65 1990
OL1006312M	617.5/85	OL1029463M	KHA878 .G37 1996
OL1007188M	005.13/3	OL1029540M	KHH3003 .Q57 1995
OL1007427M	823/.8	OL1030429M	TX819.A1 T733 1991
OL1007504M	418	OL1030465M	PQ7298.12.A40 S26 1987
OL1007548M	747.7/97	OL1030780M	HM216 .G44 1993
OL1007742M	[E]	OL1030894M	SD409 .A38 1990
OL1008589M	332/.09172/6	OL1031493M	J451 .N4 1990z
OL1008660M	355/.0330536	OL1031615M	TR850 .F88 1993
OL1008703M	569/.67	OL1031659M	MLCS 93/14492 (P)
OL1008934M	297/.092/2	OL1031710M	KK2222 .L36 1993
OL1008978M	155.9/37	OL1031822M	G525 .M486 1991
OL1009410M	[E]	OL1032690M	HM261 .H47 1993
OL1009515M	570/.9	OL1032795M	PQ8098.23.O516 L38 1988
OL1009559M	953.63	OL1032953M	PL191 .I94 1992
OL1009793M	940.2/742/092	OL1033073M	LF3194.C65 A657 1992
OL1009887M	306.4/84	OL103482M	PG5438.V25 J47 1999
OL1009928M	808	OL1035916M	HG1615 .M32 1993
OL1009964M	730/.92	OL1036001M	KF27 .A3 1992h
OL1010856M	629.132/51	OL103608M	PT1937.A1 G35 1999
OL1011037M	304.2	OL1036126M	MLCS 98/02371 (H)
OL1011172M	613.7/042	OL1036553M	MLCM 93/05262 (D)
OL1011381M	158/.3	OL1036719M	KF3613.4 .C34
OL1011666M	133.5/946	OL1036755M	DR1313.3 .U54 1993
OL1011741M	363.739/463/09496	OL1037020M	DS557.8.M9 B55 1992b
OL1012371M	512.9/42	OL1037176M	DR82 .G46 1993
OL1012502M	289.9/4/0922	OL1037305M	PT2678.E3393 S36 1993
OL1013206M	363.2/2/02373	OL1037349M	HN530.2.A85 I86 1992
OL1013754M	863	OL1037631M	TK5105.5 .O653 1993
OL1013897M	937/.06	OL1038111M	AM79.5.B26 B34 1993
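The Dewey values in the sample above are not directly sortable: they contain slashes and bracketed labels such as [Fic] and [E]. A minimal Python sketch of one possible normalization follows. It assumes the slashes are segmentation marks that can simply be dropped, and it treats bracketed labels as "no number". The function name `normalize_ddc` is hypothetical and not part of Open Library.

```python
import re
from typing import Optional


def normalize_ddc(raw: str) -> Optional[str]:
    """Return a sortable Dewey number, or None if there is no number.

    Assumptions (not confirmed by the thread): slashes carry no
    ordering information, and bracketed values are labels, not numbers.
    """
    value = raw.strip()
    if value.startswith("["):
        return None
    value = value.replace("/", "")
    m = re.fullmatch(r"(\d{1,3})(\.\d+)?", value)
    if m is None:
        return None
    whole, fraction = m.group(1), m.group(2) or ""
    # Zero-pad the integer part so string ranges like [070 TO 079] work.
    return whole.zfill(3) + fraction


print(normalize_ddc("811/.54"))          # 811.54
print(normalize_ddc("820.9/9287/0966"))  # 820.992870966
print(normalize_ddc("[Fic]"))            # None
```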

BrittanyBunk commented 4 years ago

Great starting point!

For (my personal) reference:

I notice some of these have [Fic] or [E]. Since that makes Dewey less reliable (there are multiple E's, for example), I'm assuming the LoC classification would be the better ID to start with (i.e. more specific), especially since those books do have it. I'm assuming both will end up on each edition, so I won't worry.

tfmorris commented 4 years ago

This seems super low priority. I assume that the MANY other Solr bug fixes and feature requests will be waaayyy ahead of this.

BrittanyBunk commented 4 years ago

@tfmorris Here's my assumption of it: this first step is small, but what comes next is important when genres/sub-genres could be attached to editions - this'll clean up and help with the subject pages.

cclauss commented 4 years ago

@finnless suggested that I look at https://github.com/thisismattmiller/lcc-pdf-to-json which made it easy for me to create a lc_classifier function:

def lcc_to_subject(lcc: str) -> str:
    """
    >>> lcc_to_subject("ZA3201")
    'Information superhighway'
    """

The output could be Information resources (General): Information superhighway instead.
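The two output styles mentioned above could be sketched like this. The lookup table here is a toy that contains only the single entry quoted in this comment; the real data would come from lcc-pdf-to-json, whose structure is not shown in the thread. The `full_path` parameter is a hypothetical addition.

```python
# Toy table with only the one entry quoted in the thread.
LCC_PATHS = {
    "ZA3201": ["Information resources (General)", "Information superhighway"],
}


def lcc_to_subject(lcc: str, full_path: bool = False) -> str:
    """Return the most specific subject, or the whole path joined by ': '."""
    path = LCC_PATHS[lcc]
    return ": ".join(path) if full_path else path[-1]


print(lcc_to_subject("ZA3201"))
# Information superhighway
print(lcc_to_subject("ZA3201", full_path=True))
# Information resources (General): Information superhighway
```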

cclauss commented 4 years ago

lc_classification	clean	lc_subjects
HB1951 .R64 1995	HB1951	['Economic theory. Demography', 'Demography. Population. Vital events']
DP402.C8 O46 1995	DP402	['History of Spain', 'Local history and description', 'Other cities, towns, etc., A-Z']
CS879 .R3 1995	CS879	['Genealogy', 'By region or country']
NC248.S22 A4 1992	NC248	['Drawing. Design. Illustration', 'History of drawing']
TJ563 .P66 1998	TJ563	['Mechanical engineering and machinery', 'Steam engineering']
PQ3919.2.M2866 C83 1994	PQ3919	['French literature', 'Provincial, local, colonial, etc.']
NA2500 .H64 1995	NA2500	['Architecture', 'General works']
PN517 .L38 1994	PN517	['Literature (General)', 'Literary history', 'Collections']
MLCM 95/14118 (P)	MLCM95	[]
QA331.3 .M39 1998	QA331	['Mathematics', 'Analysis']
PT8951.12.R5 M56 1980	PT8951	['Norwegian literature', 'Individual authors or works', '1961-2000']
MLCS 96/04520 (P)	MLCS96	[]
HQ755.8 .T63 1995	HQ755	['The Family. Marriage. Women', 'The family. Marriage. Home', 'Eugenics']
MLCS 97/02275 (T)	MLCS97	[]
PZ70.C9 F657 1995	PZ70	['Fiction and juvenile belles lettres', 'Juvenile belles lettres']
HC241 .G683 1995	HC241	['Economic history and conditions', 'By region or country']
MLCS 95/08574 (U)	MLCS95	[]
HC371 .M45 nr. 122	HC371	['Economic history and conditions', 'By region or country']
MLCS 2002/05802 (P)	MLCS2002	[]
IN PROCESS	INPROCESS	[]
KLA940 .K65 1990	KLA940	['Russia, Soviet Union']
KHA878 .G37 1996	KHA878	['Argentina']
KHH3003 .Q57 1995	KHH3003	['Colombia']
TX819.A1 T733 1991	TX819	['Home economics', 'Cooking']
PQ7298.12.A40 S26 1987	PQ7298	['Spanish literature', 'Provincial, local, colonial, etc.', 'Spanish America']
HM216 .G44 1993	HM216	['Sociology', 'These are obsolete numbers no longer used']
SD409 .A38 1990	SD409	['Forestry', 'Sylviculture']
J451 .N4 1990z	J451	['General legislative and executive papers', 'Other regions and countries']
TR850 .F88 1993	TR850	['Photography', 'Cinematography. Motion pictures']
MLCS 93/14492 (P)	MLCS93	[]
KK2222 .L36 1993	KK2222	['Law of Germany', 'Commercial law', 'Commercial transactions', 'Banking. Stock exchange']
G525 .M486 1991	G525	['Geography (General)', 'Adventures, shipwrecks, buried treasure, etc.']
HM261 .H47 1993	HM261	['Sociology', 'These are obsolete numbers no longer used']
PQ8098.23.O516 L38 1988	PQ8098	['Spanish literature', 'Provincial, local, colonial, etc.', 'Spanish America']
PL191 .I94 1992	PL191	['Languages of Eastern Asia, Africa, Oceania', 'Ural-Altaic languages', 'Turkic languages']
LF3194.C65 A657 1992	LF3194	['Individual institutions', 'Germany']
PG5438.V25 J47 1999	PG5438	['Slavic. Baltic. Albanian', 'Slavic', 'Slovak']
HG1615 .M32 1993	HG1615	['Finance', 'Banking']
KF27 .A3 1992h	KF27	['Law of the United States (Federal)', 'Congressional documents']
PT1937.A1 G35 1999	PT1937	['German literature', 'Individual authors or works', '1700-ca. 1860/70', 'Goethe', 'Works']
MLCS 98/02371 (H)	MLCS98	[]
MLCM 93/05262 (D)	MLCM93	[]
KF3613.4 .C34	KF3613	['Law of the United States (Federal)', 'Social legislation', 'Social insurance']
DR1313.3 .U54 1993	DR1313	['History of Balkan Peninsula', 'Yugoslavia', 'History', 'By period', '1918-', 'Yugoslav War, 1991-1995']
DS557.8.M9 B55 1992b	DS557	['History of Asia', 'Southeast Asia', 'French Indochina', 'Vietnam. Annam', 'Vietnamese Conflict']
DR82 .G46 1993	DR82	['History of Balkan Peninsula', 'Bulgaria', 'History', 'By period', 'Turkish rule, 1396-1878']
PT2678.E3393 S36 1993	PT2678	['German literature', 'Individual authors or works', '1961-2000']
HN530.2.A85 I86 1992	HN530	['Social history and conditions. Social problems.', 'By region or country']
TK5105.5 .O653 1993	TK5105	['Electrical engineering. Electronics. Nuclear', 'Telecommunication']
AM79.5.B26 B34 1993	AM79	['Museums. Collectors and collecting', 'By country']
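Judging from the sample rows, the clean column looks like the leading letters plus the integer part of the class number, with whitespace removed. The Python sketch below reproduces that column under this assumption. It is a guess at the rule, not the code that generated the table, and `clean_lcc` is a hypothetical name.

```python
import re

# Leading capital letters followed by an optional run of digits.
_LEADING_CLASS = re.compile(r"[A-Z]+\d*")


def clean_lcc(raw: str) -> str:
    """Strip whitespace, then keep the leading letters and digits."""
    compact = re.sub(r"\s+", "", raw.upper())
    match = _LEADING_CLASS.match(compact)
    return match.group() if match else ""


print(clean_lcc("HB1951 .R64 1995"))         # HB1951
print(clean_lcc("PQ3919.2.M2866 C83 1994"))  # PQ3919
print(clean_lcc("MLCM 95/14118 (P)"))        # MLCM95
print(clean_lcc("IN PROCESS"))               # INPROCESS
```

Values such as MLCM95 and INPROCESS are not real classes, which matches the empty lc_subjects lists in the table.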

cdrini commented 4 years ago

@cclauss That looks awesome! Hmmm, we should display these on pages where we have an LCC; maybe something like this?

Then once these are in solr, we can make each level clickable, leading to a search page 😍 . But I don't think this needs to be blocked by that happening. They still add value + improve SEO even if they're just text!

cdrini commented 4 years ago

@tfmorris After all the work that went into #1067, which largely blocked most modification to solr until completion (and even now it's still stuck in PR -_-), I wanted to work on something small, scoped, and impactful that takes advantage of/showcases our new super power (full reindexing!).

BrittanyBunk commented 4 years ago

@cdrini I spoke with @cclauss and came up with a new way of thinking about it. We could do both my idea and your format, that's fine. I just want to describe mine and how it would look alongside yours. Not every LC classification led to a corresponding class in @cclauss's example, so to make sure every one gets a class, here's my process: use only the first two letters to generate genres, where the first letter is the genre and the second letter is the sub-genre. The names would come from the full list provided by @cclauss: https://www.questionpoint.org/crs/html/help/en/ask/ask_map_lcctoddc.html . Here's what it'd look like at the end (ignore the poor formatting):

cclauss commented 4 years ago

@BrittanyBunk and I slacked on this 12 hours ago and she proposed the same two-letter thing. The letters A, D, and J threw me because that table does not provide single-letter meanings, but she provided them to me. So I will propose a new PR that shows us how to get the first three classifications so we can see how it looks, and then we can choose whether to use just the two letters or letters plus numbers.

BrittanyBunk commented 4 years ago

@cclauss OK, I see why you're running into issues. The site you showed me is incomplete (it's used for Dewey Decimal conversions, and Dewey Decimal is not as robust as the LoC classification). The official one to use is complete. This equivalent should be the complete version to use (I'd download it to a doc in case it gets changed, and it might need to be double-checked).

cclauss commented 4 years ago

AC --> General Works: Collections. Series. Collected works
AE --> General Works: Encyclopedias

Long keys: [DAW, DJK, KBM, KBP, KBR, KBS, KBT, KBU, KD/KDK, KDZ, KJ-KKZ, KL-KWX, KU/KUQ]

That parses to 230 records:

{
  "A": "General Works",
  "AC": "Collections. Series. Collected works",
  "AE": "Encyclopedias",
  "AG": "Dictionaries and other general reference works",
  "AI": "Indexes",
  "AM": "Museums. Collectors and collecting",
  "AN": "Newspapers",
  "AP": "Periodicals",
  "AS": "Academies and learned societies",
  "AY": "Yearbooks. Almanacs. Directories",
  "AZ": "History of scholarship and learning. The humanities",
  "B": "Philosophy, Psychology, Religion",
  "BC": "Logic",
  "BD": "Speculative philosophy",
  "BF": "Psychology",
  "BH": "Aesthetics",
  "BJ": "Ethics",
  "BL": "Religions. Mythology. Rationalism",
  "BM": "Judaism",
  "BP": "Islam. Bahaism. Theosophy, etc.",
  "BQ": "Buddhism",
  "BR": "Christianity",
  "BS": "The Bible",
  "BT": "Doctrinal theology",
  "BV": "Practical Theology",
  "BX": "Christian Denominations",
  "C": "Auxiliary Sciences of History",
  "CB": "History of Civilization",
  "CC": "Archaeology",
  "CD": "Diplomatics. Archives. Seals",
  "CE": "Technical Chronology. Calendar",
  "CJ": "Numismatics",
  "CN": "Inscriptions. Epigraphy",
  "CR": "Heraldry",
  "CS": "Genealogy",
  "CT": "Biography",
  "D": "History, General and Old World",
  "DA": "Great Britain",
  "DAW": "Central Europe",
  "DB": "Czechoslovakia",
  "DC": "Monaco",
  "DD": "Germany",
  "DE": "Greco-Roman World",
  "DF": "Greece",
  "DG": "Malta",
  "DH": "Benelux Countries",
  "DJ": "Netherlands (Holland)",
  "DJK": "Eastern Europe (General)",
  "DK": "Poland",
  "DL": "Northern Europe. Scandinavia",
  "DP": "Portugal",
  "DQ": "Switzerland",
  "DR": "Balkan Peninsula",
  "DS": "Asia",
  "DT": "Africa",
  "DU": "Oceania (South Seas)",
  "DX": "Romanies",
  "E": "History of America",
  "F": "Local History of the United States and British, Dutch, French, and Latin America",
  "G": "Geography. Anthropology. Recreation",
  "GA": "Mathematical geography. Cartography",
  "GB": "Physical geography",
  "GC": "Oceanography",
  "GE": "Environmental Sciences",
  "GF": "Human ecology. Anthropogeography",
  "GN": "Anthropology",
  "GR": "Folklore",
  "GT": "Manners and customs (General)",
  "GV": "Recreation. Leisure",
  "H": "Social sciences",
  "HA": "Statistics",
  "HB": "Economic theory. Demography",
  "HC": "Economic history and conditions",
  "HD": "Industries. Land use. Labor",
  "HE": "Transportation and communications",
  "HF": "Commerce",
  "HG": "Finance",
  "HJ": "Public finance",
  "HM": "Sociology (General)",
  "HN": "Social history and conditions. Social problems. Social reform",
  "HQ": "The family. Marriage, Women and Sexuality",
  "HS": "Societies: secret, benevolent, etc.",
  "HT": "Communities. Classes. Races",
  "HV": "Social pathology. Social and public welfare. Criminology",
  "HX": "Socialism. Communism. Anarchism",
  "J": "Political science",
  "JA": "Political science (General)",
  "JC": "Political theory",
  "JF": "Political institutions and public administration",
  "JJ": "Political institutions and public administration (North America)",
  "JK": "Political institutions and public administration (United States)",
  "JL": "Political institutions and public administration (Canada, Latin America, etc.)",
  "JN": "Political institutions and public administration (Europe)",
  "JQ": "Political institutions and public administration (Asia, Africa, Australia, Pacific Area, etc.)",
  "JS": "Local government. Municipal government",
  "JV": "Colonies and colonization. Emigration and immigration. International migration",
  "JX": "International law, see JZ and KZ (obsolete)",
  "JZ": "International relations",
  "K": "Law",
  "KB": "Religious law in general. Comparative religious law. Jurisprudence",
  "KBM": "Jewish law",
  "KBP": "Islamic law",
  "KBR": "History of canon law",
  "KBS": "Canon law of Eastern churches",
  "KBT": "Canon law of Eastern Rite Churches in Communion with the Holy See of Rome",
  "KBU": "Law of the Roman Catholic Church. The Holy See",
  "KD/KDK": "United Kingdom and Ireland",
  "KDZ": "America. North America",
  "KE": "Canada",
  "KF": "United States",
  "KG": "West Indies. Caribbean area",
  "KH": "South America",
  "KJ-KKZ": "Europe",
  "KL-KWX": "Asia and Eurasia, Africa, Pacific Area, and Antarctica",
  "KU/KUQ": "Law of Australia and New Zealand",
  "KZ": "Law of nations",
  "L": "Education",
  "LA": "History of education",
  "LB": "Theory and practice of education",
  "LC": "Special aspects of education",
  "LD": "United States",
  "LE": "America (except United States)",
  "LF": "Europe",
  "LG": "Asia, Africa, Indian Ocean islands, Australia, New Zealand, Pacific islands",
  "LH": "College and school magazines and papers",
  "LJ": "Student fraternities and societies, United States",
  "LT": "Textbooks",
  "M": "Music",
  "ML": "Literature on music",
  "MT": "Instruction and study",
  "N": "Fine Arts",
  "NA": "Architecture",
  "NB": "Sculpture",
  "NC": "Drawing. Design. Illustration",
  "ND": "Painting",
  "NE": "Print media",
  "NK": "Decorative arts",
  "NX": "Arts in general",
  "P": "Language and Literature",
  "PA": "Greek language and literature. Latin language and literature",
  "PB": "Modern languages. Celtic languages and literature",
  "PC": "Romanic languages",
  "PD": "Germanic languages. Scandinavian languages",
  "PE": "English language",
  "PF": "West Germanic languages",
  "PG": "Slavic languages and literatures. Baltic languages. Albanian language",
  "PH": "Uralic languages. Basque language",
  "PJ": "Oriental languages and literatures",
  "PK": "Indo-Iranian languages and literatures",
  "PL": "Languages and literatures of Eastern Asia, Africa, Oceania",
  "PM": "Hyperborean, Native American, and artificial languages",
  "PN": "Literature (General)",
  "PQ": "Portuguese literature",
  "PR": "English literature",
  "PS": "American literature",
  "PT": "Swedish literature",
  "PZ": "Fiction and juvenile belles lettres",
  "Q": "Science",
  "QA": "Mathematics",
  "QB": "Astronomy",
  "QC": "Physics",
  "QD": "Chemistry",
  "QE": "Geology",
  "QH": "Biology",
  "QK": "Botany",
  "QL": "Zoology",
  "QM": "Human anatomy",
  "QP": "Physiology",
  "QR": "Microbiology",
  "R": "Medicine",
  "RA": "Public aspects of medicine",
  "RB": "Pathology",
  "RC": "Internal medicine",
  "RD": "Surgery",
  "RE": "Ophthalmology",
  "RF": "Otorhinolaryngology",
  "RG": "Gynecology and Obstetrics",
  "RJ": "Pediatrics",
  "RK": "Dentistry",
  "RL": "Dermatology",
  "RM": "Therapeutics. Pharmacology",
  "RS": "Pharmacy and materia medica",
  "RT": "Nursing",
  "RV": "Botanic, Thomsonian, and Eclectic medicine",
  "RX": "Homeopathy",
  "RZ": "Other systems of medicine",
  "S": "Agriculture",
  "SB": "Horticulture. Plant propagation. Plant breeding",
  "SD": "Forestry. Arboriculture. Silviculture",
  "SF": "Animal husbandry. Animal science",
  "SH": "Aquaculture. Fisheries. Angling",
  "SK": "Hunting",
  "T": "Technology",
  "TA": "Engineering Civil engineering (General).",
  "TC": "Hydraulic engineering. Ocean engineering",
  "TD": "Environmental technology. Sanitary engineering",
  "TE": "Highway engineering. Roads and pavements",
  "TF": "Railroad engineering and operation",
  "TG": "Bridges",
  "TH": "Building construction",
  "TJ": "Mechanical engineering and machinery",
  "TK": "Electrical engineering. Electronics. Nuclear engineering",
  "TL": "Motor vehicles. Aeronautics. Astronautics",
  "TN": "Mining engineering. Metallurgy",
  "TP": "Chemical technology",
  "TR": "Photography",
  "TS": "Manufacturing engineering. Mass production",
  "TT": "Handicrafts. Arts and crafts",
  "TX": "Home economics",
  "U": "Military Science",
  "UA": "Armies: Organization, distribution, military situation",
  "UB": "Military administration",
  "UC": "Military maintenance and transportation",
  "UD": "Infantry",
  "UE": "Cavalry. Armor",
  "UF": "Artillery",
  "UG": "Military engineering. Air forces",
  "UH": "Other military services",
  "V": "Naval Science",
  "VA": "Navies: Organization, distribution, naval situation",
  "VB": "Naval administration",
  "VC": "Naval maintenance",
  "VD": "Naval seamen",
  "VE": "Marines",
  "VF": "Naval ordnance",
  "VG": "Minor services of navies",
  "VK": "Navigation. Merchant marine",
  "VM": "Naval architecture. Shipbuilding. Marine engineering",
  "Z": "Bibliography. Library Science. Information resources",
  "ZA": "Information resources/materials"
}
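A minimal Python sketch of how the table above could turn an LCC into a class/subclass path, trying the longest letter prefix first so that three-letter keys such as DAW win over DA. `LCC_NAMES` is a short excerpt of the JSON above and `lcc_class_path` is a hypothetical name. Combined keys like "KD/KDK" or "KJ-KKZ" would need to be expanded before this lookup could find them.

```python
import re

# Excerpt of the table above; the full table has 230 entries.
LCC_NAMES = {
    "D": "History, General and Old World",
    "DA": "Great Britain",
    "DAW": "Central Europe",
    "E": "History of America",
    "H": "Social sciences",
    "HB": "Economic theory. Demography",
}


def lcc_class_path(lcc: str) -> list:
    """Return [class name, subclass name] for the leading letters."""
    m = re.match(r"[A-Z]+", lcc.strip().upper())
    if m is None:
        return []
    letters = m.group()
    path = []
    top = LCC_NAMES.get(letters[0])
    if top is not None:
        path.append(top)
    # Try the longest prefix first (e.g. DAW before DA).
    for end in range(len(letters), 1, -1):
        name = LCC_NAMES.get(letters[:end])
        if name is not None:
            path.append(name)
            break
    return path


print(lcc_class_path("HB1951 .R64 1995"))
# ['Social sciences', 'Economic theory. Demography']
print(lcc_class_path("DAW1001"))
# ['History, General and Old World', 'Central Europe']
```

With the full table loaded, a non-class value such as "MLCS 96/04520 (P)" would match "M" and "ML" by prefix, so values like that would presumably need to be filtered out first.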

BrittanyBunk commented 4 years ago

Cool! So now that we have that, we could use this for the DDC too! I tried to create an excel with LCC -> DDC and vice versa, but didn't get far enough. Maybe it could be coded, but here's the start: https://drive.google.com/file/d/1Yu-srlXD_FcUUTRV9lwseXrR7qEsNQ9a/view?usp=sharing

cdrini commented 4 years ago

UI-wise, let's start with just classes for now; I think having subjects, classes, genre, and sub-genre might be a little too much/confusing. I think the classes might be better since it also displays full granularity (so folks can dive in at any point they wish).

cclauss commented 4 years ago

Agreed. Let’s also get LC classes working smoothly & consistently before also doing DDC. The .pdf you added is great but highlights the complexity of getting it right.

cdrini commented 4 years ago

Baby steps :) To quote one of my new favourite laws (thanks @LeadSongDog !)

Gall's Law: A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system.

BrittanyBunk commented 4 years ago

@cdrini Just so we're all on the same page, you mean what's on https://www.loc.gov/catdir/cpso/lcco/ only right? Like https://openlibrary.org/books/OL103608M/Johann_Wolfgang_Goethe_Faust-Dichtungen. would show "Language and Literature" or what's in the image you posted?

cdrini commented 4 years ago

I mean the classes section displayed here: https://github.com/internetarchive/openlibrary/issues/3290#issuecomment-608143138

BrittanyBunk commented 4 years ago

@cdrini @cclauss So something like this then (I completed the full path) (*Ignore the fonts - just a mockup):

Outline is just to use the name on the LoC site - but I'll research the correct term. @seabelis would you know by any chance?

cdrini commented 4 years ago

Ahhh, I see what you mean. Yes, exactly :+1:

BrittanyBunk commented 4 years ago

I'm thinking LCC titles, because the 'outline' is the name of the entire listing and we're just showing the individual titles of the LCC letters and numbers, but will defer to expert opinion if there is a name for this.

seabelis commented 4 years ago

> Outline is just to use the name on the LoC site - but I'll research the correct term. @seabelis would you know by any chance?

Outline is the title of that specific page because the contents of the page is an outline of the LC Classification system.

Let's make the heading what it is: "Library of Congress Classification" or "LC Classification." It would be nice to include the call number itself, similar to what is displayed on the LC catalog record as "browse by shelf order", with the added functionality of the class names themselves as links. See https://catalog.loc.gov/vwebv/holdingsInfo?searchId=33155&recCount=25&recPointer=0&bibId=21468183, about half-way down the record.

BrittanyBunk commented 4 years ago

@seabelis thanks for helping me out.

ok. I think what you say could work, "Library of Congress Classification" or "LC Classification". I'm going to add "LCC" to the list to round it out.
I think if it's grouped together, it could go under the Classification section of the OL; we'd just need to change the call number to a different name, so they're not both the same: https://openlibrary.org/books/OL103608M/Johann_Wolfgang_Goethe_Faust-Dichtungen.

seabelis commented 4 years ago

You are showing this at the edition level. I understood this was going to be used at the work-level.

BrittanyBunk commented 4 years ago

I didn't see it being there unless we're calling it a 'genre'. However, since it's a classification based on a classification number of a book, this seems to be a better place. I will double check on this right now.

seabelis commented 4 years ago

> I didn't see it being there unless we're calling it a 'genre'.

Why? This is not a genre. This is a classification.

BrittanyBunk commented 4 years ago

Exactly! That's why it goes underneath the classification tab, as each edition has a different call number. Even though most are in the same category as each other, this is just in case they aren't: https://openlibrary.org/books/OL24172776M/Hamlet and https://openlibrary.org/books/OL24614660M/Der_erste_deutsche_B%C3%BChnen-Hamlet - different editions, different call numbers. They're the same category, so I'll keep looking - as I want to be sure of where to place it - as you're right - it's unclear.

BrittanyBunk commented 4 years ago

I can't find one right now, so I wouldn't know where would be better - the works page or under Classifications.

If it's on the works page, it'll work by being with the subject tags
If it's under the Classifications, it'll be grouped with the rest of the LoC values.

cdrini commented 4 years ago

Yes, it is a little awkward since the LCC is stored on the edition (and can vary!). But I think we should display it on the work near the subjects section, because I think that will make it easier to find for non-librarian users. I chose the word "classes" instead of "Classifications" or "LCC" also hoping that might be easier for novice users. When DDC are eventually added, they can also appear in the "classes" section.

I think we need to wait for some ui demos to be implemented to see how these look and feel before we can make a final decision :)
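Since the LCC lives on editions but would be shown on the work, one simple approach is to collect the distinct values across a work's editions. The Python sketch below assumes each edition is a dict whose lc_classifications field is a list of strings. The function name `work_lccs` and the grouping of the two sample values under one work are made up for illustration.

```python
def work_lccs(editions: list) -> list:
    """Collect the distinct LCC values found on a work's editions, sorted."""
    seen = set()
    for edition in editions:
        for lcc in edition.get("lc_classifications", []):
            seen.add(lcc.strip())
    return sorted(seen)


# Illustrative grouping only; these values come from the sample table.
editions = [
    {"lc_classifications": ["HC241 .G683 1995"]},
    {"lc_classifications": ["HB1951 .R64 1995", "HC241 .G683 1995"]},
    {},  # an edition with no LCC at all
]
print(work_lccs(editions))
# ['HB1951 .R64 1995', 'HC241 .G683 1995']
```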

BrittanyBunk commented 4 years ago

@cdrini Having it on the works page should be fine, as the LCC should be the same for all the editions - they all should be of the same topic.

I didn't use 'classes' as it's a combination of classes and subclasses (so I got confused), but I see what you mean. I'll wait until those are finished then before proceeding further.

tfmorris commented 4 years ago

@cdrini There are 48 open Solr bugs, some over a decade old. Do none of them meet your criteria?

If you want help choosing, I'll suggest #178 which is small, self-contained, HUGELY impactful, and just over a decade old, having been first reported March 13, 2010. By simply changing the definition of a single field, the author's name, users would be able to find this record with 71 works for René-Aubert Vertot rather than this orphan with a single work when they search for Rene Vertot.

If you search for Renee Shann, you won't find ANY of the 107 works that OpenLibrary has cataloged.

If a librarian like @seabelis wanted to merge the 9 different Rene Char records, they'd need to search twice and then stitch the results together by hand.

In the face of all this, and the myriad other Solr issues, we're going to invent an entirely new, never before requested, issue to waste time on?

That is doing our patrons a HUGE disservice.

BrittanyBunk commented 4 years ago

@tfmorris I don't like getting involved in other people's discussions, but some things are important to say. This github issue that @cdrini's working on is something that's been going on for a while and requires a lot of people's help and right now's the moment that the resources are here. Also, doing this helps with future developments. It's an infrastructure that will make books easier to find - that includes the #178 you mentioned. Drini mentioned in the community meeting that reindexing the solr project is going to fix the inability to search by non-English characters, so it seems a little misguided. I would read up on the community meeting notes, especially 3-31-20 - which shows what I mean.

BrittanyBunk commented 4 years ago

Sorry @cdrini for continuing after you said you wanted to focus on getting the indexing right, but since the labeling was discussed in the meeting, I just wanted to give one more input on this: I realize that 'LCC titles' might be more appropriate than 'Library of Congress Classification'. The reason is that the call number is already called that on the OL, the LCCO page says that it lists the letters (and I'm assuming numbers) and 'titles' of an LC classification (first sentence), and the LoC calls the call number the LC classification (although they are inconsistent on some pages). @seabelis @cclauss

cdrini commented 4 years ago

@BrittanyBunk Thanks Brittany! Reindexing is one of the blockers for allowing us to work on #178; it unfortunately doesn't impact the issue itself; that must've been a typo in the notes.

@tfmorris Yep; I'm aware of that issue. I believe updating to solr 8 (#3317) is more important (which is why I've also taken that up in my milestone for this month). Trying to fix #178 before #3317 would require investing time into installing solr 3.6 specific plugins / config, all of which would have to get redone once we do #3317. We've had this discussion before; one of my first PRs on openlibrary, #599, was a fix for #178, so I'm fully aware of how important that issue is. We decided that although using ASCII Folding (which is what I did in #599) was an improvement, it wasn't that great for non-English languages, and that the ICUFolding filter (as you've done in your solr PR) was most correct (see https://github.com/internetarchive/openlibrary/pull/599#issuecomment-345490579 ). This filter requires us to add plugins to solr (which I even did on a branch off #599). But I think adding plugins to solr in 3.6 would be a waste of time, since my guess would be the plugin flow has changed.

I worked on and completed the first issue that was blocking #178 (re-indexable solr), and am planning on the second which is ~blocking #178 (#3317). I am also working on this current issue because it addresses issues brought up in one of the community calls; it made a lot of people (myself included) excited; it provides infrastructure which will allow for a whole host of features that will improve the user experience; it allows us to take advantage of a carefully curated, librarian-standard data field; it further tests the solr re-index flow; and it showcases the importance of a re-indexable solr to people who might not realize how important it is. I apologize if that wasn't clearly communicated, but I wish you would be a little less quick to jump to accusations. We have been and are aligned on a lot of the same overall goals, and I am making progress towards them.

cdrini commented 4 years ago

I'm happy to accept #599 as a temporary patch until we get ICUFolding with solr 8; does that seem reasonable? I can include that in the May 1 solr reindex.
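For readers unfamiliar with folding, here is a rough Python approximation of what accent folding does to author names. It uses Unicode decomposition from the standard library. It is not solr's ASCII folding or ICU folding filter, and `fold_accents` is a hypothetical name. It only illustrates why a search for "Rene" could then match "René".

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Decompose characters, drop combining marks, and case-fold."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(fold_accents("René-Aubert Vertot"))                    # rene-aubert vertot
print(fold_accents("René Char") == fold_accents("Rene Char"))  # True
```

If both the indexed name and the query are folded the same way, the accented and unaccented spellings compare equal. Characters that do not decompose into a base letter plus marks are left unchanged by this sketch, which is one reason a fuller folding filter is preferable for non-English names.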

BrittanyBunk commented 4 years ago

@cdrini Maybe the notes can be fixed? Also, how come you're trying to work on #718, which is already set up and closed? I would like to set up an enthusiast and beta tester's (EBT) wiki, but it's difficult with the current situation lol. Guess I'll wait until the Solr reindex, data dump, and series are finished before starting. If everything's set up, newcomers can move on to the next steps with the tools they need and not worry about anything unnecessary.

It's really awesome to see everything come together, step-by-step. I would share my vision of the EBT page, but I didn't set it up yet. If anyone wants me to, let me know.

cdrini commented 4 years ago

Ahhh, I meant #178 :P Fixed. I'll fix the notes :+1:

BrittanyBunk commented 4 years ago

@cdrini Wow! Clear. I agree - it does make sense to move forward in order to address what's in the back to keep up. That's how I've done it in my life, so it works - I think you got a great plan and look forward to the changes :) Thanks for helping with the corrections.

cdrini commented 4 years ago

I created a new issue for getting the LCC class names from the LCC, since this current issue is going to be closed soon :) See #3396

cdrini commented 4 years ago

Closed by the PRs mentioned here. Unfortunately it's still living on dev.openlibrary.org , but here's a nice little demo:

Library of Congress: https://dev.openlibrary.org/people/ScarTissue/lcc-list
Dewey Decimal: https://dev.openlibrary.org/people/ScarTissue/ddc-list
