libraryhackers / library-callnumber-lc

Perl and Python modules for normalizing Library of Congress call numbers

Wrong sorting key produced for juvenile belle-lettres #9

Open gmcharlt opened 9 years ago

gmcharlt commented 9 years ago

Original issue 9 created by gmcharlt on 2015-03-06T20:30:10.000Z:

What steps will reproduce the problem?

  1. Input records with these call numbers:
       PZ7.M3567585 Bs 1997x
       PZ7.M3567585 Km 1997
       PZ7.M3567585 Mh 1997x
       PZ7.M3567585 Stp 1997x
       PZ7.M3567585 Sx 1998
       PZ7.M3567585 Tr 1986
       PZ7.M3567585 Wel 1995x
  2. The records will be sorted in this order:
       PZ7.M3567585 Stp 1997x
       PZ7.M3567585 Wel 1995x
       PZ7.M3567585 Bs 1997x
       PZ7.M3567585 Km 1997
       PZ7.M3567585 Mh 1997x
       PZ7.M3567585 Sx 1998
       PZ7.M3567585 Tr 1986

What is the expected output? What do you see instead?

The correct sort is:

    PZ7.M3567585 Bs 1997x
    PZ7.M3567585 Km 1997
    PZ7.M3567585 Mh 1997x
    PZ7.M3567585 Stp 1997x
    PZ7.M3567585 Sx 1998
    PZ7.M3567585 Tr 1986
    PZ7.M3567585 Wel 1995x

What version of the product are you using? On what operating system?

Version 0.23 on Windows 7 Professional with Strawberry Perl.

Please provide any additional information below.

Following is the sort key, followed by the input text, for each call number in the above sample:

    PZ0007 M3567585 STP 01997X    PZ7.M3567585 Stp 1997x
    PZ0007 M3567585 WEL 01995X    PZ7.M3567585 Wel 1995x
    PZ0007 M3567585 B S 01997X    PZ7.M3567585 Bs 1997x
    PZ0007 M3567585 K M1997       PZ7.M3567585 Km 1997
    PZ0007 M3567585 M H 01997X    PZ7.M3567585 Mh 1997x
    PZ0007 M3567585 S X1998       PZ7.M3567585 Sx 1998
    PZ0007 M3567585 T R1986       PZ7.M3567585 Tr 1986

There are three distinct treatments of the alphabetic cutter:

  1. Three (or more) alphabetic characters are read as a group, followed by the normalized year.
  2. Two alphabetic characters followed by a four-digit year are read as a single character, followed by the second character prepended to the four-digit year.
  3. Two alphabetic characters followed by a four-digit year with an "x" are read as three separate elements: the first character, the second character, and the normalized date.

The three (or more) character cutter is also treated differently in that it is preceded by two spaces, whereas the two-character cutter is preceded by one space.
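To illustrate why these keys produce the observed order, here is a minimal sketch that sorts the reported keys as plain strings. It assumes the three-character cutters really are preceded by two spaces, as described above; that spacing is reconstructed from the description rather than copied from program output, and the variable names are only illustrative.

```python
# Reported sort keys mapped to their input call numbers.
# ASSUMPTION: the double space before STP and WEL is reconstructed
# from the description above, not copied from actual program output.
reported = {
    "PZ0007 M3567585  STP 01997X": "PZ7.M3567585 Stp 1997x",
    "PZ0007 M3567585  WEL 01995X": "PZ7.M3567585 Wel 1995x",
    "PZ0007 M3567585 B S 01997X": "PZ7.M3567585 Bs 1997x",
    "PZ0007 M3567585 K M1997": "PZ7.M3567585 Km 1997",
    "PZ0007 M3567585 M H 01997X": "PZ7.M3567585 Mh 1997x",
    "PZ0007 M3567585 S X1998": "PZ7.M3567585 Sx 1998",
    "PZ0007 M3567585 T R1986": "PZ7.M3567585 Tr 1986",
}

# A space (0x20) compares lower than any letter, so the keys with the
# extra leading space sort ahead of every two-character cutter.
observed_order = [reported[key] for key in sorted(reported)]

for call_number in observed_order:
    print(call_number)
```

Under that assumption the output starts with Stp and Wel, followed by Bs, Km, Mh, Sx and Tr, which matches the incorrect order in step 2.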

The correct sort key for each call number follows:

    PZ7.M3567585 Bs 1997x     PZ0007 M3567585 BS 01997X
    PZ7.M3567585 Km 1997      PZ0007 M3567585 KM 01997
    PZ7.M3567585 Mh 1997x     PZ0007 M3567585 MH 01997X
    PZ7.M3567585 Stp 1997x    PZ0007 M3567585 STP 01997X
    PZ7.M3567585 Sx 1998      PZ0007 M3567585 SX 01998
    PZ7.M3567585 Tr 1986      PZ0007 M3567585 TR 01986
    PZ7.M3567585 Wel 1995x    PZ0007 M3567585 WEL 01995X
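One way to get keys of this shape is to treat the whole alphabetic cutter as a single uppercase token, regardless of its length, and to zero-pad the year separately. The sketch below is not the module's actual code: `PZ_PATTERN` and `pz_sort_key` are hypothetical names, and the pattern only covers the PZ-style shape discussed in this report (class, number, first cutter, alphabetic second cutter, optional year with an optional letter suffix).

```python
import re

# Hypothetical pattern for the call-number shape in this report only:
# class letters, class number, ".", first cutter, alphabetic second
# cutter, and an optional four-digit year with an optional suffix.
PZ_PATTERN = re.compile(
    r"^(?P<cls>[A-Z]+)(?P<num>\d+)\.(?P<cutter>[A-Z]\d+)"
    r"\s+(?P<mark>[A-Za-z]+)"
    r"(?:\s+(?P<year>\d{4})(?P<suffix>[A-Za-z]*))?$"
)


def pz_sort_key(call_number: str) -> str:
    """Build a sort key that keeps the alphabetic cutter as one token."""
    match = PZ_PATTERN.match(call_number.strip())
    if match is None:
        raise ValueError(f"unsupported call number: {call_number!r}")
    parts = [
        # Class letters plus the class number padded to four digits.
        f"{match['cls']}{int(match['num']):04d}",
        match["cutter"],
        # The second cutter is never split, whatever its length.
        match["mark"].upper(),
    ]
    if match["year"]:
        # Year padded to five digits, followed by any suffix such as "x".
        parts.append(f"{int(match['year']):05d}{match['suffix'].upper()}")
    return " ".join(parts)


sample = [
    "PZ7.M3567585 Stp 1997x",
    "PZ7.M3567585 Wel 1995x",
    "PZ7.M3567585 Bs 1997x",
    "PZ7.M3567585 Km 1997",
    "PZ7.M3567585 Mh 1997x",
    "PZ7.M3567585 Sx 1998",
    "PZ7.M3567585 Tr 1986",
]
for call_number in sorted(sample, key=pz_sort_key):
    print(pz_sort_key(call_number), "<-", call_number)
```

Sorting the sample by this key gives Bs, Km, Mh, Stp, Sx, Tr, Wel, which is the expected order from the report.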

The same type of alphabetic cuttering can be found in PZ3 and PZ4.

gmcharlt commented 9 years ago

Comment #1 originally posted by gmcharlt on 2015-03-07T14:27:34.000Z:

The second cutters in PZ3 (and PZ4) can optionally have another digit as part of the cutter. Here is a portion of the Library of Congress's shelflist for Agatha Christie (notice that in one case there is even another alphabetic character following the digit):

    PZ3.C4637 Aac 1977
    PZ3.C4637 Ab
    PZ3.C4637 Ab2
    PZ3.C4637 Ab6
    PZ3.C4637 An
    PZ3.C4637 Bi
    PZ3.C4637 Bi2
    PZ3.C4637 Bi2
    PZ3.C4637 Bi2a
    PZ3.C4637 Bl
    PZ3.C4637 Dei
    PZ3.C4637 Dei8
    PZ3.C4637 Deo
    PZ3.C4637 Deo2
    PZ3.C4637 Des
    PZ3.C4637 Mp
    PZ3.C4637 Mp11
    PZ3.C4637 Mpm
    PZ3.C4637 Mr
    PZ3.C4637 Mr15
    PZ3.C4637 Mr15
    PZ3.C4637 Mr23
    PZ3.C4637 Mr4
    PZ3.C4637 Mr6
    PZ3.C4637 Mr6
    PZ3.C4637 Mt
    PZ3.C4637 My
    PZ3.C4637 My10
    PZ3.C4637 My11
    PZ3.C4637 My3
    PZ3.C4637 Myq
    PZ3.C4637 Mys
    PZ3.C4637 Mys 3
    PZ3.C4637 Pas
    PZ3.C4637 Pas 1970b
    PZ3.C4637 Poi
    PZ3.C4637 Poig 1977
    PZ3.C4637 Poj3

I have copied this in the order in which LC's catalog displays the numbers. The order for the "...Mr" and "...My" numbers is actually incorrect. The digit following the cutter should be treated as an integer, so that the correct order is

    PZ3.C4637 Mr
    PZ3.C4637 Mr4
    PZ3.C4637 Mr6
    PZ3.C4637 Mr6
    PZ3.C4637 Mr15
    PZ3.C4637 Mr15
    PZ3.C4637 Mr23

and

    PZ3.C4637 My
    PZ3.C4637 My3
    PZ3.C4637 My10
    PZ3.C4637 My11
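The integer treatment could be sketched as a sort key that splits the second cutter into its letters, its number, and any trailing letter. This is only an illustration: `MARK_PATTERN` and `work_mark_key` are hypothetical names, the function takes just the second cutter (for example "Mr15"), and it does not handle spaced forms such as "Mys 3" or a following year.

```python
import re

# Hypothetical pattern: letters, optional digits, optional trailing
# letters (as in "Bi2a"). Input is the second cutter only.
MARK_PATTERN = re.compile(r"^([A-Za-z]+)(\d*)([A-Za-z]*)$")


def work_mark_key(mark: str):
    """Return a tuple that compares the numeric part as an integer."""
    match = MARK_PATTERN.match(mark)
    if match is None:
        raise ValueError(f"unsupported second cutter: {mark!r}")
    letters, digits, trailing = match.groups()
    # A cutter without digits gets -1 so it sorts before any numbered one.
    number = int(digits) if digits else -1
    return (letters.upper(), number, trailing.upper())


catalog_order = ["Mr", "Mr15", "Mr15", "Mr23", "Mr4", "Mr6", "Mr6"]
print(sorted(catalog_order, key=work_mark_key))
# ['Mr', 'Mr4', 'Mr6', 'Mr6', 'Mr15', 'Mr15', 'Mr23']

print(sorted(["My", "My10", "My11", "My3"], key=work_mark_key))
# ['My', 'My3', 'My10', 'My11']
```

A fixed-width string key (for example, zero-padding the number) would give the same ordering and may fit better if the sort key has to remain a single string.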