internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Normalize Dewey Decimal Class (DDC) format from MARC records (ie 082 field) #614

Open LeadSongDog opened 6 years ago

LeadSongDog commented 6 years ago

@EdwardBetts I think this is your bailiwick:

We seem to have inconsistent representation of DDC. Sometimes it's just the base class, sometimes the extended one, sometimes it includes a trailing publication-year. Sometimes it embeds spaces, periods, apostrophes, and/or slashes in various combinations. Surely these should be standardized.

Segmentation of DDC numbers (as recorded in the LoC data) is discussed at https://www.loc.gov/aba/dewey/segmentation.html and their MARC 21 representation in the 082 field is discussed at https://www.loc.gov/marc/bibliographic/bd082.html
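To make the inconsistencies concrete, here is a rough sketch of what a normalizer might do. It is not the existing build_record.py code: `normalize_ddc` is a hypothetical name, and the rules it applies are assumptions (strip slash, apostrophe, and prime segmentation marks; close up spaces around the decimal point; drop anything after the number, such as a trailing year-like token).

```python
import re
from typing import Optional

# Segmentation marks seen in 082 data: slash, ASCII apostrophe, prime.
_SEGMENTATION = re.compile(r"[/'\u2032]")
# Three-digit class, optional decimal extension, not followed by another digit.
_DDC = re.compile(r"^(\d{3})(?:\.(\d+))?(?!\d)")


def normalize_ddc(raw: str) -> Optional[str]:
    """Return a bare DDC number such as '823.914', or None if none is found.

    Hypothetical helper for illustration only; the rules are assumptions.
    """
    # Remove segmentation marks and surrounding whitespace.
    cleaned = _SEGMENTATION.sub("", raw.strip())
    # Close up spaces around the decimal point: "823 .8" becomes "823.8".
    cleaned = re.sub(r"\s*\.\s*", ".", cleaned)
    match = _DDC.match(cleaned)
    if match is None:
        # Non-numeric values such as "[Fic]" are left for a human to decide.
        return None
    base, extension = match.groups()
    # Anything after the number (for example a trailing "19") is dropped.
    return f"{base}.{extension}" if extension else base


print(normalize_ddc("823/.914"))  # 823.914
print(normalize_ddc("813'.54"))   # 813.54
print(normalize_ddc("823.8 19"))  # 823.8
print(normalize_ddc("[Fic]"))     # None
```

Whether the trailing token should be discarded or stored separately is an open question; the sketch simply drops it.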

The OL record's representation of DDC appears to be extracted from this field by https://github.com/internetarchive/openlibrary/blob/c36d243cb552fa4bb463aaaa4bd4e70f1d80860a/openlibrary/catalog/marc/build_record.py#L203

and built into an edition record by https://github.com/internetarchive/openlibrary/blob/c36d243cb552fa4bb463aaaa4bd4e70f1d80860a/openlibrary/catalog/marc/build_record.py#L505

xayhewalo commented 4 years ago

@LeadSongDog Can you provide a URL showcasing poor representation of DDC?

Also, @seabelis, I think your input would be valuable for this issue. Are you willing to be the assignee? Note: during triage, the assignee is the person who gathers information so that whoever eventually does the work is well equipped to do it.

From the Managed Labels Wiki:

The assigned owner is not necessarily the person who will fix the issue (it is not necessarily even established, at that point, if or when the issue will be fixed at all), but rather they are the person who will do as much or as little as needed to handle the issue (asking questions, soliciting input, establishing and updating the priority, checking if it is a duplicate, etc).

Once an issue is labeled State: Work In Progress, the owner is the individual doing the work, or leading/coordinating the group that is doing the work.

I've added labels based on context; let me know your thoughts.

LeadSongDog commented 4 years ago

@guyjeangiles See https://openlibrary.org/books/OL2589429M/The_crown_of_wild_olive for an example. Other editions mostly omit the DDC, while at least one has an entirely different DDC.

tfmorris commented 4 years ago

We don't use this field for anything currently, so this is purely a cosmetic (and thus low priority) issue.

What should the normalized format for an imported MARC 082 field look like?

seabelis commented 4 years ago

@guyjeangilles Yes, you can assign to me.