umd-coding-workshop / website

Hacking the Shell
https://github.com/umd-coding-workshop/website/wiki

Find subject categories for a list of ISSNs and add them to a spreadsheet #22

Open spurioso opened 11 years ago

spurioso commented 11 years ago

This is something that came up in a meeting of UMD selectors and collection management people last week.

The Libraries will be doing a survey with faculty to determine how important our serials subscriptions are.

Faculty will be asked to look at a list of our subscriptions and rank them in some way (e.g., "can NEVER cancel," "could do without," etc.).

We'll be using some kind of tool developed by NC State for this. I can add the link when I find it.

The lists of subscriptions will be sorted by fund code only. So, all the music journals will be listed on one spreadsheet. This includes academic musicology journals, trade magazines for instrumentalists, music education journals, and so on.

At the meeting, some selectors expressed a desire to further subdivide the lists. So, for me, maybe I would want to separate out the music education journals from the music theory journals, since a music theorist might not be interested at all in music education. The situation is maybe even more acute for something like English literature that has several hundred subscriptions.

So, I'm wondering if there's a way to take the list of ISSNs that will be in the spreadsheet, feed them through something, and have that something spit back out useful subject categories. I know this is possible with Aleph or WorldCat. I'm also wondering if Ulrich's offers a Web service that might do something similar, maybe with simpler subcategories. There may be other options out there too.

jwestgard commented 11 years ago

Steve,

I think you could pretty easily write a Python script to loop through a list of ISSNs and query LOC's catalog API for each one. Then, you could use the 're' (regular expressions) Python module to parse the resulting records and locate a particular field or string.

The question is whether there is a single field that we could expect to find in each record retrievable from LOC's catalog that would let us sort the journals into meaningful sub-categories. Subject fields are the sort of thing you're looking for, I think, but how do you deal with records that return multiple subject fields? Maybe some MARC specialists could point us in the right direction.

Here's more about LOC's API: http://www.loc.gov/standards/sru/

A sample query is: http://z3950.loc.gov:7090/voyager?version=1.1&operation=searchRetrieve&query=0596002815&maximumRecords=1&recordSchema=dc

The bit after "&query=" is where you could put in the ISBN/ISSN. This query asks for one result only, and that the result be Dublin Core metadata. I put in the ISBN for Mark Lutz's Learning Python book.
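A minimal sketch of building that query URL in Python, using only the standard library. The helper name `build_sru_url` and its parameters are hypothetical; the base URL and the query parameters are copied from the sample query above. Whether the endpoint still answers at this address is not something the sketch checks.

```python
from urllib.parse import urlencode

# Base URL and parameter names are taken from the sample query in this thread.
SRU_BASE = "http://z3950.loc.gov:7090/voyager"


def build_sru_url(identifier, schema="dc", max_records=1):
    """Return an SRU searchRetrieve URL for one ISBN or ISSN (hypothetical helper)."""
    params = [
        ("version", "1.1"),
        ("operation", "searchRetrieve"),
        ("query", identifier),
        ("maximumRecords", str(max_records)),
        ("recordSchema", schema),
    ]
    # urlencode keeps the order of the list and escapes any unsafe characters.
    return SRU_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    # Same identifier as the sample query (the ISBN for Learning Python).
    print(build_sru_url("0596002815"))
```

The resulting string could then be passed to `urllib.request.urlopen` to fetch the record, one identifier at a time.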

I have a bit of Python that you can use to query a website. It requires a module called urllib, but it is pretty easy to work with. I will post it on GitHub and put a link here.

Josh

lseguin commented 11 years ago

Because of their specificity, LC subject headings are not particularly useful for categorizing things. LC Classification numbers are better (you would have to translate them into words), but I don't see one in the DC output for Josh's record below.

Linda


From: Joshua Westgard [notifications@github.com] Sent: Wednesday, August 28, 2013 4:16 PM To: umd-coding-workshop/website Subject: Re: [website] Find subject categories for a list of ISSNs and add them to a spreadsheet (#22)


jwestgard commented 11 years ago

Linda, maybe this is better?

http://z3950.loc.gov:7090/voyager?version=1.1&operation=searchRetrieve&query=0596002815&maximumRecords=1&recordSchema=marcxml

I only changed the very end of the query string, from 'dc' to 'marcxml'.

There might be another option that's better. For a full list of the schemata available via this service, see: http://www.loc.gov/standards/sru/resources/schemas.html

lseguin commented 11 years ago

Yes. Datafield tag "050", subfield code "a" is the classification number.
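As a rough sketch, the 050 $a value can be pulled out of a MARCXML record like this. It uses the standard library's XML parser rather than the regular expressions suggested earlier in the thread, since the record is already XML. The function name `extract_050a` is hypothetical, and the sample record is hand-written and abbreviated for illustration, not real LOC output.

```python
import xml.etree.ElementTree as ET

# MARCXML elements live in this namespace.
MARC_NS = "http://www.loc.gov/MARC21/slim"
DATAFIELD = "{%s}datafield" % MARC_NS
SUBFIELD = "{%s}subfield" % MARC_NS

# Hand-written, abbreviated record for illustration only.
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="050" ind1="0" ind2="0">
    <subfield code="a">ML1</subfield>
    <subfield code="b">.J68</subfield>
  </datafield>
</record>"""


def extract_050a(marcxml_text):
    """Return every 050 $a value found in the MARCXML text, in document order."""
    root = ET.fromstring(marcxml_text)
    values = []
    # iter() searches the whole tree, so this also works when the <record>
    # is nested inside a larger response document.
    for field in root.iter(DATAFIELD):
        if field.get("tag") != "050":
            continue
        for sub in field.findall(SUBFIELD):
            if sub.get("code") == "a" and sub.text:
                values.append(sub.text.strip())
    return values


if __name__ == "__main__":
    print(extract_050a(SAMPLE_RECORD))
```

A record can carry more than one 050 field, which is why the sketch returns a list instead of a single string.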

jwestgard commented 11 years ago

Cool. Thanks, Linda! So I think it would be pretty easy to take the list of ISSNs, query the LOC for the MARCXML record for each one, use regular expressions to pull out the contents of the 050a field from each record, and write it into a spreadsheet next to the ISSN. The question is, Steve, is that the sort of thing you were after?
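The last step, writing each classification number next to its ISSN, might look something like the sketch below. It writes CSV, which any spreadsheet program can open. The function name `write_classification_csv`, the column names, and the `lookup` callable are all assumptions made for illustration; the ISSNs and class numbers in the demo are made-up placeholders.

```python
import csv
import io


def write_classification_csv(issns, lookup, out_file):
    """Write one row per ISSN: the ISSN and whatever lookup(issn) returns."""
    writer = csv.writer(out_file)
    writer.writerow(["issn", "lc_class"])
    for issn in issns:
        # lookup is expected to return a string, or None when nothing was found;
        # a missing result becomes an empty cell rather than stopping the run.
        writer.writerow([issn, lookup(issn) or ""])


if __name__ == "__main__":
    # Stand-in lookup with made-up values; a real one would query the catalog.
    fake_results = {"0000-0001": "ML1", "0000-0002": "ML5"}
    buffer = io.StringIO()
    write_classification_csv(
        ["0000-0001", "0000-0002", "0000-0003"], fake_results.get, buffer
    )
    print(buffer.getvalue())
```

Passing the lookup in as a function keeps the spreadsheet step separate from the choice of data source, which is still undecided here.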

spurioso commented 11 years ago

Thanks, Josh and Linda. Yes, this is more or less what I'm after. Deciding on the data source will be the key, I think. For music, LC Class won't work well. Most of the music journals fall into ML1 (for music journals published in the U.S.) or ML5 (for journals published elsewhere), so there isn't much granularity there. A few of them fall into specific classifications, like ML410 for journals devoted to a single composer. Still, just within ML5 you might have a journal on Medieval music and one on 20th-century music theory. LC Class might work better for other disciplines, though.

As Linda mentioned, LCSH would probably be problematic too, because the headings are so specific. However, we could try it by looking for 650 fields that include the word "Periodicals," which is used as a subdivision. Worth trying.
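A sketch of that 650 idea, assuming the records arrive as MARCXML: keep only the 650 fields that have a "Periodicals" subdivision, and return the rest of the heading. The function name `periodical_subjects` is hypothetical, and the sample record is hand-written for illustration, not taken from a real catalog.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
DATAFIELD = "{%s}datafield" % MARC_NS
SUBFIELD = "{%s}subfield" % MARC_NS

# Hand-written record for illustration only.
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Music theory</subfield>
    <subfield code="v">Periodicals.</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Music</subfield>
    <subfield code="x">Instruction and study</subfield>
    <subfield code="v">Periodicals.</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Jazz</subfield>
    <subfield code="x">History and criticism.</subfield>
  </datafield>
</record>"""


def periodical_subjects(marcxml_text):
    """Return 650 headings that carry a 'Periodicals' subdivision, minus that subdivision."""
    root = ET.fromstring(marcxml_text)
    subjects = []
    for field in root.iter(DATAFIELD):
        if field.get("tag") != "650":
            continue
        # Strip whitespace and the trailing period that often ends a heading.
        parts = [(sub.text or "").strip().rstrip(".") for sub in field.findall(SUBFIELD)]
        if "Periodicals" not in parts:
            continue
        kept = [text for text in parts if text and text != "Periodicals"]
        subjects.append(" -- ".join(kept))
    return subjects


if __name__ == "__main__":
    print(periodical_subjects(SAMPLE_RECORD))
```

A record with several matching 650 fields yields several headings, so some rule for choosing among them would still be needed.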

I was hoping that either EBSCOnet or Ulrich's would have its own taxonomy, but I just checked both and they seem to use LC Class and Dewey. I don't know if Dewey would be useful.

Hmm...

spurioso commented 11 years ago

  1. This is a project that is already in the works.
  2. Could be a good example of how code can help subject specialists do their jobs better.
  3. Cross-departmental buy-in.

spurioso commented 11 years ago

  1. Dependencies: need to fit subjects into existing application.
  2. Need to find a good usable list of subjects (LCSH? LCCS? Ulrich's?)
  3. How do we get ISSNs? Report in Aleph? Will run by Acquisitions/Collection Management...

lseguin commented 11 years ago

If you’re going to get a report from Aleph for the ISSNs, then you could get the classification numbers in the same report. Then your challenge would be writing code that would convert the class numbers to their subject categories.
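One possible shape for that conversion code, as a sketch: parse the class letters and number from the start of a call number, then look them up in a table of ranges. The function name `categorize` and the table layout are hypothetical. The four ML entries use the descriptions given in this thread and are only illustrative; a usable table would have to be built from the LC Classification schedules.

```python
import re

# (class letters, lowest number, highest number, label).
# Illustrative entries only, based on the descriptions given in this thread.
CATEGORIES = [
    ("ML", 1, 1, "Music journals published in the U.S."),
    ("ML", 5, 5, "Music journals published elsewhere"),
    ("ML", 410, 410, "Single composer"),
    ("ML", 3505.8, 3509, "Jazz"),
]

# Class letters followed by a whole or decimal number, e.g. "ML3505.8".
CLASS_NUMBER = re.compile(r"^\s*([A-Z]{1,3})\s*(\d+(?:\.\d+)?)")


def categorize(call_number, table=CATEGORIES):
    """Return the label of the first matching range, or None if nothing matches."""
    match = CLASS_NUMBER.match(call_number.upper())
    if not match:
        return None
    letters = match.group(1)
    number = float(match.group(2))
    for prefix, low, high, label in table:
        if letters == prefix and low <= number <= high:
            return label
    return None


if __name__ == "__main__":
    # Made-up call numbers for illustration.
    for example in ["ML1 .J68", "ML410.B4", "ML3506 .J39", "PR1 .E5"]:
        print(example, "->", categorize(example))
```

Anything after the class number, such as the cutter, is ignored, and class numbers outside the table come back as `None` so they can be reviewed by hand.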

Linda

From: Steve Henry [mailto:notifications@github.com] Sent: Thursday, November 07, 2013 9:26 AM To: umd-coding-workshop/website Cc: Linda Seguin Subject: Re: [website] Find subject categories for a list of ISSNs and add them to a spreadsheet (#22)


spurioso commented 11 years ago

Thanks, @lseguin!

I wonder if LC's Linked Data service might be useful:

Here's their record for books or journals about jazz: http://id.loc.gov/authorities/classification/ML3505.8-ML3509.html

spurioso commented 11 years ago

This project now has a repo! https://github.com/umd-coding-workshop/journal-review