sul-dlss / sul_pub

SUL system for harvesting and managing publications for Stanford CAP, with controlled API access.
http://cap.stanford.edu

Normalize and make consistent how name/alternate institution queries are crafted in both WoS and Pubmed #1062

Open peetucket opened 5 years ago

peetucket commented 5 years ago

Currently the alternate institution lists need to be edited before crafting the query (to remove things like &, university, etc.). This is done differently in WoS and Pubmed. The two harvesters also differ in how (or whether) they create alternate name variants to send to the query. We may want to create consistent methods that do this for both harvesters, if possible.

peetucket commented 5 years ago

See https://github.com/sul-dlss/sul_pub/pull/1060 for the work that added Pubmed query editing.

peetucket commented 5 years ago

For example, in addition to what we do in the https://github.com/sul-dlss/sul_pub/blob/master/lib/pubmed/query_author.rb class, it appears we already have code doing something similar for the WoS search in this class: https://github.com/sul-dlss/sul_pub/blob/master/lib/agent/author_institution.rb

It is stripping things like "and" and "university". It is used here to construct a list of institutions to add to the query:

https://github.com/sul-dlss/sul_pub/blob/master/lib/web_of_science/query_author.rb#L40-L42

We also create name variants in https://github.com/sul-dlss/sul_pub/blob/master/lib/agent/author_name.rb that are used in the WoS queries but that we don't take advantage of in the Pubmed queries.

It would be nice to use the classes in lib/agent for both WoS and Pubmed for consistency.

Thoughts on re-using this logic? Note that, I believe, we ended up stripping "University", "Institution", and "College" in WoS queries for a similar reason (the query was picking up extra stuff), which is perhaps not a problem for Pubmed. But I wanted to acknowledge a bit of duplication here for consideration.

author = Author.find(37959)
WebOfScience::QueryAuthor.new(author).send(:institutions)
=> ["stanford", "oregon health & science", "washington"]
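To make the idea of a shared helper concrete, here is a minimal sketch of what a common normalizer could look like if both harvesters called it. It is not the existing lib/agent code: the module name `InstitutionNormalizer`, its method names, and the stopword list are all assumptions for illustration, and the input names are hypothetical guesses at what might produce output like the console example above.

```ruby
# Hypothetical shared helper. The module name, method names, and stopword
# list are assumptions for illustration, not the existing lib/agent code.
module InstitutionNormalizer
  # Assumed list of words to drop before building a query.
  STOPWORDS = %w[university institution college of the and].freeze

  # Lowercases the name, drops stopwords, and collapses whitespace.
  def self.normalize(name)
    name.to_s.downcase.split.reject { |word| STOPWORDS.include?(word) }.join(' ')
  end

  # Normalizes a list of names, dropping blanks and duplicates.
  def self.normalize_all(names)
    names.map { |name| normalize(name) }.reject(&:empty?).uniq
  end
end

# Hypothetical institution names, chosen only for the example.
names = [
  'Stanford University',
  'Oregon Health & Science University',
  'University of Washington'
]
puts InstitutionNormalizer.normalize_all(names).inspect
```

If something along these lines lived in lib/agent, both WebOfScience::QueryAuthor and the Pubmed query class could call the same method, and any harvester-specific differences (for example, whether to keep "&") could be passed in as an option rather than reimplemented.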