iuni-cadre / Collaborative-projects

For non-fellow collaborative projects on CADRE

BTAA data extraction for journal cost analysis #6

Closed everyxs closed 5 years ago

everyxs commented 5 years ago

Hi Val & Xiaoran,

The BTAA is requesting a subset of WoS data from our prototype of CADRE that has the following attributes (this is related to the use case we discussed on Friday):

Corresponding author institutional affiliation = University of Illinois OR Indiana University OR University of Iowa OR University of Maryland OR University of Michigan OR Michigan State University OR University of Minnesota OR University of Nebraska-Lincoln OR Northwestern University OR Ohio State University OR Pennsylvania State University OR Purdue University OR Rutgers University-New Brunswick OR University of Wisconsin-Madison

Fields should include: Title, Author, Publication Name, Publisher, ISSN, Year Published, DOI

Is it possible to extract this dataset from the WoS data as a csv? If so, are there any restrictions on use or distribution of the dataset? Is this something that is achievable in the next day or two?

Thank you so much,


Hi Matt,

Thank you for this explanation. Let’s go ahead and use the 2016 data. I heard back from the BTAA about our Clarivate contract, and we don’t need to worry about restrictions for sharing within the BTAA. We should also include University of Chicago in this dataset. So in summary, we need a subset of WoS in csv format that includes:

• Corresponding author institutional affiliation = University of Illinois OR Indiana University OR University of Iowa OR University of Maryland OR University of Michigan OR Michigan State University OR University of Minnesota OR University of Nebraska-Lincoln OR Northwestern University OR Ohio State University OR Pennsylvania State University OR Purdue University OR Rutgers University-New Brunswick OR University of Wisconsin-Madison OR University of Chicago
  o These should be identified using the WoS vocabulary for institutions and limited to the specified campus, we don’t want IUPUI, for example, or UIC. When no campus is specified, assume the flagship campus.
• Web of Science ID
• Publication type
• eISSN
• Reprint Author (corresponding author)
• OA status
• Funding Acknowledgement text
• Title
• Author
• Publication Name
• Publisher
• ISSN
• Year Published
• DOI
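The selection above could be sketched roughly as follows. This is only an illustration, not the query that was actually run: the record keys (`reprint_org`, `reprint_city`, and the output field names) are hypothetical, and the campus-to-city table is an assumption that shows three of the fifteen institutions rather than the real WoS institution vocabulary.

```python
# Hypothetical lookup: institution label -> cities of the campus we want.
# Labels and cities are illustrative assumptions, not WoS vocabulary terms.
WANTED_CAMPUSES = {
    "indiana university": {"bloomington"},            # excludes IUPUI (Indianapolis)
    "university of illinois": {"urbana", "champaign"},  # excludes UIC (Chicago)
    "university of chicago": {"chicago"},
}

# Hypothetical output field names, one per requested attribute.
OUTPUT_FIELDS = [
    "wos_id", "publication_type", "eissn", "reprint_author", "oa_status",
    "funding_text", "title", "authors", "publication_name", "publisher",
    "issn", "year", "doi",
]


def is_wanted(record):
    """True if the reprint author's institution and city match a wanted campus."""
    org = record.get("reprint_org", "").strip().lower()
    city = record.get("reprint_city", "").strip().lower()
    return city in WANTED_CAMPUSES.get(org, set())


def select(records):
    """Keep matching records and project them onto the requested fields."""
    return [
        {field: record.get(field, "") for field in OUTPUT_FIELDS}
        for record in records
        if is_wanted(record)
    ]
```

Matching on city is a stand-in for "limited to the specified campus"; a real implementation would need whatever campus identifier the WoS data provides.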



everyxs commented 5 years ago

Hi Jamie,

The data extraction is complete and you can find the results in /CADRE Project/BTAAwosQuery/BTAA_query_III.csv


We have included all papers with corresponding authors affiliated with any of the 14 university systems. We also provided additional information about the city location and full address, which can help determine the campus within each university system where the author is based.

Please let us know if you see any problem with the dataset.

Thank you!


Hi Xiaoran,

Here are the concerns with the WoS dataset:

• Also, Jamie, can you get the FU field pulled from the full records in the Clarivate data? The dataset you sent may not have had that because it was old. I know they made some change in how they handle funding information around 2015 (later for some of the social sciences and A&H indexes), although maybe that's just related to the ability to filter/search on "Funding Agencies".
• Jamie - do you know how Xiaoran cleaned up the data? Matt's approximation was to use the WoS enhanced organization and email address. If people at IU Center for Global Health had @iu.edu email addresses (and they used their work email when submitting...some people seem to use their personal gmail address), then it would be caught.
• What I downloaded has articles with publication dates ranging from 1904 to 2004.
• Jamie - can you also clarify where the data in Columns D, M, N, and O are from? I've been using the fields in the full record that are called RP and EM and I'm not sure how those are parsed out among the columns in the query file. Also, if we could get the print serial number (I think this is SN in the Web of Science full record), that could be useful.
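For reference, the enhanced-organization-plus-email approximation described above might look something like the sketch below. The function names and arguments are hypothetical, and only the `iu.edu` domain comes from this thread; any other domains would have to be supplied by whoever runs it.

```python
def email_domain(address):
    """Return the lowercased domain of an email address, or "" if there is no "@"."""
    _, sep, domain = address.strip().rpartition("@")
    return domain.lower() if sep else ""


def matches_institution(enhanced_orgs, emails, org_name, domains):
    """Approximate affiliation: enhanced organization match OR work-email match.

    enhanced_orgs: organization labels attached to the record
    emails:        email addresses attached to the record (the EM field)
    org_name:      the institution label to look for
    domains:       set of email domains treated as belonging to the institution
    """
    org_hit = org_name.lower() in (org.lower() for org in enhanced_orgs)
    email_hit = any(email_domain(address) in domains for address in emails)
    return org_hit or email_hit
```

As noted in the thread, an author who submits with a personal address is caught only if the organization label matches, so this remains an approximation.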



everyxs commented 5 years ago

Hi Jamie,

Here is our updated output. The CSV file contains 1202378 papers, each with a uniquely assigned reprint address id. Using reprint address ids instead of organization ids allows an unambiguous assignment of each paper.

The columns are, with their corresponding Field Tags in order: WoSid, PT, FX, PY, EI, SN, DI, reprint author, TI, SO, AU, PU, reprint address city, RP, FU

These are the closest Field Tags we could find; according to Clarivate, the mapping from our XML data to the old field tag system is not one-to-one. The new XML fields are much richer, and they are retiring the tag system.
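If it helps with checking the file, here is a minimal sketch for reading it with the column order listed above and reporting the range of publication years. It assumes the columns appear in exactly that order and that `PY` holds a four-digit year; whether the file has a header row is not stated here, so that is left as a hypothetical `has_header` parameter.

```python
import csv
import io

# Column order as listed in this comment.
COLUMNS = [
    "WoSid", "PT", "FX", "PY", "EI", "SN", "DI", "reprint author",
    "TI", "SO", "AU", "PU", "reprint address city", "RP", "FU",
]


def read_rows(handle, has_header=False):
    """Yield one dict per CSV row, keyed by the column names above."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the header row if the file has one
    for values in reader:
        if len(values) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
        yield dict(zip(COLUMNS, values))


def year_range(rows):
    """Return (earliest, latest) publication year, or None if no year parses."""
    years = [int(row["PY"]) for row in rows if row["PY"].strip().isdigit()]
    return (min(years), max(years)) if years else None
```

With the real file this would be called as `year_range(read_rows(open(path, newline="")))`, where `path` is wherever the CSV was saved.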

Please let me know if there are any other problems.

Thank you!


XiaoranYan commented 5 years ago

Thank you both, this is great. I just wanted to share the feedback I got from the BTAA when I sent this over; they seem very happy:


This is awesome! I haven’t received the Clarivate pull yet…so actually it will be interesting to compare.

We scoped (I think) Clarivate for only 2017 and your file looks like it is everything from 1900-2017.

You also provided all publishers and we asked Clarivate for Wiley only.

So I think this is the robust dataset everyone had hoped for.

We’ll just need to agree on the approach to narrowing the data to analyze.
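One possible way to narrow the file to the scope requested from Clarivate (2017, Wiley only) is sketched below. It assumes rows are dicts keyed by the `PY` and `PU` field tags and that a case-insensitive substring match on the publisher name is good enough; how publisher names are actually spelled in the data would need checking first.

```python
def narrow(rows, year=None, publisher_contains=None):
    """Keep rows matching an optional publication year and publisher substring."""
    needle = publisher_contains.lower() if publisher_contains else None
    kept = []
    for row in rows:
        # Compare years as strings, since CSV values are read as text.
        if year is not None and row.get("PY", "").strip() != str(year):
            continue
        # Case-insensitive substring match on the publisher field.
        if needle and needle not in row.get("PU", "").lower():
            continue
        kept.append(row)
    return kept
```

For example, `narrow(rows, year=2017, publisher_contains="wiley")` would keep only 2017 rows whose publisher field contains "wiley" in any capitalization.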

I was able to open the file.

I think we can share this out and use it. And we have a Clarivate backup.

Many many thanks for persevering on our behalf.

You’ve just been terrific.



Jamie Wittenberg

Research Data Management Librarian

Head, Scholarly Communication Department

Indiana University Libraries

Herman B Wells Library E363

Bloomington, IN 47405

(812) 855-7769

