sul-dlss-deprecated / rialto-etl

ETL tools for RIALTO, Stanford Libraries' research intelligence project
https://library.stanford.edu/projects/rialto
Apache License 2.0

Investigate the topics/keyword values in WoS data to see what's there #86

Closed peetucket closed 5 years ago

peetucket commented 5 years ago

How consistent are the values? This will drive the topic-area selectors in the reports.

mjgiarlo commented 5 years ago

http://ipscience-help.thomsonreuters.com/wosWebServicesExpanded/appendix1Group/ascaCategories/version/1
http://ipscience-help.thomsonreuters.com/wosWebServicesExpanded/8007-TRS.html?branch=authorityFile
http://ipscience-help.thomsonreuters.com/wosWebServicesExpanded/8010-TRS.html?branch=authorityFile

$ cat spec/fixtures/wos/000424386600014.json | jq '.static_data.fullrecord_metadata.category_info.subjects.subject | .[] | select(.ascatype == "extended").content' | sort | uniq

"Biochemistry & Molecular Biology"
"Biotechnology & Applied Microbiology"
"Chemistry"
"Computer Science"
"Genetics & Heredity"
"Mathematical & Computational Biology"
"Mathematics"
"Oncology"
"Pharmacology & Pharmacy"
"Research & Experimental Medicine"
"Science & Technology - Other Topics"

$ cat spec/fixtures/wos/000424386600014.json | jq '.static_data.fullrecord_metadata.category_info.subjects.subject | .[] | select(.ascatype == "traditional").content' | sort | uniq

"Biochemical Research Methods"
"Biotechnology & Applied Microbiology"
"Chemistry, Medicinal"
"Chemistry, Multidisciplinary"
"Computer Science, Information Systems"
"Computer Science, Interdisciplinary Applications"
"Genetics & Heredity"
"Mathematical & Computational Biology"
"Medicine, Research & Experimental"
"Multidisciplinary Sciences"
"Oncology"
"Pharmacology & Pharmacy"
"Statistics & Probability"

$ cat spec/fixtures/wos/000424386600014.json | jq '.static_data.fullrecord_metadata.category_info.subheadings.subheading' | sort | uniq

"Life Sciences & Biomedicine"
"Physical Sciences"
"Technology"

$ cat spec/fixtures/wos/000424386600014.json | jq '.static_data.fullrecord_metadata.category_info.headings.heading' | sort | uniq

"Science & Technology"

$ cat spec/fixtures/wos/000424386600014.json | jq '.static_data.fullrecord_metadata.keywords.keyword[]' | sort | uniq

"Amino acid similarities"
"Biomarkers"
"Convolutional neural network"
"Deep learning"
"Disease"
"Drug"
"drug discovery"
"functional genomics"
"Gene"
"mechanisms"
"molecular"
"Mutation analysis"
"Pharmacogenetics"
"Pharmacogenomics"
"PharmGKB"
"Polymorphism"
"precision medicine"
"Protein structural analysis"
"Structural bioinformatics"
"systems biology"
"Tamoxifen"
"therapeutics"
"Variant"
"Warfarin"

$ cat spec/fixtures/wos/000424386600014.json | jq '.static_data.item.keywords_plus.keyword[]' | sort | uniq

"ADJUVANT TAMOXIFEN"
"ALZHEIMERS-DISEASE"
"ANNOTATION"
"BACTERIOPHAGE-T4 LYSOZYME"
"BIOLOGY"
"BOLTZMANN MACHINES"
"CANCER"
"CHEMOINFORMATICS"
"CLASSIFICATION"
"CONNECTIVITY MAP"
"DATABASE"
"DISEASE"
"DNA MICROARRAY"
"DOXORUBICIN"
"DRUG"
"DRUG DISCOVERY"
"DRUG-SENSITIVITY"
"ENHANCED PROTEIN THERMOSTABILITY"
"FACTOR-BINDING"
"FINGERPRINTS"
"GENE-EXPRESSION"
"GENE-EXPRESSION DATA"
"GENE-EXPRESSION PROFILES"
"GENOME BROWSER"
"HYDROPHOBIC CORE"
"INFORMATION"
"JOHNSON SYNDROME"
"MICROARRAYS"
"MUTATIONS"
"ONTOLOGY TERMS"
"PERFORMANCE"
"PERSONALIZED MEDICINE"
"PHARMACOGENETICS"
"PHARMACOGENETICS IMPLEMENTATION CONSORTIUM"
"PHARMACOGENOMICS KNOWLEDGE-BASE"
"POPULATION"
"PRECISION MEDICINE"
"PREDICTION"
"PRINCIPLES"
"PROFILES"
"PROTEINS"
"REGIONS"
"REGRESSION"
"RESOLUTION"
"SACCHAROMYCES-CEREVISIAE"
"SELECTION"
"SENSITIVITY"
"SEQUENCES"
"SHOCK TRANSCRIPTION FACTOR"
"STABILITY"
"STRUCTURAL-ANALYSIS"
"SUPPORT VECTOR MACHINE"
"SURROGATE END-POINTS"
"SURVIVAL"
"SYSTEMS PHARMACOLOGY ANALYSIS"
"T4 LYSOZYME"
"TEMPERATURE-SENSITIVE MUTANT"
"TOOL"
"TOXIC EPIDERMAL NECROLYSIS"
"UCSC"
"UK BIOBANK"
"WHOLE-GENOME"
"YEAST"

mjgiarlo commented 5 years ago

@peetucket Using our (admittedly small batch of) fixture data, I pulled out topic-like data and put that in the prior comment. Would you mind reviewing these and suggesting which one looks the most promising?
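
The two keyword listings in the prior comment use different casing ("drug discovery" versus "DRUG DISCOVERY"), so a raw comparison understates how much they overlap. A minimal sketch of a case-insensitive comparison, using only a hand-picked subset of the values listed above (the helper name `normalized` is hypothetical, not part of rialto-etl):

```python
def normalized(values):
    """Case-fold and collapse whitespace so 'Drug' and 'DRUG' compare equal."""
    return {" ".join(v.split()).casefold() for v in values}


# Subsets of the author-keyword and Keywords Plus listings above, not the full lists.
author_keywords = ["Drug", "drug discovery", "Disease", "Tamoxifen", "precision medicine"]
keywords_plus = ["DRUG", "DRUG DISCOVERY", "DISEASE", "ADJUVANT TAMOXIFEN", "PRECISION MEDICINE"]

# Terms that appear in both fields once casing is ignored.
shared = sorted(normalized(author_keywords) & normalized(keywords_plus))
print(shared)
```

Even after normalization, both keyword fields remain free text, which is one reason a controlled list such as the subject categories may be easier to turn into a selector.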

peetucket commented 5 years ago

The "extended" subject types seem like the best option to me. One thing I am wondering about is whether these are assigned at the level of the journal (from their documentation, it appears that they are). In other words, each publication may have one, but it will be the same for every publication associated with that journal. I don't think we need to change our logic, but it is interesting to note. Is that your interpretation as well?

mjgiarlo commented 5 years ago

That is my interpretation as well, @peetucket. Let's go with extended subject types, then.
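
For reference, the jq filter for "extended" subjects used earlier in this thread can be sketched in Python roughly as follows. This is an illustration only, not the mapping code in rialto-etl: the function name `extended_subjects` is hypothetical, the sample record is hand-written rather than the real fixture, and the single-object fallback is an assumption about how a record with only one subject might serialize.

```python
def extended_subjects(record):
    """Return sorted, de-duplicated "extended" subject names from one WoS record."""
    subjects = (
        record.get("static_data", {})
        .get("fullrecord_metadata", {})
        .get("category_info", {})
        .get("subjects", {})
        .get("subject", [])
    )
    # Assumption: a lone subject may arrive as a single object instead of a list.
    if isinstance(subjects, dict):
        subjects = [subjects]
    return sorted(
        {s["content"] for s in subjects if s.get("ascatype") == "extended"}
    )


# Hand-written sample mirroring the path used in the jq commands; not the real fixture.
sample = {
    "static_data": {
        "fullrecord_metadata": {
            "category_info": {
                "subjects": {
                    "subject": [
                        {"ascatype": "traditional", "content": "Chemistry, Medicinal"},
                        {"ascatype": "extended", "content": "Chemistry"},
                        {"ascatype": "extended", "content": "Oncology"},
                        {"ascatype": "traditional", "content": "Oncology"},
                        {"ascatype": "extended", "content": "Chemistry"},
                    ]
                }
            }
        }
    }
}

print(extended_subjects(sample))
```

Records that lack the `category_info` block yield an empty list rather than raising, which seems like a reasonable default for a batch ETL run.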

mjgiarlo commented 5 years ago

This is now done, and documented in the publication mapping: https://github.com/sul-dlss-labs/rialto-etl/wiki#publications-wosweb-of-science-mapping