sul-dlss / sul_pub

SUL system for harvest and managing publications for Stanford CAP, with controlled API access.
http://cap.stanford.edu

WOS record data has odd formatting inside the data itself; consider reformatting to remove inline spaces, returns, etc. #552

Open dazza-codes opened 6 years ago

dazza-codes commented 6 years ago

There are various formatting issues in the WOS record data when it is placed in the pub_hash. For example, the MEDLINE:26776186 record contains additional spaces and newlines:

:abstract=>
  "\n              Automatically data-mining clinical practice patterns from\n              electronic health records (EHR) can enable prediction of future\n              practices as a form of clinical decision support (CDS). Our\n              objective is to determine the stability of learned clinical\n              practice patterns over time and what implication this has when\n              using varying longitudinal historical data sources towards\n              predicting future decisions. We trained an association rule engine\n              for clinical orders (e.g., labs, imaging, medications) using\n              structured inpatient data from a tertiary academic hospital.\n              Comparing top order associations per admission diagnosis from\n              training data in 2009 vs. 2012, we find practice variability from\n              unstable diagnoses with rank biased overlap (RBO)<0.35 (e.g.,\n              pneumonia) to stable admissions for planned procedures (e.g.,\n              chemotherapy, surgery) with comparatively high RBO>0.6.\n              Predicting admission orders for future (2013) patients with\n              associations trained on recent (2012) vs. older (2009) data\n              improved accuracy evaluated by area under the receiver operating\n              characteristic curve (ROC-AUC) 0.89 to 0.92, precision at ten\n              (positive predictive value of the top ten predictions against\n              actual orders) 30% to 37%, and weighted recall (sensitivity) at\n              ten 2.4% to 13%, (P<10(-10)). Training with more longitudinal\n              data (2009-2012) was no better than only using recent (2012) data.\n              Secular trends in practice patterns likely explain why smaller but\n              more recent training data is more accurate at predicting future\n              practices.\n            ",
:title=>"\n          DYNAMICALLY EVOLVING CLINICAL PRACTICES AND IMPLICATIONS FOR\n          PREDICTING MEDICAL DECISIONS.\n        ",

We do not control what WOS puts into the XML. Currently, the XML element text is mapped to the pub_hash without reformatting it.

peetucket commented 6 years ago

Can we at least run .strip on the value of the node so the leading/trailing whitespace and returns are removed? I'm not sure about the inline stuff, but hopefully that is less common in titles anyway.

Created #553 for this simple first implementation
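
For reference, a minimal sketch of the two options discussed above, using only plain Ruby string methods. The helper names `strip_text` and `squish_text` are hypothetical and are not part of sul_pub; the sample value is the title from the record quoted in this issue. `strip_text` corresponds to the simple `.strip` approach, while `squish_text` shows one possible way to also handle the inline whitespace.

```ruby
# Hypothetical helpers for illustration only; these names do not exist in sul_pub.

# Simple first step: remove only leading/trailing whitespace and newlines.
def strip_text(text)
  text.to_s.strip
end

# Possible follow-up: also collapse every inline run of whitespace
# (spaces, tabs, newlines) into a single space.
def squish_text(text)
  text.to_s.gsub(/\s+/, ' ').strip
end

# Title value as it appears in the pub_hash for the MEDLINE:26776186 record.
title = "\n          DYNAMICALLY EVOLVING CLINICAL PRACTICES AND IMPLICATIONS FOR\n          PREDICTING MEDICAL DECISIONS.\n        "

stripped = strip_text(title)  # inner newline and indentation are still present
squished = squish_text(title) # a single line with single spaces

puts stripped
puts squished
```

With `.strip` alone, the title still contains the inner newline and indentation between "FOR" and "PREDICTING", so the inline case would need the second step. In a Rails context, ActiveSupport's `String#squish` performs a similar collapse, though whether to use it here is a separate decision.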