pgarrett-scripps / PubmedCoauthorStreamlitApp


Missing spaces in some affiliations #1

Closed magnuspalmblad closed 7 months ago

magnuspalmblad commented 7 months ago

It appears that the affiliations in some PubMed entries lack spaces between cities, states, and postal codes, e.g. "Institute for Systems Biology SeattleWashington98109 USA." in https://pubmed.ncbi.nlm.nih.gov/37969874/. Perhaps a geoparsing library could be used to clean up these and other similar errors in the PubMed affiliations?

pgarrett-scripps commented 7 months ago

Hi Magnus,

Thanks for bringing this to my attention. I just added a few options to help with data cleanup overall. The 'Split Camel Case' option should address most of the condensed affiliations you are encountering. It's regex-based, so it's not a perfect solution, but it appears to work well enough.
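
For illustration, a regex-based camel-case split could look roughly like the sketch below. This is an assumption about the general approach, not necessarily the app's actual code; the function name `split_condensed` and the exact patterns are hypothetical.

```python
import re

# Hypothetical sketch: zero-width boundaries where a space is likely missing.
_BOUNDARIES = re.compile(
    r"(?<=[a-z])(?=[A-Z])"      # lowercase then uppercase: "SeattleWashington"
    r"|(?<=[A-Za-z])(?=\d)"     # letter then digit: "Washington98109"
    r"|(?<=\d)(?=[A-Z][a-z])"   # digit then capitalized word: "98109Seattle"
)


def split_condensed(text: str) -> str:
    """Insert a space at each boundary matched by _BOUNDARIES."""
    return _BOUNDARIES.sub(" ", text)


if __name__ == "__main__":
    print(split_condensed("Institute for Systems Biology SeattleWashington98109 USA."))
```

On the example from this issue, the sketch yields "Institute for Systems Biology Seattle Washington 98109 USA.". It also shows why a regex is imperfect: legitimate mixed-case names such as "McDonald" or "PubMed" get split as well ("Mc Donald", "Pub Med").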

I'll look into a geoparsing library, but I fear the affiliation strings may just be too inconsistent.

Let me know if you encounter any more issues.

Best, Patrick