really incorrect match_affil results

simonatdrg commented 6 years ago

I created a set of affiliation strings from Pubmed abstracts that all include 'Harvard' (University, Medical School, etc.) and ran them through match_affil after downloading the most recent grid.csv dataset (which has many entries for Harvard, including 'Harvard medical school'). The script, input file and results are attached.

You'll see that there are very few matches, and of those quite a few are incorrect. I'm not an expert on the machine learning techniques involved. Can you explain how the matching works and possibly suggest ways to improve these results? Zip file with script, input data and outputs attached.

matchaffil_test.zip

titipata commented 6 years ago

Hi @simonatdrg, yeah, I totally agree. I was hard-coding Harvard when parsing it.

I'll take a look at it and fix it in the coming weeks! (Sorry, I'm a little busy this week.)

simonatdrg commented 6 years ago

Great – I look forward to it.

Here’s some background. We’re trying to use Pubmed affiliation data as an aid to researcher name disambiguation (not just across Pubmed author names, but also incorporating other sources such as physician lists and clinical trial data). We will also use things like the MeSH terms associated with a publication to refine the matching (e.g. a name associated with cardiac disease will most likely not be a match for someone with the same name who is associated with dermatology publications, even if they both work at the same organization).

Harvard is a good (and extreme) test case, as authors may have multiple affiliations (university / medical school / institute / teaching hospital), and one or more of these can occur in an affiliation string. I was attracted to the organization hierarchy present in the Grid dataset as a way to handle these.

Regards

-Simon
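As a rough illustration of how an organization hierarchy could help here, the sketch below rolls each matched organization up to its top-level parent, so that two affiliation strings resolving to different units of the same institution compare as equal. The `parent_of` mapping and the identifiers in it are hypothetical placeholders, not real Grid IDs, and nothing here reflects how the Grid files are actually laid out or how affiliation_parser works.

```python
def root_org(org_id, parent_of):
    """Follow parent links until an organization with no parent is reached."""
    seen = set()  # guards against accidental cycles in the parent links
    while org_id in parent_of and org_id not in seen:
        seen.add(org_id)
        org_id = parent_of[org_id]
    return org_id


def same_institution(org_a, org_b, parent_of):
    """Two organizations belong together if they share a top-level parent."""
    return root_org(org_a, parent_of) == root_org(org_b, parent_of)


# Hypothetical child -> parent links (placeholder labels, not Grid data).
parent_of = {
    "example-medical-school": "example-university",
    "example-teaching-hospital": "example-medical-school",
    "example-institute": "example-university",
}

print(same_institution("example-teaching-hospital", "example-institute", parent_of))
print(same_institution("example-institute", "unrelated-college", parent_of))
```

Whether a roll-up like this is appropriate depends on the use case: for name disambiguation it may be better to keep the matched unit and the top-level parent side by side rather than discard the more specific match.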



titipata commented 5 years ago

Hi @simonatdrg, sorry for the really late reply. I will work on this issue over the weekend. Hopefully that will resolve most of the issues here.

fangzhou-xie commented 4 years ago

Hi, I currently have a similar question. For example, when matching "Stanford University", it returns:

(Pdb++) match_affil('Stanford University')
OrderedDict([('ID', 'grid.440952.e'), ('Name', 'University of Belize'), ('City', 'Belmopan'), ('State', ''), ('Country', 'Belize')])

I have tried some other well-known universities (Harvard, Princeton, NYU, Columbia, Caltech, etc.), and match_affil works fine for those. I wonder if this result could be improved somehow?

Thank you!

titipata commented 4 years ago

Hi @mark-fangzhou-xie, yeah, I wrote this library a while ago and I really need to update the code in this repo. Currently, the matching is done with a nearest-neighbor algorithm. I will try to improve it over the next month if I have time.
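For readers unfamiliar with nearest-neighbor matching, here is a minimal, standard-library-only sketch of the general idea: each organization name is turned into a TF-IDF vector of character trigrams, and a query is matched to the name with the highest cosine similarity. This is not the code used in affiliation_parser; the function names, the trigram features, and the `min_score` cutoff are all assumptions made for illustration. The cutoff shows one way to return "no match" instead of an unrelated nearest neighbor.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Lowercased character n-grams, padded with spaces at both ends."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def normalize(vec):
    """Scale a sparse vector to unit length (empty vectors are returned as-is)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else vec


def build_index(names, n=3):
    """Compute IDF weights and unit-length TF-IDF vectors for candidate names."""
    docs = [Counter(char_ngrams(name, n)) for name in names]
    doc_freq = Counter(g for doc in docs for g in doc)  # one count per document
    total = len(names)
    idf = {g: math.log((1 + total) / (1 + c)) + 1.0 for g, c in doc_freq.items()}
    vectors = [normalize({g: tf * idf[g] for g, tf in doc.items()}) for doc in docs]
    return idf, vectors


def nearest(query, names, idf, vectors, n=3, min_score=0.3):
    """Return (best_name, score), or (None, score) if the best score is too low."""
    counts = Counter(char_ngrams(query, n))
    qvec = normalize({g: tf * idf[g] for g, tf in counts.items() if g in idf})
    scores = [sum(w * v.get(g, 0.0) for g, w in qvec.items()) for v in vectors]
    best = max(range(len(names)), key=scores.__getitem__)
    if scores[best] < min_score:
        return None, scores[best]
    return names[best], scores[best]


# Illustrative candidate list only; a real run would load names from grid.csv.
names = [
    "Stanford University",
    "University of Belize",
    "Harvard University",
    "Harvard Medical School",
]
idf, vectors = build_index(names)
print(nearest("Stanford University", names, idf, vectors))
print(nearest("zzzz qqqq", names, idf, vectors))
```

A nearest-neighbor search always has some closest candidate, so without a similarity cutoff it will return a result even when nothing in the index resembles the query. Checking the score before accepting a match is one inexpensive guard against that.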
