Not Finding People - Marcia, Leandro

vikrammanikantan commented 2 months ago

Mailer not finding or not matching the following people:

Marcia Rieke (on the Arizona website she has the middle initial "J", which might be confusing the code)
Leandro Beraldo e Silva. His name is exactly the same on the website, so I don't know why the mailer is not finding him.

lwhitler commented 2 months ago

Two things seem to be at work here, both related to how the mailer parses middle initials or, more generally, names with multiple spaces in them.

Marcia: the issue is indeed that she has the middle initial "J" on the website. The directory that is built from scraping the website considers all of "Marcia J" to be her first name. When the author list of the paper is parsed and each name is later checked against the directory (approximate_name_lookup), the lookup handles the case where the directory name is a subset of the author list name, but not vice versa. That is, a person can use an initial in the author list when it isn't in the directory, but not the other way around.
Leandro: when building the directory, the name is split as "Leandro" being the first name and "Beraldo e Silva" as the surname. When parsing the tex file of the paper, "Leandro Beraldo e" is the first name and "Silva" is the last name.
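
A rough sketch of the two splitting behaviors described above. This is not the mailer's actual code: the function names are made up for illustration, the directory entries are assumed to look like "LastName, FirstName", and the author-list parser is assumed to split at the last space.

```python
def split_directory_name(entry):
    """Split a 'LastName, FirstName [Middle]' entry on the comma (assumed format)."""
    last, first = (part.strip() for part in entry.split(",", 1))
    return first, last


def split_author_name(name):
    """Split an author-list name at its last space (assumed parser behavior)."""
    first, _, last = name.strip().rpartition(" ")
    return first, last


# Marcia: the first names differ ("Marcia J" vs. "Marcia").
print(split_directory_name("Rieke, Marcia J"))   # -> ('Marcia J', 'Rieke')
print(split_author_name("Marcia Rieke"))         # -> ('Marcia', 'Rieke')

# Leandro: both the first names and the surnames differ.
print(split_directory_name("Beraldo e Silva, Leandro"))  # -> ('Leandro', 'Beraldo e Silva')
print(split_author_name("Leandro Beraldo e Silva"))      # -> ('Leandro Beraldo e', 'Silva')
```

With these two conventions the same person ends up with different (first, last) pairs depending on where the name came from, which is why a direct comparison fails in both cases.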

My overall feeling on how to fix this: homogenize the way that names are split in the directory and in papers rather than adding more cases to check. The sensible human thing is probably to split after the first space, though we then need to be careful about last names (if "J Rieke" becomes Marcia's entire last name in the directory, it still needs to work when she publishes as Marcia Rieke and her surname gets parsed as "Rieke"). So maybe split after the last space? Or somehow separate out the middle parts of the name entirely?

Unsure yet how to achieve this. Parsing the tex file uses regex and the directory does something else, so maybe use regex for both?
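
One possible shape for this, as a minimal sketch rather than a worked-out fix: tokenize both names the same way, trust the comma in the directory entry to mark the surname, and require only that the surname tokens end the author name and that the first given name agrees (in full or as an initial). The names `name_tokens`, `is_initial_of`, and `matches_directory` are hypothetical, and the "LastName, FirstName" entry format is an assumption.

```python
import re


def name_tokens(name):
    """Lower-case a name, drop periods, and split it on whitespace."""
    return re.sub(r"\.", " ", name).lower().split()


def is_initial_of(token, full):
    """True if token is a single letter that starts the full name."""
    return len(token) == 1 and full.startswith(token)


def matches_directory(directory_entry, author_name):
    """Compare a 'LastName, FirstName [Middle]' entry with an author-list name."""
    surname, _, given = directory_entry.partition(",")
    surname_toks = name_tokens(surname)
    given_toks = name_tokens(given)
    author_toks = name_tokens(author_name)
    if not surname_toks or not given_toks:
        return False
    # The author name must end with the full directory surname
    # and still have at least one given-name token in front of it.
    if len(author_toks) <= len(surname_toks):
        return False
    if author_toks[-len(surname_toks):] != surname_toks:
        return False
    author_given = author_toks[:-len(surname_toks)]
    primary = given_toks[0]
    # Accept the directory's first given name anywhere among the author's
    # given names, or the author's leading token as its initial.
    return primary in author_given or is_initial_of(author_given[0], primary)


print(matches_directory("Rieke, Marcia J", "Marcia Rieke"))
print(matches_directory("Beraldo e Silva, Leandro", "Leandro Beraldo e Silva"))
print(matches_directory("Rieke, Marcia J", "George H. Rieke"))
```

Middle names and initials are ignored entirely here, so an initial-only author name such as "M. Rieke" would match any directory entry with that surname and a first name starting with "M". Whether that is acceptable depends on how many such collisions the directory has.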

vikrammanikantan commented 2 months ago

Two comments:

We can just split by all spaces, and then take the first element and the last element to be the first and last name, respectively.
I am a little confused. It sounds like the behavior is different in the two cases. For Marcia, everything but the last name is considered the first name. But for Leandro, everything but the first name is considered the last name? In other words, their names are being split at different locations, which does not seem good.

lwhitler commented 2 months ago

I don't think that will work for people who go by MiddleName LastName in the directory, but publish as e.g. FirstInitial MiddleName LastName; FirstInitial will get compared to MiddleName and fail.
The directory is being broken on a comma that separates LastName, FirstName (or LastName, FirstName MiddleInitial), so it's actually correctly sorting out which part of the name is which. I need to stare at the regex some more, but I think it might be assuming that just the word after the last space is the last name.
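
To make the first point concrete, here is a small sketch of the "first element and last element" split and the case where it breaks. The helper names are hypothetical, and the placeholder names are the ones used above rather than real directory entries.

```python
def first_and_last(name):
    """Split on whitespace (ignoring periods) and keep the first and last tokens."""
    parts = name.replace(".", " ").split()
    return parts[0], parts[-1]


def same_person(directory_name, author_name):
    """Naive comparison: first and last tokens must both be equal."""
    return first_and_last(directory_name) == first_and_last(author_name)


# Works for the two names in this issue.
print(same_person("Marcia J Rieke", "Marcia Rieke"))                        # -> True
print(same_person("Leandro Beraldo e Silva", "Leandro Beraldo e Silva"))    # -> True

# Fails when someone goes by their middle name in the directory
# but publishes with a leading first initial.
print(same_person("MiddleName LastName", "F. MiddleName LastName"))         # -> False
```

In the last case "F" is compared with "MiddleName", so the match fails even though the surname agrees.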

I think this wasn't working even with the old directory, so I don't think reverting will help. Incidentally, I found some old emails where Marcia was unidentified when her name was on the paper as Marcia Rieke instead of Marcia J. Rieke (and the same thing for George Rieke vs. George H. Rieke).

lwhitler / arxiv-mailer

Not Finding People - Marcia, Leandro #3