rycolab / aclpub2

author last name "de Lhoneux" incorrectly uppercased to "De Lhoneux"? #171

Open nschneid opened 2 months ago

nschneid commented 2 months ago

As reported by @mdelhoneux in acl-org/acl-anthology#3208

I wonder if https://github.com/rycolab/aclpub2/blob/47dc3d2b896aa359e984d8e6e37ac57a8bd80acd/openreview/util.py#L62-L63 might be the culprit.

nschneid commented 2 months ago

It looks like the heuristic implemented there is that each word of the name that is all-lowercase or all-uppercase is converted to initial capitalization.
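To make the failure concrete, here is a minimal sketch of the per-word heuristic as described above. It is a reconstruction from this description, not a copy of the code in openreview/util.py. The function name capitalize_per_word and the length guard are assumptions for illustration.

```python
def capitalize_per_word(last_name: str) -> str:
    """Hypothetical reconstruction of the per-word heuristic (not the real util.py code)."""
    if len(last_name) > 2:
        # Any word that is entirely lowercase or entirely uppercase gets
        # initial capitalization; mixed-case words are left alone.
        return " ".join(
            n[0].upper() + n[1:].lower() if (n.isupper() or n.islower()) else n
            for n in last_name.split(" ")
        )
    return last_name


print(capitalize_per_word("de Lhoneux"))  # De Lhoneux
print(capitalize_per_word("SMITH"))       # Smith
```

Because "de" is all-lowercase, it is capitalized on its own even though "Lhoneux" shows the author typed the name with deliberate casing.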

It might be better to adjust the capitalization only when no word in the name mixes uppercase and lowercase characters, i.e., when the whole name looks like it was typed without deliberate casing:

if len(last_name) > 2:
    # The name contains no word with both uppercase and lowercase characters,
    # so impose initial-only capitalization on each word.
    if all(n.isupper() or n.islower() for n in last_name.split(" ")):
        last_name = " ".join([n[0].upper() + n[1:].lower() if (n == n.upper() or n == n.lower()) else n for n in last_name.split(" ")])

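UPDATE: I realized the inline if condition is redundant, because the outer all(...) check already guarantees that every word is all-uppercase or all-lowercase: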

if len(last_name) > 2:
    # The name contains no word with both uppercase and lowercase characters,
    # so impose initial-only capitalization on each word.
    if all(n.isupper() or n.islower() for n in last_name.split(" ")):
        last_name = " ".join([n[0].upper() + n[1:].lower() for n in last_name.split(" ")])

mjpost commented 3 weeks ago

I would love to see a list of names as exported from OpenReview alongside the output of this function. We should really have a unit or regression test for this function, since it is very important and getting it wrong causes a lot of corrections and headaches downstream.
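One possible shape for such a regression test is sketched below. The function normalize_last_name is a hypothetical wrapper around the whole-name proposal from earlier in this thread, and the cases are illustrative examples, not names taken from an actual OpenReview export.

```python
def normalize_last_name(last_name: str) -> str:
    """Hypothetical wrapper around the whole-name proposal from this thread."""
    words = last_name.split(" ")
    # Only touch the name if no word mixes uppercase and lowercase.
    if len(last_name) > 2 and all(n.isupper() or n.islower() for n in words):
        return " ".join(n[0].upper() + n[1:].lower() for n in words)
    return last_name


# (raw input, expected output) pairs; illustrative only.
CASES = [
    ("de Lhoneux", "de Lhoneux"),  # deliberate casing is preserved
    ("DE LHONEUX", "De Lhoneux"),  # all-caps input still loses the lowercase particle
    ("smith", "Smith"),
    ("McDonald", "McDonald"),
]


def failing_cases():
    """Return (raw, expected, actual) for every case whose output differs."""
    results = [(raw, expected, normalize_last_name(raw)) for raw, expected in CASES]
    return [r for r in results if r[1] != r[2]]


print(failing_cases())  # []
```

Replacing CASES with a real export, paired with hand-checked expected values, would turn this into the side-by-side comparison requested above. The second case records a known limitation of the proposal rather than a desired outcome.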

crux82 commented 3 weeks ago

Hi @mjpost

I will do so when I download the list of authors from the next EMNLP. I will apply the update suggested by @nschneid so we can see the difference and maybe "fine-tune" it.