streetsidesoftware / cspell-dicts

Various cspell dictionaries

A new dictionary: first name, last name items #3236

Closed arkid15r closed 2 months ago

arkid15r commented 3 months ago

Similar to the companies dictionary, would it make sense to work on a list of people's names? I see some commonly used names in en_US, en_GB, cpp, and maybe other dictionaries. The idea is to have a separate dictionary for this sort of item.

The example terms:

Thanks for considering the idea.

ccoVeille commented 3 months ago

I like that idea, yes. I have thought about it.

I ran into many issues like that when using cspell (rather than typos or codespell, whose logic is inverted).

cspell expects everything to be invalid, unless in a dictionary/list, while typos and codespell have to keep adding things as invalid.
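The difference between the two approaches can be sketched as follows. This is a minimal illustration, not the actual implementation of cspell, typos, or codespell; the word lists and function names are hypothetical.

```python
# Hypothetical word lists, for illustration only.
ALLOWED_WORDS = {"the", "patch", "was", "written", "by"}
KNOWN_TYPOS = {"writen": "written", "teh": "the"}


def allowlist_check(words, allowed):
    """cspell-style logic: flag every word that is not in a dictionary."""
    return [w for w in words if w.lower() not in allowed]


def denylist_check(words, known_typos):
    """typos/codespell-style logic: flag only words known to be misspellings."""
    return [w for w in words if w.lower() in known_typos]


text = "the patch was writen by Jon".split()
allowlist_flags = allowlist_check(text, ALLOWED_WORDS)
denylist_flags = denylist_check(text, KNOWN_TYPOS)

print(allowlist_flags)  # ['writen', 'Jon']
print(denylist_flags)   # ['writen']
```

With the allowlist logic, the name "Jon" is reported until someone adds it to a dictionary, which is the false positive this issue is about.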

The main issue with a first-name list (which would be extended to a full name list at some point) is that it will end up being very long, and there will always be more names to add. But yes, it would be simpler for people using cspell: they would have to add fewer entries to the cspell ignore list in their repository.

About how to build such a list, we could maintain it by using Wikipedia pages, demographic data, GitHub parsing, Wikidata export.

But I like the idea, I may add it to my personal project https://github.com/ccoveille/jargon

Jason3S commented 3 months ago

@arkid15r and @ccoVeille,

A dictionary of people names could definitely be useful. People names are one of the most common false positives that occur.

Having the list grow organically by contribution should be sufficient as opposed to an active curation process. The challenge is how to moderate when a name gets added to the list and how/if to segment the list. Common names from all cultures/languages should be allowed.

Any ideas on a clear set of rules that a moderator of the list should follow when accepting additions / removals?

ccoVeille commented 3 months ago

Thanks for your reply.

I agree with you it could be organic.

Let's start with the easiest topic: the removal rules.

I think whoever asks for something to be removed would come with a reason. I mean they would say "blah blah whatever should be removed because it's a misspelling of blah blah in this language…"

About the rules for adding names, I have no clear idea how things would evolve, because a first name or last name in one language could be a typo in another. But with cultural exchange and name spreading, the list

So I would say that adding something to the name dictionary should be validated against a popularity threshold.

This is something that could be validated by calling external APIs such as github.com or Wikidata.

I'm also thinking about looking at the cspell.json files available in public repositories, where we would find names (but also other words) that people have already added to work around cspell issues.
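A popularity threshold based on cspell.json files could look roughly like this. It is only a sketch under several assumptions: the configuration contents are passed in as strings (fetching them from public repositories is left out), each file is strict JSON, and candidates are read from the `words` field. The function names and sample data are hypothetical.

```python
import json
from collections import Counter


def count_repos_per_word(configs):
    """Count in how many configuration files each word appears."""
    counts = Counter()
    for raw in configs:
        words = json.loads(raw).get("words", [])
        # Use a set so a word counts at most once per repository.
        counts.update({w.lower() for w in words})
    return counts


def popular_candidates(configs, min_repos):
    """Return the words that appear in at least min_repos configurations."""
    counts = count_repos_per_word(configs)
    return sorted(w for w, n in counts.items() if n >= min_repos)


# Hypothetical sample data standing in for cspell.json files.
SAMPLE_CONFIGS = [
    '{"words": ["Jon", "John", "kubectl"]}',
    '{"words": ["john", "Jon"]}',
    '{"words": ["John"]}',
]

candidates = popular_candidates(SAMPLE_CONFIGS, min_repos=2)
print(candidates)  # ['john', 'jon']
```

A count like this cannot tell a name from any other word that several projects allow, so the result would still need review by a moderator.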

arkid15r commented 3 months ago

how/if to segment the list. Common names from all cultures/languages should be allowed.

I believe it'd be hard to segment based on country/culture of origin. Not sure if other segmentation criteria are applicable here. So from my point of view segmentation is not an issue here (unless I'm unaware of some performance or other important consequences of not having the dictionary segmented).

Any ideas on a clear set of rules that a moderator of the list should follow when accepting additions / removals?

What are the rules for existing dictionaries? Check references, Google search results, other sources? I'd also like to have a better understanding of the common reasons for removing dictionary items.

Thank you!

Jason3S commented 3 months ago

I would like to be able to share the moderation process when it comes to adding/changing/removing words from the various dictionaries. To do that, I think it makes sense to have a base set of rules to follow. If we had been discussing a dictionary for a programming language like Smalltalk, then the rules would be simpler. Since this issue is about people's names, it opens up possible issues that wouldn't otherwise occur.

Segmentation Options

I brought this up because I would prefer to avoid segmentation based upon gender, religion, or nationality. I would rather have it clearly stated in the dictionary to avoid future conflicts.

  1. Bag of Words - all the names are just added into a single file. No distinctions are made.
  2. By character set - Names using characters from Latin, Greek, Cyrillic, Arabic, ...
  3. Grouped by approximate word frequency (occurrences per million words). Think of the 80/20 rule. The top 20% of names cover 80% of the occurrences. (I have not measured this, the idea is based upon long tail graphs.)
  4. Some other suggestion...
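Option 2 could be approximated as shown below. This is a rough sketch, not a proposal for how the dictionary is built: Python's standard library does not expose the Unicode Script property, so the first word of each character's Unicode name is used as a stand-in for its script. The function names and sample names are hypothetical.

```python
import unicodedata


def scripts_of(name):
    """Approximate the scripts used in a name from Unicode character names."""
    scripts = set()
    for ch in name:
        if ch.isalpha():  # skip hyphens, apostrophes, combining marks
            # Example: 'LATIN CAPITAL LETTER J' -> 'LATIN'
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def group_by_script(names):
    """Group names by script; mixed-script names get a combined key."""
    groups = {}
    for name in names:
        key = "+".join(sorted(scripts_of(name))) or "UNKNOWN"
        groups.setdefault(key, []).append(name)
    return groups


groups = group_by_script(["John", "Jon", "Иван", "Γιάννης"])
print(groups)
```

A name that mixes characters from two scripts ends up under a combined key such as `CYRILLIC+LATIN`, which hints at the kind of edge case a moderator would have to rule on.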

Additions and removals.

Removals

Additions

ccoVeille commented 3 months ago

Your reply is great. It makes a lot of sense.

May I suggest completing CONTRIBUTION.MD and maybe CODE_OF_CONDUCT.md (even though that one is already pretty complete) with this useful information?

ccoVeille commented 3 months ago

Bag of Words - all the names are just added into a single file. No distinctions are made.

I would go with this one. The second option is awkward, because you may face debates like: "this one is Cyrillic, not Greek, because whatever character is Cyrillic, not Greek". Or worse: "this word is Korean, Chinese, Japanese…"

ccoVeille commented 3 months ago

Grouped by approximate word frequency (occurrences per million words). Think of the 80/20 rule. The top 20% of names cover 80% of the occurrences. (I have not measured this, the idea is based upon long tail graphs.)

This approach would only be valid if you intended to add only a subset of words, for performance reasons for example. But you said you wanted to allow any names.

Also, splitting by frequency would lead to debates such as "John is/was pretty common" or "we already added Jon, should we group it with John".

Or if you split by frequency because, let's say, you plan to have "common names" and "rare names", you would have people arguing "but whatever is not common" (or the opposite), or you would simply get PRs adding a name to common that is already present in rare. Then the question would be whether to add it or to ask the user to import the rare names.

That's why I prefer the "bag of words".
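For reference, the 80/20 assumption behind the frequency option could be checked with something like the sketch below. The counts are made up for illustration, nothing here has been measured, and the function name is hypothetical.

```python
def names_needed_for_coverage(counts, target_percent):
    """Return the most frequent names that together reach target_percent of all occurrences."""
    total = sum(counts.values())
    covered = 0
    selected = []
    # Most frequent first; ties are broken alphabetically for stable output.
    for name, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered * 100 >= target_percent * total:
            break
        selected.append(name)
        covered += n
    return selected


# Hypothetical occurrence counts, not real data.
SAMPLE_COUNTS = {"john": 50, "jon": 30, "maria": 10, "ivan": 5, "yuki": 5}

top = names_needed_for_coverage(SAMPLE_COUNTS, 80)
print(top)  # ['john', 'jon']
```

Any such cut-off still leaves the question raised above: a name just below the threshold is either missing or has to live in a second list.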

Jason3S commented 3 months ago

@arkid15r and @ccoVeille,

I have created a first pass at the People Names dictionary. It is currently empty. Please take a look at the README.md.

ccoVeille commented 2 months ago

Hey @arkid15r @Jason3S and anyone interested

I have an idea to help populate the people-name dictionary.

Let's continue the discussion there