jmsv / ety-python

A Python module to discover the etymology of words
http://ety-python.rtfd.io
MIT License

case sensitivity #40

Closed parker57 closed 6 years ago

parker57 commented 6 years ago

I'm not sure about making the module case insensitive like in #39. It might be a good idea to have a case-insensitive option, or to look for a lowercase version of an English word if the version presented yields no results. There are, however, a few words I can think of that have distinct etymologies but would be identical if rendered in lowercase.

Here are a few examples:

wasp, annoyingly, isn't in the data, but WASP (acronym for White Anglo-Saxon Protestant) is.

Turkey refers to a country; we get the word turkey from Turkey, but it refers to the large guinea fowl that, I guess, Turkish people sold.

wed is a verb, "to marry"; Wed is an abbreviation for the third day of the week, named after Woden.

$ ety turkey Turkey wed Wed don Don median Median wasp WASP -t
No origins found for word: wasp
turkey (English)
└── Turkey (English)
    └── Turquie (French)

Turkey (English)
└── Turquie (French)

wed (English)
└── weddian (Old English (ca. 450-1100))

Wed (English)
└── Wednesday (English)
    ├── Wednesdai (Middle English (1100-1500))
    └── day (English)
        └── day (Middle English (1100-1500))
            └── dæg (Old English (ca. 450-1100))

don (English)
└── dominus (Latin)
    └── domus (Latin)

Don (English)
└── Donald (English)

median (English)
└── median (Middle French (ca. 1400-1600))
    └── medianus (Latin)
        ├── -anus (Latin)
        └── medius (Latin)

Median (English)
├── -ian (English)
│   └── -anus (Latin)
└── Mede (English)
    └── Medus (Latin)
        └── Μῆδος (Ancient Greek (to 1453))

WASP (English)
└── Anglo-Saxon (English)

There are over a thousand such examples in the English language, and it's important to remember that most of the words in the data aren't English. I don't know how important case sensitivity is for the other 255 languages.

Personally, I would leave it case sensitive for now. Users can always lowercase the words they pass into our functions if they want to.
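To make the trade-off concrete, here is a minimal sketch of what caller-side lowercasing does to the examples above. The `TOY_ORIGINS` dict and both function names are hypothetical stand-ins (not part of ety); the entries are copied from the CLI output earlier in this comment.

```python
# Toy stand-in for the real data, using direct origins from the output above.
# In practice the lookup would go through the ety module instead.
TOY_ORIGINS = {
    "turkey": ["Turkey"],
    "Turkey": ["Turquie"],
    "wed": ["weddian"],
    "Wed": ["Wednesday"],
    "WASP": ["Anglo-Saxon"],
}


def lookup(word):
    """Hypothetical case-sensitive lookup; returns [] when nothing matches."""
    return TOY_ORIGINS.get(word, [])


def lookup_lowercased(word):
    """What a caller gets if they lowercase every word before the lookup."""
    return lookup(word.lower())


if __name__ == "__main__":
    for word in ("Wed", "Turkey", "WASP"):
        print(word, lookup(word), lookup_lowercased(word))
```

With this toy data, lowercasing "Wed" silently returns the etymology of the verb, and lowercasing "WASP" returns nothing at all, which is why leaving the choice to the caller seems safer than doing it inside the module.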

alxwrd commented 6 years ago

Good point.

It was mainly for #8. The main example I can think of where this would be useful is:

$ python -m ety -t america America
No origins found for word: america
America (English)
└── Americus (English)

But I agree that we should probably revert this change.

For the API, I'd probably suggest case-sensitive always.

For the CLI, we could possibly have a "did you mean ... ?" prompt.
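A rough sketch of how such a hint might work, assuming we can iterate over the words in the data. `KNOWN_WORDS`, `did_you_mean` and `report_missing` are hypothetical names for illustration, not existing ety functions, and the word list is a toy sample taken from this thread.

```python
# Toy sample of words present in the data (taken from the examples in this thread).
KNOWN_WORDS = ["America", "Turkey", "turkey", "WASP", "wed", "Wed"]


def did_you_mean(word, known_words=KNOWN_WORDS):
    """Return known words that differ from `word` only by letter case."""
    folded = word.casefold()
    return sorted(w for w in known_words if w.casefold() == folded and w != word)


def report_missing(word):
    """Build the CLI message for a word with no origins, plus any case hints."""
    message = f"No origins found for word: {word}"
    suggestions = did_you_mean(word)
    if suggestions:
        message += f"\nDid you mean {', '.join(suggestions)}?"
    return message


if __name__ == "__main__":
    print(report_missing("america"))
```

This keeps the lookup itself case sensitive and only uses case folding to build the hint, so "america" still yields no origins but points the user at "America".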

jmsv commented 6 years ago

Yeah, I agree, although for use via the Python API I think an ignore_case/case_insensitive param (or something similar) would be a good addition.
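Something like the sketch below is what I have in mind. It is only an illustration under assumptions: `origins_sketch` and `TOY_ORIGINS` are hypothetical and don't reflect the module's actual signature or data layout. The default stays case sensitive, and with `ignore_case=True` every spelling that matches is returned separately so distinct etymologies aren't merged.

```python
from collections import defaultdict

# Toy stand-in for the real data, using direct origins quoted in this thread.
TOY_ORIGINS = {
    "wed": ["weddian"],
    "Wed": ["Wednesday"],
    "America": ["Americus"],
    "WASP": ["Anglo-Saxon"],
}

# Index from case-folded spelling to every spelling present in the data.
_CASE_INDEX = defaultdict(list)
for _spelling in TOY_ORIGINS:
    _CASE_INDEX[_spelling.casefold()].append(_spelling)


def origins_sketch(word, ignore_case=False):
    """Return {spelling: origins}; case sensitive unless ignore_case is True."""
    if ignore_case:
        spellings = _CASE_INDEX.get(word.casefold(), [])
    else:
        spellings = [word] if word in TOY_ORIGINS else []
    return {spelling: TOY_ORIGINS[spelling] for spelling in spellings}


if __name__ == "__main__":
    print(origins_sketch("america"))
    print(origins_sketch("america", ignore_case=True))
    print(origins_sketch("wed", ignore_case=True))
```

Returning a mapping keyed by spelling means a query like "wed" with `ignore_case=True` shows both "wed" and "Wed" and lets the caller decide which one they meant.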