kuhumcst / glossematics

The life of Louis Hjelmslev.
https://glossematics.dk

Deal with identical labels for different entities #57

Closed simongray closed 2 years ago

simongray commented 2 years ago

For example, "Acta Jutlandica" is the label of both #narch10 and #npub19, so the two entities will inevitably clash.

The search input should allow the user to differentiate which variant they want.

simongray commented 2 years ago

One way to solve this is to test every name in use for set membership in every category of names. When there is a clash, the name is replaced with "name (category)" in every category where it occurs. This should happen before the search metadata is returned, so that the result is cached along with it. The actual names in the database must not change, only the ones in the search metadata.
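
A minimal sketch of that idea, written in Python for illustration only. The function name `disambiguate_labels`, the category keys `archive` and `publication`, and the sample names other than "Acta Jutlandica" are assumptions, not taken from the project:

```python
from collections import defaultdict


def disambiguate_labels(names_by_category):
    """Return a copy of the search metadata with clashing names suffixed."""
    # Record which categories each name appears in.
    categories_by_name = defaultdict(set)
    for category, names in names_by_category.items():
        for name in names:
            categories_by_name[name].add(category)

    # A name clashes when it is a member of more than one category.
    clashing = {n for n, cats in categories_by_name.items() if len(cats) > 1}

    # Build new sets; the input (standing in for the database names) is untouched.
    return {
        category: {
            f"{name} ({category})" if name in clashing else name
            for name in names
        }
        for category, names in names_by_category.items()
    }


# Hypothetical search metadata with one clash.
search_metadata = {
    "archive": {"Acta Jutlandica", "Example Archive"},
    "publication": {"Acta Jutlandica", "Example Journal"},
}
result = disambiguate_labels(search_metadata)
```

Here `result["archive"]` contains "Acta Jutlandica (archive)" while "Example Archive" is left alone, and `search_metadata` itself is unchanged.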

simongray commented 2 years ago

Perhaps an even better solution is to keep a separate set of problematic names and then simply affix "(entity type)" to the labels dynamically when needed, since this avoids creating superfluous labels in the metadata field. We really just need the entity type labels in the actual search form.
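
A rough sketch of this variant, again in Python and under assumed names (`find_problematic_names`, `display_label`, and the category keys are hypothetical). The metadata keeps the plain names, and the suffix is only computed when a label is rendered:

```python
from collections import Counter


def find_problematic_names(names_by_category):
    """Collect the names that occur in more than one category."""
    counts = Counter(
        name
        for names in names_by_category.values()
        for name in set(names)  # count each name once per category
    )
    return {name for name, n in counts.items() if n > 1}


def display_label(name, entity_type, problematic):
    """Affix the entity type only when the plain name would be ambiguous."""
    if name in problematic:
        return f"{name} ({entity_type})"
    return name


# Hypothetical metadata; only "Acta Jutlandica" is shared between categories.
names_by_category = {
    "archive": ["Acta Jutlandica", "Example Archive"],
    "publication": ["Acta Jutlandica", "Example Journal"],
}
problematic = find_problematic_names(names_by_category)
```

With this data, `display_label("Acta Jutlandica", "publication", problematic)` gives "Acta Jutlandica (publication)", whereas "Example Journal" is returned as-is.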

simongray commented 2 years ago

It seems that remarkably few words actually suffer from this issue.

simongray commented 2 years ago

When not normalising by case, these are duplicates:

'("New Haven" "operation" "Kalispel" "definition" "Acta Jutlandica" "Société genevoise de linguistique" "connexion")

The same, but capitalised when put in the db:

'("Operation" "Kalispel" "Phonematics" "New haven" "Société genevoise de linguistique" "Definition" "Connexion")

The only change is "Acta Jutlandica" being replaced by "Phonematics".
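
Checks like these could be sketched as below. This is an illustration in Python, not the code that produced the lists: the function `duplicates`, the category keys, and the assignment of sample names to categories are all made up for the example.

```python
from collections import defaultdict


def duplicates(names_by_category, normalise=lambda s: s):
    """Return the normalised names that occur in more than one category."""
    categories_by_key = defaultdict(set)
    for category, names in names_by_category.items():
        for name in names:
            categories_by_key[normalise(name)].add(category)
    return sorted(k for k, cats in categories_by_key.items() if len(cats) > 1)


# Hypothetical sample: the same names in differing case across categories.
sample = {
    "place": ["New Haven", "kalispel"],
    "term": ["Kalispel", "operation"],
    "organisation": ["NEW HAVEN"],
}

exact = duplicates(sample)                  # compare names as stored
lowered = duplicates(sample, str.lower)     # compare after lower-casing
```

In this sample, `exact` is empty because no two spellings match exactly, while `lowered` is `["kalispel", "new haven"]`.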


When normalising by lower-case, these are duplicates:

'("indholdsplan"
 "acta jutlandica"
 "achumawi"
 "kongressen i london"
 "kenem"
 "lakkisk"
 "temativ"
 "rationel semantik (pleremik)"
 "conseil international"
 "byzantisk"
 "altasisk"
 "jyllands-posten"
 "kenologi"
 "lingvistkredsen"
 "tokharisk"
 "le maître phonétique"
 "nordisk filologmøde"
 "realitet som grammatisk kategori"
 "filosofisk selskab"
 "ikke-indoeuropæiske sprog"
 "kyrkansk"
 "aranta"
 "glossematics"
 "alfa"
 "maidu"
 "operation"
 "prosodem"
 "tabassaransk"
 "definition"
 "new haven"
 "oldnordisk"
 "société genevoise de linguistique"
 "lingvistkongressen"
 "statsbiblioteket"
 "on the structural interpretation of diphthongs"
 "avarisk"
 "århus kommunehospital"
 "hamburg universitet"
 "unesco"
 "labarsaransk"
 "ikke - indoeuropæisk"
 "société de linguistique"
 "kalispel"
 "bulletin du cercle linguistique de copenhague"
 "connexion"
 "phonematics")

If we capitalise upon inserting into the db, only these are left:

'("acta jutlandica"
 "operation"
 "definition"
 "new haven"
 "société genevoise de linguistique"
 "kalispel"
 "connexion"
 "phonematics")

simongray commented 2 years ago

One thing to note about capitalising strings is that the function converts the entire string to lower-case and then capitalises only the first character. This is how "Acta Jutlandica" and "ACTA JUTLANDICA" (the raw data) both become just "Acta jutlandica" (with "jutlandica" fully lower-case).
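
Python's built-in `str.capitalize` happens to behave the same way, so it can illustrate the effect (this is only an analogy, not the function used in the project):

```python
# Two spellings of the same label, as in the raw data.
raw_labels = ["Acta Jutlandica", "ACTA JUTLANDICA"]

# str.capitalize lower-cases everything, then upper-cases the first character.
capitalised = {label.capitalize() for label in raw_labels}

print(capitalised)  # {'Acta jutlandica'}
```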

While it makes sense to capitalise only the first character of a document title, for most other entity names it probably makes more sense to capitalise every word in the string. Person names are the exception, since they might have special rules, e.g. de/da in Romance language names.
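
A small Python sketch of that rule, assuming a hypothetical `entity_type` value of `"person"` and hypothetical helper names; person names are simply passed through untouched here rather than given any special handling:

```python
def capitalise_words(label):
    """Capitalise each space-separated word, lower-casing the rest of it."""
    return " ".join(word.capitalize() for word in label.split(" "))


def normalise_label(label, entity_type):
    """Capitalise every word, except in person names (assumed type key)."""
    if entity_type == "person":
        return label  # leave particles such as de/da alone
    return capitalise_words(label)


title_cased = normalise_label("ACTA JUTLANDICA", "publication")
society = normalise_label("société genevoise de linguistique", "organisation")
person = normalise_label("Ferdinand de Saussure", "person")
```

Here `title_cased` is "Acta Jutlandica" and `person` keeps its lower-case "de". Note that `society` becomes "Société Genevoise De Linguistique", so short function words are capitalised too.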


The duplicates when capitalising every word for non-person entities:

'("New Haven"
 "Operation"
 "Société Genevoise De Linguistique"
 "Kalispel"
 "Acta Jutlandica"
 "Phonematics"
 "Definition"
 "Connexion")

The same duplicates exist when lower-casing:

'("acta jutlandica"
 "operation"
 "definition"
 "new haven"
 "société genevoise de linguistique"
 "kalispel"
 "connexion"
 "phonematics")