pjheslin / diogenes

Diogenes: an environment for reading Latin and Greek
https://d.iogen.es/d

TLL pdf search: "Dia" leads to "ia" #61

Closed mingshey closed 4 years ago

mingshey commented 4 years ago

Dia (proper name) is a headword in the 46th TLL PDF, on page 63, and is bookmarked as such, but its link to the TLL leads to "ia" (or "iaceo") in the 26th PDF, page 7. I suspect a bug in the keyword-handling algorithm drops the capital "D".

[edit-again] Line 601 of Perseus.pm, $word =~ s/[^a-z]//g;, is there to remove the troublesome diacritical marks (as the resolution to issue #52), but it overdoes its job. tll-bookmarks.txt contains proper names that start with capital letters, all of which precede the lower-case words. L-S may contain only a few proper names, but "Dia" is one of the rare exceptions, and it has an entry in the TLL too. Line 601 chops off its capital "D" and renders it as "ia". So my suggestion is to replace line 601 with $word =~ s/[^A-Za-z]//g;, though a somewhat more sophisticated treatment might be needed.
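To make the difference concrete, here is a minimal standalone sketch that compares the two substitutions on the headword "Dia". Only the two regular expressions come from this issue; the helper names strip_lowercase_only and strip_keep_capitals are hypothetical and do not exist in Perseus.pm.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the substitution quoted from line 601.
sub strip_lowercase_only {
    my ($word) = @_;
    $word =~ s/[^a-z]//g;       # deletes everything except a-z, capitals included
    return $word;
}

# Hypothetical helper: applies the replacement suggested in this issue.
sub strip_keep_capitals {
    my ($word) = @_;
    $word =~ s/[^A-Za-z]//g;    # deletes everything except ASCII letters
    return $word;
}

print strip_lowercase_only('Dia'), "\n";    # prints "ia"
print strip_keep_capitals('Dia'), "\n";     # prints "Dia"
```

With the first substitution the lookup key becomes "ia", which matches the symptom described above; with the second it stays "Dia".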

P.S. I could fork and commit this one, but I know I am prone to goofs, so I am restricting myself to local modifications and tests.

pjheslin commented 4 years ago

You are quite right about the capitalization bug in line 601. But there was also another bug here: the Latin lemmata arrive as UTF-8 from the XML file, and some of them carry macrons and breves (including Dīa). We need to remove the diacritics without removing the vowel.
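One common way to strip a diacritic while keeping its vowel is to decompose the string first and then delete only the combining marks. The sketch below illustrates that general technique with the core Unicode::Normalize module; it is not taken from the commits mentioned in this thread, and the helper name strip_diacritics is hypothetical. It assumes the lemma has already been decoded into a Perl character string.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper, not the code in Perseus.pm.
# Expects a decoded character string, not raw UTF-8 bytes.
sub strip_diacritics {
    my ($word) = @_;
    $word = NFD($word);         # e.g. U+012B becomes "i" + U+0304 COMBINING MACRON
    $word =~ s/\p{Mn}//g;       # remove the combining marks only
    $word =~ s/[^A-Za-z]//g;    # then drop anything that is still not an ASCII letter
    return $word;
}

my $lemma = "D\x{12B}a";        # "Dia" with a macron over the i
print strip_diacritics($lemma), "\n";    # prints "Dia"
```

Applying s/[^A-Za-z]//g alone to the same string would delete the accented vowel outright and leave "Da", which is why the decomposition step comes first.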

I've fixed these two bugs with commits 22bd5e3d27d6a640bf535dc388f3016122414704 and c49d4454cf3bd007aab173c5b57733377afa8b60, so this will be fixed in the next release.

Apologies for not responding to this issue for so long -- I was caught up with marking exams.