Closed mingshey closed 4 years ago
You are quite right about the capitalization bug in line 601. But there was also a second bug here: the Latin lemmata come from the XML file as UTF-8, and some of them carry macrons and breves (including Dīa). We need to remove the diacritics without removing the vowel underneath.
I've fixed these two bugs with commits 22bd5e3d27d6a640bf535dc388f3016122414704 and c49d4454cf3bd007aab173c5b57733377afa8b60, so this will be fixed in the next release.
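For readers who want to see the idea without opening the commits, here is a minimal sketch of one way to strip macrons and breves while keeping the vowel. It is not the content of the commits above: the helper name `strip_diacritics` is hypothetical, and the sketch assumes the lemma has already been decoded into a Perl character string (for example with `Encode::decode`) rather than being raw UTF-8 bytes.

```perl
use strict;
use warnings;
use utf8;                           # this source file contains UTF-8 literals
use Unicode::Normalize qw(NFD);     # core module, provides canonical decomposition

# Hypothetical helper, not the project's actual code.
sub strip_diacritics {
    my ($word) = @_;
    $word = NFD($word);             # decompose: "ī" becomes "i" + U+0304 COMBINING MACRON
    $word =~ s/\p{Mn}//g;           # drop the combining marks, keep the base letters
    $word =~ s/[^A-Za-z]//g;        # remove anything left that is not a plain Latin letter
    return $word;
}

print strip_diacritics("Dīa"), "\n";
```

Decomposing first is what keeps the vowel: the mark becomes a separate code point that can be removed on its own, so the final character-class filter only sees plain letters.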
Apologies for not responding to this issue for so long -- I was caught up with marking exams.
Dia (proper name) is a headword in the 46th TLL PDF, on page 63, and is bookmarked as such, but its link to TLL leads to "ia" (or "iaceo") in the 26th PDF, page 7. I suspect a bug in the keyword handling drops the capital "D".
[edit-again] Line 601 of Perseus.pm,
$word =~ s/[^a-z]//g;
is there to remove the troublesome diacritical marks (as the resolution to issue #52), but it overdoes its job. tll-bookmarks.txt contains proper names that start with capital letters, all of which precede the lower-case words. L-S may contain only a few proper names, but "Dia" is one of the rare exceptions, and it has an entry in TLL, too. Line 601 chops off its capital "D" and renders it "ia". So my suggestion is to replace line 601 with
$word =~ s/[^A-Za-z]//g;
or possibly a somewhat more sophisticated treatment might be needed.
P.S. I could fork and commit this one, but I know I tend to make goofs, so I am restricting myself to local modifications and tests.
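To make the difference concrete, the sketch below runs the two substitutions quoted above on "Dia" and "iaceo". The wrapper names `clean_lowercase_only` and `clean_both_cases` are hypothetical and exist only for this illustration; they are not functions in Perseus.pm.

```perl
use strict;
use warnings;

# Hypothetical wrappers around the two substitutions quoted above.
sub clean_lowercase_only {
    my ($word) = @_;
    $word =~ s/[^a-z]//g;       # current line 601: also deletes capital letters
    return $word;
}

sub clean_both_cases {
    my ($word) = @_;
    $word =~ s/[^A-Za-z]//g;    # suggested replacement: keeps capital letters
    return $word;
}

for my $lemma ("Dia", "iaceo") {
    printf "%s: current=%s suggested=%s\n",
        $lemma, clean_lowercase_only($lemma), clean_both_cases($lemma);
}
```

With the current character class "Dia" collapses to "ia", which is how it ends up matched against the wrong bookmark, while the wider class leaves it as "Dia". Lower-case lemmata such as "iaceo" come out the same either way.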