typiconman / ponomar

Ponomar: a liturgics suite for the Orthodox Church
http://www.ponomar.net/
GNU General Public License v3.0

Implement a simple search feature #8

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
There is existing code for a simple search feature in the search.java file. 
However, it is not yet connected to the Search menu item, and I'm not yet sure 
if the code works.

Let's discuss the status of the code, and any technical difficulties such as 
the Saint ID problems that Aleks mentioned by email.

Original issue reported on code.google.com by ps008v...@gmail.com on 6 Feb 2015 at 5:05

lemtom commented 3 years ago

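I'm currently working on this. [image: https://user-images.githubusercontent.com/10900989/103088705-e36af200-45eb-11eb-810a-5164c3776410.png] I've changed the search results from text to tabular data. Clicking on a row opens the corresponding commemoration in a new window.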

Are there any specific features that should be added?

mamyt commented 3 years ago

I think this is a great feature that is really needed.

Some comments regarding Unicode. Firstly, we must make sure that, irrespective of how the user enters the text, it is decomposed so that searching works properly. The problem is that diacritical marks (mostly for the Latin and Greek alphabets) can be entered in one of two ways: either as a precomposed character (ä) or as a decomposed sequence (a + ◌̈). Although the two look identical, the underlying representation is different. According to the Unicode standard, the two are canonically equivalent and should be treated identically. However, we need to check that this has actually been implemented; I am afraid that Java may not handle it correctly.
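A quick way to check the concern above, as a minimal sketch: it assumes the JDK's standard `java.text.Normalizer` is used, and the class name `UnicodeForms` and its helpers are hypothetical, not part of Ponomar. A plain `String.equals` sees the two spellings of ä as different, while comparing their NFD forms treats them as equal.

```java
import java.text.Normalizer;

final class UnicodeForms {
    /** Returns the canonical decomposition (NFD) of the given text. */
    static String decompose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    /** True if the two strings are equal once both are decomposed. */
    static boolean canonicallyEqual(String a, String b) {
        return decompose(a).equals(decompose(b));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E4"; // ä as a single code point
        String decomposed = "a\u0308"; // a followed by COMBINING DIAERESIS
        System.out.println(precomposed.equals(decomposed));            // false
        System.out.println(canonicallyEqual(precomposed, decomposed)); // true
    }
}
```

If the search term and the stored names both go through the same normalization before they are compared, the way the user typed the character should no longer matter.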

Regarding Church Slavonic searching, I think it would be essential to have two options: strict and relaxed. In strict mode, the search engine looks for the exact spelling of the word. In relaxed mode, it searches using a normalised form of the word: for example, diacritical marks are stripped and the letters within each of the sets {и, і}, {е, є}, {о, ѻ, Ѡ}, {ꙗ, ѧ} are treated as equivalent. Superscript letters would also need to be handled somehow. Finally, abbreviations could be expanded (I have a list of all modern Church Slavonic abbreviations, which would cover all cases). The same could apply to Greek with respect to stripping the diacritical marks. This is especially important since not everyone will know exactly how to spell a word in Church Slavonic, and the spelling can change during word formation, e.g. ѻ҆те́цъ (nominative singular), ѻ҆тє́цъ (genitive plural), and пра́ѻтецъ, all of which should be found if we search for “ѻтецъ”. Normalising these forms gives отецъ, отецъ, and праотецъ, which are then easily found.
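The relaxed mode described above could be sketched roughly as follows. Everything here is an assumption for illustration: the class `SlavonicRelaxedSearch`, the choice of representative letter for each set, and the small mapping table, which covers only the example sets listed above. Superscript letters and abbreviation expansion are not handled.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.Map;

final class SlavonicRelaxedSearch {
    // Illustrative subset only: each variant letter maps to one representative.
    private static final Map<Character, Character> EQUIVALENTS = Map.of(
            '\u0456', '\u0438',  // і -> и
            '\u0454', '\u0435',  // є -> е
            '\u047B', '\u043E',  // ѻ -> о
            '\u0461', '\u043E',  // ѡ -> о
            '\u0467', '\uA657'); // ѧ -> ꙗ

    /** Strips combining marks, lowercases, and folds equivalent letters. */
    static String relax(String text) {
        // NFD also splits letters such as й into a base letter plus a mark,
        // so stripping marks folds й into и as a side effect.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String lower = decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder(lower.length());
        for (char c : lower.toCharArray()) {
            char mapped = EQUIVALENTS.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }

    /** Relaxed substring match: both sides are normalised the same way. */
    static boolean relaxedContains(String haystack, String needle) {
        return relax(haystack).contains(relax(needle));
    }

    public static void main(String[] args) {
        String nominative = "\u047B\u0486\u0442\u0435\u0301\u0446\u044A"; // ѻ҆те́цъ
        String forefather = "\u043F\u0440\u0430\u0301\u047B\u0442\u0435\u0446\u044A"; // пра́ѻтецъ
        System.out.println(relax(nominative));
        System.out.println(relaxedContains(forefather, nominative));
    }
}
```

Strict mode would simply skip `relax` and compare the strings as entered (ideally still after Unicode normalization).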

lemtom commented 3 years ago

I tested with French, and it seems to handle both versions of é fine.

I've implemented a checkbox that strips the accents from both the search term and the saint name. So far I've been testing in French, since that's a language I actually know. With the checkbox unchecked, the search term "Melece" doesn't return "St. Mélèce"; with it checked, it does. I've also added a similar checkbox to ignore capitalization.
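As a rough sketch of how two such checkboxes might feed into the comparison (the class `NameMatcher` and its method names are hypothetical, and this is not the code in the branch): both strings are decomposed first, combining marks are removed only when diacritics are ignored, and case is folded only when capitalization is ignored.

```java
import java.text.Normalizer;
import java.util.Locale;

final class NameMatcher {
    /** Brings a string into the form used for comparison. */
    static String prepare(String s, boolean ignoreDiacritics, boolean ignoreCase) {
        // Decompose first so precomposed and decomposed input compare alike.
        String out = Normalizer.normalize(s, Normalizer.Form.NFD);
        if (ignoreDiacritics) {
            out = out.replaceAll("\\p{M}+", ""); // drop all combining marks
        }
        if (ignoreCase) {
            out = out.toLowerCase(Locale.ROOT);
        }
        return out;
    }

    /** True if the prepared name contains the prepared search term. */
    static boolean matches(String term, String name,
                           boolean ignoreDiacritics, boolean ignoreCase) {
        return prepare(name, ignoreDiacritics, ignoreCase)
                .contains(prepare(term, ignoreDiacritics, ignoreCase));
    }

    public static void main(String[] args) {
        String name = "St. M\u00E9l\u00E8ce"; // St. Mélèce
        System.out.println(matches("Melece", name, false, false)); // false
        System.out.println(matches("Melece", name, true, false));  // true
        System.out.println(matches("melece", name, true, true));   // true
    }
}
```

Applying the same `prepare` step to both sides keeps the two options independent of each other.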

The library I'm using (java.text.Normalizer) can probably normalize the Church Slavonic to some degree, but I'll have to find a way to handle the abbreviations (hardcoding them per your list, I guess) and the spelling differences related to word formation.

I'm fairly sure the normalization I've implemented so far can handle diacritical marks in Greek, though I'll have to find some examples to be certain.

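[image: https://user-images.githubusercontent.com/10900989/103242123-8194ea00-4955-11eb-839f-058e55da2c83.png]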

mamyt commented 3 years ago

For polytonic Greek, I can suggest the form ἅγιος (masculine form of holy). With diacritical marks stripped, it should also match the monotonic Greek form άγιος (and vice versa). If you need any help with the Church Slavonic, let me know and I can send you the required files.
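To illustrate why decomposition alone is not enough for this pair, here is a small sketch using the standard `java.text.Normalizer` (the class `GreekAccents` and its method are hypothetical). After NFD the polytonic form carries two combining marks (dasia and oxia) and the monotonic form only one, so the strings still differ until the marks are removed.

```java
import java.text.Normalizer;

final class GreekAccents {
    /** Decomposes the text and removes every combining mark (category M). */
    static String stripMarks(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        String polytonic = "\u1F05\u03B3\u03B9\u03BF\u03C2"; // ἅγιος
        String monotonic = "\u03AC\u03B3\u03B9\u03BF\u03C2"; // άγιος

        // Decomposed forms differ: alpha + dasia + oxia versus alpha + acute.
        boolean sameAfterNfd = Normalizer.normalize(polytonic, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(monotonic, Normalizer.Form.NFD));
        System.out.println(sameAfterNfd); // false

        // With the marks stripped, both reduce to the same bare letters.
        System.out.println(stripMarks(polytonic).equals(stripMarks(monotonic))); // true
    }
}
```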

There is also the question of Chinese normalisation between the two forms of Chinese: simplified and traditional. Can Java handle this or not? If it can, then we should enable it; otherwise there is little point in implementing it. An example to try: traditional: 格奧爾吉; simplified: 格奥尔吉 (both forms correspond to George in Chinese). Only the middle two characters are different.

Another question: do you only search the name of the commemoration or do you search any text in the corresponding html file?

lemtom commented 3 years ago

To easily test the cases you give me, I think I'm gonna extract some of the methods I've written to a utility class and write tests for them. I'll probably try to write tests for some of the existing classes as well later on.

I could use help with the Church Slavonic as well, since I can't even read Cyrillic. (I interpreted the і in your equivalent sets as the Latin i at first, and was looking into romanization. I know better now.) Do you know a good source for all the equivalent sets?

I'll implement normalization under the "strip diacritical marks" checkbox in languages that require it, and then the translation strings can be different to indicate it.

Currently I'm only searching the name, but I can easily add a checkbox to search the life (getLife()) as well.

mamyt commented 3 years ago

I can send you the information about equivalent sets and also all the abbreviations in Church Slavonic. Would you mind if I e-mailed the files directly to you? I do not wish them to be made public just yet. Would the e-mail address from your website work?

I think searching on the life as an option could be useful, especially if we are trying to weed out any errors that may be found in the texts.

lemtom commented 3 years ago

The e-mail on my website should work. My spam filter seems a bit overzealous (it caught e-mails from someone from a different project), so it might be prudent to reply here once you've mailed me, so I know when to check.

Searching through the life is now implemented: [image]

I currently have these test cases, based on your comments and my own test in French:

    // The first boolean is ignoreDiacritics; the second is ignoreCapitalization.
    @Disabled
    @Test
    void chineseCases(){
        assertTrue(searchName("格奥尔吉", "格奧爾吉", "lang", true, false));
    }

    @Test
    void greekCases(){
        assertTrue(searchName("άγιος", "ἅγιος", "gr", true, false));
        assertFalse(searchName("άγιος", "ἅγιος", "gr", false, false));
    }

    @Test
    void slavonicCases(){
        assertTrue(searchName("ѻ҆тє́цъ", "пра́ѻтецъ", "cu", true, false));
        assertFalse(searchName("ѻ҆тє́цъ", "пра́ѻтецъ", "cu", false, false));
    }

    @Test
    void frenchCases(){
        assertTrue(searchName("melece", "Mélèce", "fr", true, true));
        assertFalse(searchName("melece", "Mélèce", "fr", true, false));
        assertFalse(searchName("melece", "Mélèce", "fr", false, true));
    }

I've had to widen the set of characters I'm stripping to catch COMBINING CYRILLIC PSILI PNEUMATA (U+0486), but it's caught now.
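One possible explanation, offered as an assumption since the thread does not show the pattern that was used: the common idiom `\p{InCombiningDiacriticalMarks}` only covers the block U+0300 to U+036F, whereas the psili (U+0486) sits in the Cyrillic block. Matching on the general category `\p{M}` catches marks regardless of block. The class and method names below are hypothetical.

```java
import java.text.Normalizer;

final class MarkStripping {
    /** Strips only marks from the block U+0300..U+036F (a narrow idiom). */
    static String stripBasicBlockOnly(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    /** Strips every code point in general category M, whatever its block. */
    static String stripAllMarks(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // ѻ + psili (U+0486) + т + е + acute (U+0301) + ц + ъ
        String word = "\u047B\u0486\u0442\u0435\u0301\u0446\u044A";
        // The narrow pattern removes the acute but leaves the psili behind.
        System.out.println(stripBasicBlockOnly(word).length()); // 6
        // The category-based pattern removes both marks.
        System.out.println(stripAllMarks(word).length());       // 5
    }
}
```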

As expected, there's no easy way to switch from traditional to simplified Chinese and vice versa. There's a library that might handle this, but that seems a bit excessive for such a minor feature (and its documentation is in Chinese).
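For reference, a small check (a sketch with a hypothetical class name, using only the standard `java.text.Normalizer`) suggests that none of the four Unicode normalization forms bridges the two spellings of the example name: the traditional and simplified characters are separate code points with no decomposition linking them.

```java
import java.text.Normalizer;

final class ChineseVariants {
    /** True if any Unicode normalization form makes the two strings equal. */
    static boolean equalUnderAnyForm(String a, String b) {
        for (Normalizer.Form form : Normalizer.Form.values()) {
            if (Normalizer.normalize(a, form).equals(Normalizer.normalize(b, form))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String traditional = "\u683C\u5967\u723E\u5409"; // 格奧爾吉
        String simplified = "\u683C\u5965\u5C14\u5409";  // 格奥尔吉
        System.out.println(equalUnderAnyForm(traditional, simplified)); // false
    }
}
```

Any conversion would therefore need a character mapping from outside the JDK's normalizer, which matches the conclusion above.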