w3c / string-search

Parking lot for advice on internationalization related string searching in general content
https://w3c.github.io/string-search/

Requirements for Indian languages #10

Open vermaprashant1 opened 3 years ago

vermaprashant1 commented 3 years ago

TDIL (Technology Development for Indian Languages) has collated Indian-language requirements concerning Hindi variations (different keystrokes and spelling variations), with examples, that need to be addressed and reflected in the string-searching recommendations. Kindly guide us on further actions.

r12a commented 3 years ago

@vermaprashant1 could you provide a link to the requirements document you created, so that Addison and i can review it?

vermaprashant1 commented 3 years ago

@Richard, please refer to the draft requirements document ILanguage-requirement document-character-model.pdf, which covers Hindi as an initial language. It has been circulated to experts for their input. The document will be extended further to cover variations in other Indian languages apart from Hindi.

vermaprashant1 commented 2 years ago

@r12a, can you please share your feedback on this document?

vermaprashant1 commented 2 years ago

@r12a, please send your feedback on the shared document. We are also investigating the variations and rules for 5 additional languages and will share them soon.

r12a commented 2 years ago

@vermaprashant1 sorry it's taken me so long to get to this. Here are my comments.

[1] I think the document would be much clearer if at the beginning you separated out more cleanly the various ways in which words can be encoded differently, and then after that point out the consequences and proposed advice. I would start with a list of problem cases that would include the following:

  1. spelling variants such as the alternation between syllable-final /n/ or nasalisation (eg. the word Hindi) – note that spelling variants occur in most languages, and so it's something any search engine typically has to consider - what other common alternative spellings occur in Hindi besides LA vs LLA (which you mention almost in passing without any examples)? It would be good to have a list of at least the more common ones.
  2. the choice of characters to represent nuktas (with a little more detail) – this is a little complicated in Devanagari because normalisation produces different results for different visual combinations, see https://r12a.github.io/scripts/devanagari/#nukta_encoding
  3. inappropriate combinations that look the same visually – you don't mention these at all, but it's a significant issue for indic scripts. See examples of this for vowel-sign and independent vowel representation at https://r12a.github.io/scripts/devanagari/#vowelsign_encoding and https://r12a.github.io/scripts/devanagari/#vowelsign_encoding2
  4. any combinations of combining characters with a single base that can be typed and stored in an order that causes problems - often this is resolved during normalisation, but there are problematic cases that are not resolved by normalising the text - similar issues are motivating some folks involved with Unicode to produce rendering guidelines for Thai, Khmer and Arabic scripts - these advise reordering of specific sequences of characters so as to produce consistent ordering and ensure that the text renders correctly when displayed. Again, you don't mention any such combinations, and i haven't researched this either yet for Devanagari.
  5. Matching needs to decide what to do when format characters appear in text, eg. ZWJ, ZWNJ. In languages like Persian, these can affect the semantics of the text, but i suspect that in Devanagari that is not the case, and they can just be ignored. It's worth checking the full list of invisible characters that may appear in Devanagari text.
  6. Graphically similar but semantically different (confusable) code points - i would probably put the OM in this category.

Such an analysis would need to indicate which alternations in sequence are handled by normalisation. Normalisation should be expected as a given, always, before matching, so it's the ones that normalisation doesn't fix that we are particularly interested in.
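As an illustration of how normalisation treats nukta combinations differently, here is a minimal Python sketch using the standard `unicodedata` module. The two letters are chosen only as examples: NFC composes NA + nukta into the precomposed NNNA, whereas the precomposed QA is a composition exclusion and is decomposed into KA + nukta. The helper names `nfc` and `codepoints` are made up for this sketch.

```python
import unicodedata

def nfc(s: str) -> str:
    """Return the NFC form of s."""
    return unicodedata.normalize("NFC", s)

def codepoints(s: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

# NA + NUKTA composes to the precomposed letter NNNA under NFC.
na_nukta = "\u0928\u093C"
# Precomposed QA is a composition exclusion: NFC decomposes it to KA + NUKTA.
qa_precomposed = "\u0958"

print(codepoints(nfc(na_nukta)))        # U+0929
print(codepoints(nfc(qa_precomposed)))  # U+0915 U+093C
```

Either way, both spellings of each letter end up as the same sequence after normalisation, so these particular alternations are ones that normalisation does handle.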

It would be interesting to explore what equivalences need to be made for string matching of identifiers (eg. the HTML/CSS case) vs. full text search. For example, in English, spelling differences such as 'internationalization' vs. 'internationalisation' are not seen as equivalent for identifiers, and maybe the anusvara-conjunct alternation is the same. In full text search, however, searching for one should probably find the other.
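The distinction can be sketched as two different comparison keys. This is only an illustration under stated assumptions: the function names are hypothetical, the single folding rule (NA + virama before a dental consonant treated as a variant of anusvara) is a simplification that language experts would need to review and extend, and dropping ZWJ/ZWNJ reflects the suspicion above that they can be ignored for Devanagari rather than an established rule.

```python
import re
import unicodedata

ZWNJ, ZWJ = "\u200C", "\u200D"

# Assumption: NA + VIRAMA before a dental consonant (TA..NA, U+0924..U+0928)
# is a spelling variant of ANUSVARA. Other nasal classes are not covered.
_NASAL_CONJUNCT = re.compile("\u0928\u094D(?=[\u0924-\u0928])")

def identifier_key(s: str) -> str:
    """Strict key for identifiers: canonical normalisation only."""
    return unicodedata.normalize("NFC", s)

def search_key(s: str) -> str:
    """Loose key for full-text search (hypothetical folding rules)."""
    s = unicodedata.normalize("NFC", s)
    s = s.replace(ZWNJ, "").replace(ZWJ, "")  # ignore join controls
    return _NASAL_CONJUNCT.sub("\u0902", s)   # fold NA+VIRAMA to ANUSVARA

hindi_anusvara = "\u0939\u093F\u0902\u0926\u0940"        # हिंदी
hindi_conjunct = "\u0939\u093F\u0928\u094D\u0926\u0940"  # हिन्दी

print(identifier_key(hindi_anusvara) == identifier_key(hindi_conjunct))  # False
print(search_key(hindi_anusvara) == search_key(hindi_conjunct))          # True
```

Keeping the two keys separate means the looser full-text rules never leak into identifier matching.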

[2] Section 2.2.

> It is requires by the Unicode to store and interchanged the characters in the same logical order or we can say that order that user typed through the keyboards

The initial sentence gives the impression that the Unicode Standard requires that users type keys on the keyboard in a particular order. What the standard actually says is that the stored order of characters typically corresponds to the order in which they are typed, but there is no expectation at all about how the keyboard should actually function, as long as it produces an appropriate sequencing of characters in the end: combining marks after base characters, virama between conjunct parts, etc.

Given that, i'm not sure what point you want to make in section 2.2. Any decent keyboard should allow the user to produce good Unicode character sequences, and any kbd that doesn't should be avoided.
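To make the stored ordering in [2] concrete, here is a small Python sketch that lists the code points of two short Devanagari sequences in their stored order. The examples and the helper name `describe` are chosen only for illustration; the names come from Python's standard `unicodedata` module.

```python
import unicodedata

def describe(s: str) -> list[str]:
    """List each code point of s, in stored order, with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in s]

# हि: the vowel sign I is drawn to the left of HA but stored after it.
for line in describe("\u0939\u093F"):
    print(line)

# क्ष: the virama is stored between the two parts of the conjunct.
for line in describe("\u0915\u094D\u0937"):
    print(line)
```

However the keyboard is laid out, this is the sequencing it needs to produce in the end.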

[3] Are there different concerns for other languages using Devanagari? - eg. i'm thinking about the eyelash RA in Marathi.

[4] It would be very much easier for me to review your document if it was available in HTML, rather than PDF form. I'd be able to make annotations on the document for my reference, and i'd be able to copy-paste examples for exploration without the junk that PDF produces.

hope that helps.

vermaprashant1 commented 2 years ago

Dear Richard,

Greetings..

Thanks for sharing your valuable input. As the document was shared a long time ago, we have already revised the character model requirements document to add requirements for 5 more languages, collected from various language experts. We will also go through your comments and revise the documents accordingly wherever required. I will share the revision with you soon.

Thanks,

Prashant

(Sent by email in reply to r12a's comments of Wed, Feb 9, 2022, 7:37 AM.)

-- Thanks & Regards,

Prashant Verma I Program Manager Web Standardization Initiative(WSI) , MeitY New Delhi Cell : +91-8800521042 Website : http://tdil.meity.gov.in/WSI/AboutWSI.aspx http://tdil.mit.gov.in/WSI/AboutWSI.aspx

r12a commented 2 years ago

Sorry for the delay. I will look at your revised document. (Please point to an HTML file, if that's possible.)

In the meantime, could you take a look at https://github.com/w3c/iip/issues/119#issuecomment-1034349968 for me? Thanks.

vermaprashant1 commented 2 years ago

@r12a, please find the revised document, which covers the requirements and variations for 6 Indian languages.

aphillips commented 2 years ago

@vermaprashant1

Hello Prashant,

I was looking over your document today in preparation for taking up some of the text into string-search. I intend to add questions to this thread as I go through your document. For starters, I found this paragraph:

> Bengali, is one of the notorious languages with regard to spelling variation. The different spellings of word having same meaning are accepted in the Bengali language, it should be treated as different word although have same meaning. It has 5000+ words which record spelling variations.Typically, spelling variation ranges from 2 to 8 words.Majority of words have 2 variations; some have 3, 4, 8 and more variations. At least there is one word that records 16 spelling variations.Nearly 80% words show two spelling, 7% words show three variations, 7% words show four spellings, and 6% words show more than four variations.

Can you clarify that "different spellings of word... should be treated as different word" means that spelling variations should be treated as if they were different words (not matching)? I know that's what the sentence means, but want to be sure that this was your intention.

Do Bengali users expect document searches not to provide spelling variation matches at all? Or are there features in some programs for users to find such matches?

I will ask for other clarifications as I work through the document. Thank you so much for providing this information!!

r12a commented 2 years ago

> Can you clarify that "different spellings of word... should be treated as different word" means that spelling variations should be treated as if they were different words (not matching)?

I suspect that the word 'not' is missing between 'should' and 'be'.

asmusf commented 2 years ago

I would like to understand whether any of the alternate spellings or alternate code point sequences involve sequences that are listed as "do not use" in the Unicode standard. (Unfortunately, you'll need to read these from tables in the script chapter, they are not defined in any data files). For syntactic elements, editing tools etc. should probably flag any attempted use of "do not use" sequences.
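A checker of the kind suggested here could be sketched as follows. This is a hypothetical, deliberately incomplete illustration: the table holds only two entries (the Bengali sequence discussed in this thread and its Devanagari counterpart), and a real table would have to be transcribed by hand from the script chapters of the Unicode Standard and verified there. The names `DISCOURAGED` and `find_discouraged` are made up for this sketch.

```python
# Hypothetical, incomplete table: discouraged sequence -> preferred code point.
# Verify every entry against the script chapters of the Unicode Standard.
DISCOURAGED = {
    "\u0905\u093E": "\u0906",  # Devanagari A + vowel sign AA -> letter AA
    "\u0985\u09BE": "\u0986",  # Bengali A + vowel sign AA -> letter AA
}

def find_discouraged(text: str) -> list[tuple[int, str, str]]:
    """Return (offset, found sequence, preferred form) for each hit."""
    hits = []
    for seq, preferred in DISCOURAGED.items():
        start = text.find(seq)
        while start != -1:
            hits.append((start, seq, preferred))
            start = text.find(seq, start + 1)
    return sorted(hits)

sample = "\u0985\u09BE\u09AE"  # starts with the discouraged Bengali sequence
for offset, seq, preferred in find_discouraged(sample):
    print(f"offset {offset}: use U+{ord(preferred):04X} instead")
```

An editing tool could surface these hits as warnings, while a search tool might instead fold the discouraged sequence to the preferred form before matching.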

vermaprashant1 commented 2 years ago

@aphillips

Please find below the feedback received from a Bengali expert.

> Can you clarify that "different spellings of word... should be treated as different word" means that spelling variations should be treated as if they were different words (not matching)? I know that's what the sentence means, but want to be sure that this was your intention.

Yes! That is the argument. Because, in a text, you never know which spelling will be used by the text creator, and if your inbuilt system does not have all possible variants, then predicting the right spelling matches will be quite problematic.

> Do Bengali users expect document searches not to provide spelling variation matches at all? Or are there features in some programs for users to find such matches?

If a document search system can capture all possible variations of all the words that show spelling variations, there is no problem. The reality is that to date we have not come across any such system that can predict all possible variations of spelling. I have not even come across any database that records all possible spelling variations of Bengali words.

r12a commented 2 years ago

@aphillips in case it helps, it's much easier to understand what's going on here if you copy the Bengali examples to the bengali character app, then highlight the text line by line and click on Trans-literate. For a slightly deeper investigation, then click on Analyse text. This link will get you started.

r12a commented 2 years ago

@vermaprashant1 The section about Encoding variations lists for Bengali the circumgraph vowel signs which are canonically equivalent in Unicode. It doesn't however mention combinations which are not recommended, such as অা [U+0985 BENGALI LETTER A + U+09BE BENGALI VOWEL SIGN AA] instead of আ [U+0986 BENGALI LETTER AA]

There are a fair number of these in Indian scripts, esp. for letters with nukta. Is it something you think should be in the document? (I haven't looked closely yet at all the language sections.) This is a misspelling rather than an alternative spelling.
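The difference between the two kinds of variation can be checked with a short Python sketch using the standard `unicodedata` module (the helper name `nfc_equal` is made up for this illustration). The two-part spelling of a circumgraph vowel sign compares equal to the single code point after NFC, but the non-recommended A + vowel sign AA sequence does not compare equal to the letter AA.

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after NFC normalisation."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Canonically equivalent: KA + vowel sign O vs. KA + vowel signs E + AA.
print(nfc_equal("\u0995\u09CB", "\u0995\u09C7\u09BE"))  # True

# Not canonically equivalent: letter AA vs. letter A + vowel sign AA.
print(nfc_equal("\u0986", "\u0985\u09BE"))              # False
```

So normalising before matching takes care of the first case only; the second needs separate handling if it is to match.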

aphillips commented 2 years ago

> @aphillips in case it helps, it's much easier to understand what's going on here if you copy the Bengali examples to the bengali character app, then highlight the text line by line and click on Trans-literate. For a slightly deeper investigation, then click on Analyse text. This link will get you started.

Thanks Richard. Note that the localhost link isn't that helpful :-) but here is the correct link which works better.

aphillips commented 2 years ago

@vermaprashant1 Note: the document linked to here: https://tdil.meity.gov.in/WSI/ILs-variations.html is on a server with an expired certificate (it expired at midnight on 24 July), so I can't view it currently.

vermaprashant1 commented 2 years ago

> @vermaprashant1 The section about Encoding variations lists for Bengali the circumgraph vowel signs which are canonically equivalent in Unicode. It doesn't however mention combinations which are not recommended, such as অা [U+0985 BENGALI LETTER A + U+09BE BENGALI VOWEL SIGN AA] instead of আ [U+0986 BENGALI LETTER AA]
>
> There are a fair number of these in Indian scripts, esp. for letters with nukta. Is it something you think should be in the document? (I haven't looked closely yet at all the language sections.) This is a misspelling rather than an alternative spelling.

Here is the feedback received from the Bengali expert:

  1. These circumgraph vowel signs are typically known as vowel allographs. In Bengali, these are called 'svarachinha' ("vowel signs").
  2. In total, nine (9) vowel graphemes have these allographs: ā-kār, i-kār, ī-kār, u-kār, ū-kār, e-kār, ai-kār, o-kār, and au-kār.
  3. Each vowel allograph must be assigned a unique Unicode value.
  4. Vowel allographs are never combined with vowel graphemes. They can only be combined with consonants and clusters (conjuncts).
  5. আ (ā) is not a combination of অ (a) and া-কার (ā-kār). আ (ā) is a completely separate character with a unique Unicode value. Similarly, অ (a) is a separate character with another unique Unicode value. There should be no confusion regarding this.

We have not included non-recommended combinations in the document. It covers only alternative spellings/encodings and conventions that are actually used by a particular community.

aphillips commented 2 years ago

@vermaprashant1 Thanks for your reply. Note that the expiration of the certificate on the meity.gov.in server means we don't have access to the document. Would it be possible for you to send me a copy to use as a reference?

vermaprashant1 commented 2 years ago

Please find the document. [ILs-text_variations-final.pdf](https://github.com/w3c/string-search/files/9278962/ILs-text_variations-final.pdf)

vermaprashant1 commented 1 year ago

@r12a any update on this file?