kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

validate_data_dir.sh causes some errors for non-ascii characters #4157

Open sw005320 opened 4 years ago

sw005320 commented 4 years ago

I have encountered the following issue when using validate_data_dir.sh:

utils/validate_data_dir.sh: text contains 13489 lines with non-printable character

It happens with almost all non-English corpora. The error seems to depend on the Linux environment (probably the locale?): I had the problem in one environment but not in another, and multiple people have reported the same issue.

This check seems to have been introduced recently by the following changes; I did not encounter the issue before: https://github.com/kaldi-asr/kaldi/commit/e6729879ff1a4d677fd9efe3777308c8faca8130#diff-312b1ebd7cf78ac1a6fd63ebbe575800 https://github.com/kaldi-asr/kaldi/commit/31c2baeb53737f11f9a06a7ca9cc27d9b0e6438a#diff-312b1ebd7cf78ac1a6fd63ebbe575800

Since utils/validate_data_dir.sh is used in various places (e.g., it is called from copy_data_dir.sh), it is difficult to solve this just by adding --non-print.

aarora8 commented 4 years ago

I can work on this. I can try to reproduce it on the COE (where it happened), and based on suggestions from @danpovey, @jtrmal and @m-wiesner I can try to fix it.

johnjosephmorgan commented 4 years ago

While you're at it, could you look at #4144? I think it might be related.

o-alexandre-felipe commented 4 years ago

This issue is actually related to a check I introduced. The C++ code makes some assumptions about tokens (https://github.com/kaldi-asr/kaldi/blob/00625e85130ace8baa24ee1c59423268916919fa/src/util/text-utils.cc#L105); this check was supposed to ensure that the assumed condition holds for the text data. If better support for UTF-8 is required, I would suggest changing the tokenization as well.

danpovey commented 4 years ago

I think the issue is not that fundamental; rather, the locale C.UTF-8 is not always defined. In any case I think the easiest fix would be to simply remove the check. We already check for that kind of compatibility issue in utils/validate_text.pl, e.g. see validate_utf8_whitespaces (it checks that there are no whitespace characters that would not be treated as whitespace by the C++ code).
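
For illustration only, here is a rough Python sketch of the kind of check validate_utf8_whitespaces is described as doing (this is not the actual Perl implementation, and the helper name has_non_c_whitespace is made up here):

# Sketch: flag characters that Unicode treats as whitespace but that
# C-locale (byte-oriented) tokenization, which only breaks on space,
# \t, \n and \r, would keep inside a token.
C_WHITESPACE = {" ", "\t", "\n", "\r"}

def has_non_c_whitespace(line):
    return any(ch.isspace() and ch not in C_WHITESPACE for ch in line)

print(has_non_c_whitespace("hello\u00a0world"))  # True: U+00A0 NO-BREAK SPACE
print(has_non_c_whitespace("hello world"))       # False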

Actually, @o-alexandre-felipe, IIRC you never told us precisely what issue you were trying to solve. I should have asked though.


danpovey commented 4 years ago

I think the only things that IsToken() is rejecting are non-printable ASCII characters, and 255, which is already checked; ASCII and Unicode coincide for all of these.

I believe the bytes we want to reject are: bytes less than 0x1f except for bytes 0x9 and 0xA (tab and newline, respectively); byte 0x7F (DEL); and byte 0xFF (NBSP).

However, this should be implemented in validate_text.pl; the extra grep command and the extra option make it unclear and cause portability problems. None of the problematic characters can appear in UTF-8 except as themselves, so the check would be the same in ASCII and in UTF-8. Could perhaps be done with an array lookup for each byte?
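
A minimal sketch of that byte-lookup idea, in Python rather than the Perl of validate_text.pl (the byte set is the one described above; line_has_forbidden_byte is a hypothetical helper, not existing Kaldi code):

# Reject control bytes below 0x20 except tab (0x09) and newline (0x0A),
# plus DEL (0x7F) and 0xFF.
FORBIDDEN = [False] * 256
for b in list(range(0x00, 0x20)) + [0x7F, 0xFF]:
    FORBIDDEN[b] = True
FORBIDDEN[0x09] = FORBIDDEN[0x0A] = False

def line_has_forbidden_byte(line):
    return any(FORBIDDEN[b] for b in line)   # 'line' is a bytes object

# Bytes used by multi-byte UTF-8 sequences (0x80-0xF4) are never flagged,
# so the same table works for ASCII and UTF-8 text.
print(line_has_forbidden_byte(b"bad\x00token"))                 # True
print(line_has_forbidden_byte("ok \u00eatre".encode("utf-8")))  # False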

Yenda, do you have time?


jtrmal commented 4 years ago

yeah, I'll put it on my todo. ETA -- today. y.


jtrmal commented 4 years ago

Shinji, can you send me (privately) a repro? y.


kkm000 commented 4 years ago

@dpovey, @jtrmal, I am not familiar with any encoding that treats 0xFF as the non-breaking space. In fact, it is a letter in many ISO8859 encodings (either Latin y with diaeresis, ÿ, or a Cyrillic letter џ, except in some obscure but also standardized variants, where it's a ӵ). NBSP is 0xA0 in all of the ISO8859 tables, AFAICR, and that byte unfortunately may also occur in UTF-8. In KOI8-R, 0xFF is also a letter. If we are going to ignore the encoding, then we probably should not assume anything about it whatsoever, including 0x7F or 0xFF. Assumptions about symbols below 0x1F are likely fine.

jtrmal commented 4 years ago

I'm gonna work on it today. I was hoping I could get a repro from Shinji, because I believe these checks are already being done. Y.


kkm000 commented 4 years ago

Actually, I'm sure the error was caused by a UTF-8 byte. For example, the characters whose 3-byte UTF-8 encoding is En xx xx are all hieroglyphics from various scripts for n = {3..9}, and the lead byte En would fail the isascii test. The semantics of IsToken() is that it returns true iff its argument is a non-empty, whitespace-free string. Whitespace is an interesting concept (is form feed a whitespace?), but I think as a first-order approximation '\t', '\r', '\n' and ' ' are spaces, and everything else goes.
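
For example (a small Python illustration of the lead-byte point; not Kaldi code):

ch = "\u4e2d"                        # a CJK character
data = ch.encode("utf-8")            # b'\xe4\xb8\xad'
print([hex(b) for b in data])        # ['0xe4', '0xb8', '0xad']
print(all(b < 0x80 for b in data))   # False: every byte fails a byte-wise ASCII test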

danpovey commented 4 years ago

OK, character 255 is nbsp in ASCII and it doesn't occur in UTF-8. In any case it's disallowed.


danpovey commented 4 years ago

BTW, we made the decision some time ago to no longer support non-UTF-8 encodings other than ASCII. This is primarily because python3 makes it extremely difficult to deal with bytes as just bytes, as we were doing in the C++ and perl.
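
As a small, hedged illustration of that point (assuming a file in a legacy 8-bit encoding; not Kaldi code):

# A Latin-1 byte such as 0xE9 ('é') is not valid UTF-8 on its own, so
# Python 3 text handling refuses it, whereas Perl and the C++ code could
# simply pass the byte through untouched.
raw = b"caf\xe9"                 # 'café' in ISO-8859-1
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("not UTF-8:", e)
print(raw.decode("iso-8859-1"))  # works only if you know the encoding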


kkm000 commented 4 years ago

That generally makes sense; of course UTF-8 is a great thing. I'm only thinking of all the different encodings that could exist in old data sets. While we're talking about modern encodings, LDC still keeps such antiques as .sph files; how many recipes would this break? IMO, if you see a 0xFF, 9 times out of 10 you are dealing with a letter in ISO8859-nn or some other odd encoding.

danpovey commented 4 years ago

This decision was made a long time ago, and those old encodings should all be easily convertible to UTF-8. We weren't able to handle the python2->python3 upgrade process without doing this.


kkm000 commented 4 years ago

Yeah, it's a nightmare either way.

Then we must at least validate UTF-8 properly. And there are a dozen or more different characters with spacing semantics in Unicode, and not just obscure ones: some are actively used in e.g. Chinese writing.

BTW, I cannot find any evidence of 8-bit ASCII. If I'm to believe Wikipedia:

Extended ASCII (EASCII or high ASCII) character encodings are eight-bit or larger encodings that include the standard seven-bit ASCII characters, plus additional characters. Using the term "extended ASCII" on its own is sometimes criticized, because it can be mistakenly interpreted to mean that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, neither of which is the case.

There are many extended ASCII encodings (more than 220 DOS and Windows codepages).

And this is not even touching on Unicode normalization and locales... I hope we can escape that hell. Sorting is an issue, and even uppercase <-> lowercase conversions are very non-trivial. C.UTF-8 is nice, but apparently not available on the Mac? I'm just afraid of getting bogged down in this. If bash, Python and C++ have different ideas of how to order an array lexicographically, we're in trouble.

danpovey commented 4 years ago

Don't worry, it's mostly already done; validate_text.pl already checks for UTF-8 whitespaces. We just need to add the checks that I mentioned and it should be fine. Re lexicographical ordering: this is always done using the C locale in bash (that's why we always have LC_ALL=C), and we don't rely on sorting in Python.
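
To illustrate why the convention matters, a hedged Python sketch (locale-aware collation varies by system, so the second result is only indicative):

import locale

words = ["Zebra", "apple", "\u00c9clair"]
# Codepoint/byte order, which matches what `sort` gives under LC_ALL=C:
print(sorted(words))   # ['Zebra', 'apple', 'Éclair']

# A locale-aware collation would interleave case and accents differently,
# breaking lookups on files sorted the other way.
try:
    locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")   # may not exist on every system
    print(sorted(words, key=locale.strxfrm))              # typically ['apple', 'Éclair', 'Zebra']
except locale.Error:
    pass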


kkm000 commented 4 years ago

Yes, coincidentally, I read that exact script yesterday and noticed that it uses Perl's own UTF-8 facility to determine whether all the text decodes. No checks are performed at all if it does not.

I would really limit table keys and all the stuff that we heavily depend on being sortable and case-convertible (keys in scp/text files etc.) to 7-bit ASCII. About the rest, I would not worry so much, except for (a) the token-separation rules and (b) codepoints that have no visual representation but rather control the typographical representation of text.

Re (a): If we are going full Unicode, then I think we should support all the various spaces, because, like, a space is a space. I use a custom keyboard layout I made for myself with extra symbols on AltGr (right Alt), and my Alt+Space is a non-breaking space. I often use it when typing documents after 'a' or 'I', and between numbers and units, e.g. '5 m/s²'. Some word processors that care about good typography have automatic rules for such things (InDesign, IIRC, has it). Since text can come from a variety of sources, why should we prohibit it instead of treating it as a space? All we care about is tokens: non-empty sequences of codepoints free of word-separating codepoints (an interesting fact about Unicode is that it intentionally does not define "characters" or "symbols" as shapes, except in a non-normative annex, and instead uses shouty-case English descriptions). A 'ZERO-WIDTH SPACE', for example, is an important typographical mark in scripts that do not normally use spaces but allow a line break at this point; they still have "tokens" in the Kaldi sense, but are written without visible spaces. A good Lorem Ipsum example: https://en.wikipedia.org/wiki/Zero-width_space#Usage. Vary the browser width, and see how the text flows.
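
To make the disagreement concrete, a small Python illustration (this only shows generic str vs. bytes behaviour, not any specific Kaldi script):

text = "5\u00a0m/s\u00b2 and one\u200bword"   # NBSP after '5', ZERO-WIDTH SPACE inside 'oneword'
# Unicode-aware splitting treats NO-BREAK SPACE as a separator:
print(text.split())                  # ['5', 'm/s²', 'and', 'one\u200bword']
# Byte-level splitting (roughly what LC_ALL=C awk/perl do) does not:
print(text.encode("utf-8").split())  # [b'5\xc2\xa0m/s\xc2\xb2', b'and', b'one\xe2\x80\x8bword']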

Re (b), it's a group of codepoints that also supply typographic information and indicate points where words can be optionally broken. The only one I know of is U+00AD 'SOFT HYPHEN', which indicates a point at which an extrasesquipedalian word can be broken with a hyphenation mark at EOL in English text, or some equivalent in other scripts. These I think we could just silently remove without a warning; they are entirely optional.

There are additional invisible marks, "joiners" and "non-joiners", which indicate points where a token cannot be broken; they are used, by the way, even in French typography, so they are not some exotics. I'm not very familiar with their semantics. I'll read the standard; there is a whole section on them.

There are other things in Unicode such as the different normalization modes, which normalize sequences of codepoints for equality comparisons. I think this is probably beyond what we want to handle; if the transcript comes from a single source, it's unlikely to vary in the way characters are composed from multiple codepoints. E.g., in French text, len('être') is nearly certain to be 4. So dealing with stuff like the example below is probably out of scope. Python 3 and Perl support normalization, C++ probably does not, and I do not even want to think about what the various incarnations of sed and awk do.

Python 3.7.5 (tags/v3.7.5:5c02a39a0b, Oct 15 2019, 00:11:34)
>>> etre1 = '\u0065\u0302\u0074\u0072\u0065'
>>> etre2 =       '\u00EA\u0074\u0072\u0065'
>>> print(etre1, etre2)
être être
>>> len(etre1), len(etre2), etre1 == etre2
(5, 4, False)

Just in case what you see in your browser is not what I see in my browser, the original comment attached a screenshot of this output.

The Python docs show normalization examples, but there are at least 4 different modes of it defined in Unicode, and I would not jump headlong into it, lest you see me from condyles up for a couple months.
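
For reference, the decomposed form above can be composed with unicodedata (Python's standard-library module), though as noted this is probably out of scope for Kaldi:

import unicodedata
etre1 = "\u0065\u0302\u0074\u0072\u0065"   # 'e' + COMBINING CIRCUMFLEX ACCENT + 'tre'
etre2 = "\u00ea\u0074\u0072\u0065"         # precomposed 'ê' + 'tre'
print(unicodedata.normalize("NFC", etre1) == etre2)   # True: NFC composes the pair
print(unicodedata.normalize("NFD", etre2) == etre1)   # True: NFD decomposes it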

To summarize, it seems to me that the most reasonable default treatment of Unicode points would be:

  1. A space is a space is a space. Do not disallow non-breaking spaces; treat all spaces equally. Any spacing character, and only a spacing character, is a token-breaker. I do not understand why we must discriminate against the non-breaking space when there are so many various space types around, and their semantics is clearly the same: they separate "words", whatever that means in a particular script, if not in print then in an encoded file. Some scripts don't even use U+0020 for space at all.
  2. Remove certain codepoints from tokens silently on prep, and error if they are present on validation. I'll read the standard and see if anything besides U+00AD falls into this category. (A minimal sketch of rules 1 and 2 follows this list.)
  3. Very optionally, and I'd postpone this until lazier times: if prep or validation of data is always done with a language with solid, non-broken Unicode support (Perl qualifies; Python I don't know well enough, but likely), normalize on prep and error on validation. Otherwise, leave everything else as is.
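
A minimal sketch of what rules 1 and 2 could look like (hypothetical Python helper, not existing Kaldi code; rule 1 here relies on Python's notion of Unicode whitespace):

def tokenize(line):
    # Rule 2: silently drop U+00AD SOFT HYPHEN; it is purely typographic.
    line = line.replace("\u00ad", "")
    # Rule 1: any Unicode spacing character breaks tokens, including NBSP.
    # Note: ZERO-WIDTH SPACE is not whitespace to Python, so it stays inside a token.
    return line.split()

print(tokenize("5\u00a0m/s\u00b2 hy\u00adphen"))   # ['5', 'm/s²', 'hyphen']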

we don't rely on sorting in python.

The more PyTorch/TF we do, the more we will. IMO, limiting anything sortable that we depend on being sorted to 7-bit ASCII must be a hard rule.

danpovey commented 4 years ago

Kirill, I think you're making this more complicated/time-consuming than it needs to be. This is just a very small fix to avoid certain control characters.

The rationale for excluding NBSP is that it can lead to situations where (a) whether or not the user wanted it to be treated as a space is ambiguous, and (b) different Kaldi tools may treat it in different ways, e.g. there are many awk and perl scripts and we have LC_ALL=C, for good reasons.

There is no way to change the sorting rules (LC_ALL=C) or the rules about what is allowed / not allowed without major compatibility problems, so let's not waste time arguing about it.


kkm000 commented 4 years ago

If this is all about changing the grep line from

n_non_print=$(LC_ALL="C.UTF-8" grep -c '[^[:print:][:space:]]' $data/text)

to

n_non_print=$(LC_ALL=C grep -cv '[\t\x20-\x7E\x80-\xFE]' $data/text)

then I'm all for it.

Also, validate_text.pl is worth looking into. Its logic is currently incorrect, if I understand it correctly. In pseudocode:

if (file is correctly formed utf-8)
  if (decoded utf-8 contains (control characters except \t) or 0x7F or 0xFF)
    (the file is bad)
  else
    (the file is good)
else
  (the file is good)

Probably there is no point in even checking whether it is correctly formed UTF-8 or not. The same grep expression will do.

danpovey commented 4 years ago


I think the right place to do this would be in validate_text.pl, so a more informative error message can be printed; also perl is less susceptible to version issues than grep. But if you want to do it, changing the grep is OK for me.


You're right, even if not UTF-8 we should still check that there are no non-printable or whitespace ASCII characters except for space, \t and \n.


kkm000 commented 4 years ago

Yeah, I don't like the error message either. "1 line contains invalid characters, good luck finding it!" :)

o-alexandre-felipe commented 4 years ago

https://github.com/kaldi-asr/kaldi/issues/4157#issuecomment-662404569

@danpovey

From my thoughts on June 16th:

The pull request I was planning to submit was to check whether the text has unprintable characters in the validate_data_dir script. Otherwise the problem goes unnoticed until stage four of clean_and_segment_data_nnet3.sh, where it fails with this message:

steps/cleanup/lattice_oracle_align.sh: oracle ctm is in tmp/clean_long_utts/lattice_oracle/ctm
align-text '--special-symbol=***' ark:tmp/clean_long_utts/lattice_oracle/text ark:tmp/clean_long_utts/lattice_oracle/oracle_hyp.txt ark,t:-
ERROR (align-text[5.5.0~1-7a0b]:main():align-text.cc:97) In text1, the utterance 3784a364-9278-11e9-9809-023e0b39bc5e-00007500-00010500-1 contains unprintable characters. That means there is a problem with the text (such as incorrect encoding).
[ Stack-Trace: ]
align-text(kaldi::MessageLogger::LogMessage() const+0xb1a) [0x55e39d0ea8d8]
align-text(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x55e39d06729f]
align-text(main+0x53c) [0x55e39d065ed7]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f8a35abdb97]
align-text(_start+0x2a) [0x55e39d0658aa]
kaldi::KaldiFatalError
utils/scoring/wer_per_utt_details.pl: Note: handling as utf-8 text

stale[bot] commented 4 years ago

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.