binokaryg closed this issue 2 years ago
My understanding is that the parsing works as expected but the validation fails because this field has a valid length of 1-2 and the zero-width character adds to the length.
The proposed solution strips all zero-width Unicode characters out of the message before parsing and validation. I can't think of a situation where these would be valuable.
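A minimal sketch of the idea, not the code from the PR: the character set (ZWSP, ZWNJ, ZWJ, and the zero-width no-break space) and the helper names `stripZeroWidth` and `hasValidLength` are assumptions made for illustration.

```typescript
// Assumed set of zero-width code points: U+200B (ZWSP), U+200C (ZWNJ),
// U+200D (ZWJ) and U+FEFF (zero-width no-break space). The set handled by
// the actual fix may differ.
const ZERO_WIDTH = /[\u200B-\u200D\uFEFF]/g;

// Hypothetical helper: remove zero-width characters from an incoming message.
function stripZeroWidth(message: string): string {
  return message.replace(ZERO_WIDTH, "");
}

// Hypothetical validator for a field whose valid length is 1-2 characters.
function hasValidLength(value: string): boolean {
  return value.length >= 1 && value.length <= 2;
}

// Two visible digits followed by an invisible ZWJ.
const rawField = "12\u200D";

console.log(rawField.length);                          // 3
console.log(hasValidLength(rawField));                 // false
console.log(hasValidLength(stripZeroWidth(rawField))); // true
```

The field looks like two characters on screen but counts as three, which is why validation rejects it until the invisible character is removed.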
> validation fails because this field has a valid length of 1-2 and the zero-width character adds to the length
Yes, this could be the case for the other input fields. However, when these characters come with the form code, the text is regarded as a message (shown in the reports tab) instead of reports.
> However, when these characters come with the form code, the text is regarded as a message (shown in the reports tab) instead of reports.
Good point! I've updated the PR to strip zero-width characters from the form code too.
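To illustrate why the form code matters, here is a rough sketch under stated assumptions: the form code `D`, the list `KNOWN_FORM_CODES`, and the `classify` function are all hypothetical and only mimic the behaviour described above, where text whose first token is not a known form code is treated as a plain message.

```typescript
// Assumed set of zero-width characters; the real fix may cover a different set.
const INVISIBLE = /[\u200B-\u200D\uFEFF]/g;

// Hypothetical list of configured form codes.
const KNOWN_FORM_CODES: string[] = ["D"];

type Classification = "report" | "message";

// Hypothetical classifier: the first whitespace-separated token is looked up
// as a form code. Unknown codes fall back to a free-text message.
function classify(text: string, stripFirst: boolean): Classification {
  const cleaned = stripFirst ? text.replace(INVISIBLE, "") : text;
  const code = cleaned.trim().split(/\s+/)[0] || "";
  return KNOWN_FORM_CODES.indexOf(code) !== -1 ? "report" : "message";
}

// The form code is followed by an invisible ZWNJ (U+200C).
const incoming = "D\u200C 12345 2";

console.log(classify(incoming, false)); // "message"
console.log(classify(incoming, true));  // "report"
```

Without stripping, the first token is `D` plus an invisible character, so the lookup misses and the text is not recognised as a report.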
AT steps:
ज सद <patient_id> 2
with the phone number from step 2. This message contains an invisible Unicode character. Expected: The delivery code is accepted. Actual: The delivery code causes an error.
Ready for AT in 7654-strip-zero-width-characters-from-message
Looks good. Pinging @binokaryg to verify on his end.
Tried it from the AT branch.
The person was registered successfully. It looks good.
One minor caveat of this change is that if some text deliberately has zero-width characters, they would also be stripped out. These characters are very rarely used in general Nepali texting and the meaning and pronunciation remain the same, with or without them. For our use case, it might only be the name field that is potentially altered.
e.g. If a person named र्याले
(contains ZWJ) is registered, the name will look like र्याले
(without ZWJ).
(It's a made-up name, I can't think of any common name that uses the characters.)
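The caveat can be sketched as follows. The code point sequence is an assumption about how the made-up name above might be encoded (RA, VIRAMA, ZWJ, YA, and so on), and the stripped character set is likewise assumed rather than taken from the PR.

```typescript
// Assumed set of zero-width characters removed from incoming text.
const FORMATTING = /[\u200B-\u200D\uFEFF]/g;

// Assumed encoding of the made-up name:
// RA, VIRAMA, ZWJ, YA, vowel sign AA, LA, vowel sign E.
const nameWithZwj = "\u0930\u094D\u200D\u092F\u093E\u0932\u0947";

// What would be stored if the whole message is stripped before parsing.
const storedName = nameWithZwj.replace(FORMATTING, "");

console.log(nameWithZwj.length);         // 7
console.log(storedName.length);          // 6
console.log(storedName === nameWithZwj); // false
```

The stored name is one code unit shorter and may render with a different conjunct form, so an exact-match comparison against the originally typed name would fail.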
@binokaryg Thanks for the caveat. A few solutions I can think of...
@garethbowen
I guess this is good to merge ATM and meets your use case, @binokaryg. We could raise an improvement ticket later if needed.
Yes, it's good.
@njogz - can you see if we can get this into 3.16? It's passed AT, so should be a good candidate. If it doesn't make it, please put it back in 4.0. Thanks!
Merged to master.
@njogz If this is still wanted for 3.16.0 feel free to backport it!
Already reported in the CHT Forum
Describe the issue: Hidden characters in Unicode are causing some SMS reports to be considered invalid.
Describe the improvement you'd like: Ignore the hidden characters for fields such as form codes, numeric fields, dates, etc. We might need to research the Unicode subset ranges to filter out control characters, formatting characters etc. The most commonly encountered such characters in the Nepali (Devanagari) script so far are:
Describe alternatives you've considered: We had considered reaching out to the users and asking them not to type the hidden characters but the challenges were: