Ignore invisible characters in Unicode for certain SMS fields

binokaryg commented 2 years ago

Describe the issue Hidden characters in Unicode are causing some SMS reports to be considered invalid.

Describe the improvement you'd like Ignore the hidden characters for fields such as form codes, numeric fields, dates, etc. We might need to research the Unicode subset ranges to filter out control characters, formatting characters etc. The most commonly encountered such characters in the Nepali (Devanagari) script so far are:

U+200C: Zero-width non-joiner
U+200D: Zero-width joiner

Describe alternatives you've considered We had considered reaching out to the users and asking them not to type the hidden characters but the challenges were:

We are getting these messages (with invisible characters) from many users across multiple deployments.
Different users use different phone models. We are not sure how the hidden characters are typed.
Users themselves might not be aware that their message contains the invisible characters because they are invisible.
The message looks fine in the web app and it is not immediately clear that it contains invisible characters

garethbowen commented 2 years ago

My understanding is that the parsing works as expected but the validation fails because this field has a valid length of 1-2 and the zero-width character adds to the length.

The proposed solution strips all zero-width unicode characters out of the message before parsing and validation. I can't think of a situation where these would be valuable.

binokaryg commented 2 years ago

validation fails because this field has a valid length of 1-2 and the zero-width character adds to the length

Yes, this could be the case for the other input fields. However, when these characters come with the form code, the text is regarded as a message (shown in the reports tab) instead of reports.

garethbowen commented 2 years ago

However, when these characters come with the form code, the text is regarded as a message (shown in the reports tab) instead of reports.

Good point! I've updated the PR to strip zero-width characters from the form code too.

garethbowen commented 2 years ago

AT steps:

Load Sunsari configuration: https://github.com/medic/config-moh-nepal/blob/master/dho-sunsari-ne/
Create a CHW with a valid phone number
Register a new patient and find the patient ID (aka Medic ID)
Use the SMS testing interface to submit an SMS message: /admin/#/sms/test
Submit the message ज सद‍ <patient_id> 2 with the phone number from step 2. This message contains an invisible unicode character.

Expected: The delivery code is accepted. Actual: The delivery code causes an error.

garethbowen commented 2 years ago

Ready for AT in 7654-strip-zero-width-characters-from-message

ngaruko commented 2 years ago

Looks good. Pinging @binokaryg to verify on his end.

binokaryg commented 2 years ago

Tried it from the AT branch.

Sent: क‍ 21‌ अदृ‍श्य‍ (analyze)
Got: क 21 अदृश्य (analyze)

The person was registered successfully. It looks good.

One minor caveat of this change is that if some text deliberately has zero-width characters, they would also be stripped out. These characters are very rarely used in general Nepali texting and the meaning and pronunciation remain the same, with or without them. For our use case, it might only be the name field that is potentially altered. e.g. If a person named र्‍याले (contains ZWJ) is registered, the name will look like र्याले (without ZWJ). (It's a made-up name, I can't think of any common name that uses the characters.)

garethbowen commented 2 years ago

@binokaryg Thanks for the caveat. A few solutions I can think of...

Can you think of a way to distinguish between intended and unintended zero-width characters?
We could only replace zero-width characters in certain fields (form code, restricted fields) and leave freetext fields (patient name) untouched?
Or would a better solution be to try and match the form and delivery code as specified, and then if not found, trim the zero-width characters and try again?

binokaryg commented 2 years ago

@garethbowen

It's difficult to distinguish between them unless you look and compare the output visually.
I think this is the safest option for now.
The problem is not specific to form codes. It also applies to the numeric fields.

ngaruko commented 2 years ago

I guess this this is good to merge ATM and meet your use case @binokaryg . We could raise an improvement ticket later if needed.

binokaryg commented 2 years ago

Yes, it's good.

mrjones-plip commented 2 years ago

@njogz - can you see if we can get this into 3.16? It's passed AT, so should be a good candidate. If it doesn't make it, please put it back in 4.0. Thanks!

garethbowen commented 2 years ago

Merged to master.

@njogz If this is still wanted for 3.16.0 feel free to backport it!

medic / cht-core

Ignore invisible characters in Unicode for certain SMS fields #7654