moov-io / ach

ACH implements a reader, writer, and validator for Automated Clearing House (ACH) files. The HTTP server is available in a Docker image and the Go package is available.
https://moov-io.github.io/ach/
Apache License 2.0
452 stars 150 forks source link

Improve isAlphanumeric() Check With Regex #1448

Closed hshih-lead closed 1 month ago

hshih-lead commented 1 month ago

ACH Version

All

What were you trying to do?

Validation on some ach entries failed for us because they contain characters outside of the range of standard ASCII table. For example, ñ, which exists in the extended ASCII table and technically follows the NACHA rules. Similar issues have been raised in the past (#1261), as well as character being added to the set which added the character Ø.

This proposed change is to use range regex match instead of individual character match. For example, a regex of ^[\x20-\xFF]*$ will include not only all the characters mentioned above (ñ, ¦, Ø), but also more characters that are in the extended ASCII table, which users may encounter in the future.

Using a range regex should make the logic simpler (no ToUpper needed) and future proof.

What did you expect to see?

Validations passing for entries containing any characters that exist in the extended ASCII table.

adamdecaf commented 1 month ago

We'll need to benchmark a regex to see if it's fast enough. Can you give examples of additional characters that should be accepted?

hshih-lead commented 1 month ago

For us, we've only came across ñ, Ø, and others in the list so far, but it's hard to anticipate which exact character we might see next. Thus proposing the range check to be safe. Hopefully the speed of regex will be similar!

adamdecaf commented 1 month ago

Looking at the characters in that regex I think several aren't valid in Nacha files.

These all seem fine, they're ASCII

  ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~  €

These I'm not sure are valid in Nacha files, but we could refine the regex. (ASCII 129 through 160)

  ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ  Ž   ‘ ’ “ ” • – — ˜ ™ š › œ  ž Ÿ   

These look to be fine and mostly what you'd expect to be supported. Extended ASCII with a few symbols (that may not be valid in all ACH applications.

¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ 
adamdecaf commented 1 month ago

The regex you gave adds 300ns per isAlphanumeric call, which is reasonable for some usecases.

Can you give some more insight into where you're seeing these characters? I'm wondering if we should add overrides for accepting larger character sets or if a sensible default is to expand what's accepted.

hshih-lead commented 1 month ago

Looking at the characters in that regex I think several aren't valid in Nacha files.

These all seem fine, they're ASCII

  ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ � �

These I'm not sure are valid in Nacha files, but we could refine the regex. (ASCII 129 through 160)

 � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �   

These look to be fine and mostly what you'd expect to be supported. Extended ASCII with a few symbols (that may not be valid in all ACH applications.

¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ 

Make sense, I think there are some encoding variation for 129-160 that dictates whether they're printable or control characters. And I agree that 160 and above, like the latin characters you listed, are the ones we're expecting to be supported

hshih-lead commented 1 month ago

Will update with some examples shortly!

The regex you gave adds 300ns per isAlphanumeric call, which is reasonable for some usecases.

Can you give some more insight into where you're seeing these characters? I'm wondering if we should add overrides for accepting larger character sets or if a sensible default is to expand what's accepted.

hshih-lead commented 1 month ago

Here's a sample error message we got when validating entries (with confidential info removed). And for context, these are the ACH files we got from the Fed.

...
ach.FieldError IndividualName...ñ... has non alphanumeric characters: invalid character:...
...

And when I open the file with an editor, I'm seeing ñ. This tells me that we have some encoding inconsistency on our end, because ñ is 0xC3 0xB1 in UTF-8 (I used sublime, which uses UTF-8), while Ã(0xC3) and ±(0xB1) are ISO-8859-1.

So on our side, we have encoding issues to deal with. But regardless of the encoding scheme, it seems like we are getting Latin letters and special characters that's outside of the usual ASCII (127) range. Hope this example helps!

adamdecaf commented 1 month ago

It does, could you try go get github.com/moov-io/ach@master and see if that lets you decode those files?

hshih-lead commented 1 month ago

I tested with files containing all the characters mentioned above, seems like it worked without a problem. Thanks for the quick turnaround!

adamdecaf commented 1 month ago

Released in https://github.com/moov-io/ach/releases/tag/v1.40.4 - thanks for your help!