ivosh / node-nist

A simple low-level ANSI/NIST-ITL 1-2011 (update 2015) encoding and decoding utility library. Written in TypeScript for Node.
MIT License

WIN-1255 Encoding for Type 2 Fields #32

Closed: Doreen-Schwartz closed this issue 1 month ago

Doreen-Schwartz commented 2 months ago

Hello, I understand that currently the only supported encoding for fields other than Type 1 is UTF-8. I was wondering if there are any intentions to add support for additional encodings?

If not, I would love to try implementing this myself. I would appreciate pointers on how to approach this: which parts of the encoding process would require adjusting to enable this functionality?

From what I've gathered, the default Node Buffer currently used to construct the NIST file only supports a limited set of encodings. Looking at possible alternatives, iconv-lite seems like a decent package to use for this.
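For illustration, iconv-lite exposes the conversion as plain encode/decode calls (a rough sketch, not node-nist code):

```typescript
// Rough sketch of standalone iconv-lite usage; 'win1255' is one of the
// encoding names the package documents.
import * as iconv from 'iconv-lite';

const text = 'שלום';                                      // Hebrew sample
const bytes: Buffer = iconv.encode(text, 'win1255');      // one byte per Hebrew letter
const roundTrip: string = iconv.decode(bytes, 'win1255'); // back to 'שלום'
```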

@JonathanZahavii

ivosh commented 2 months ago

Hi @Doreen-Schwartz, this is an interesting subject. Honestly, throughout various law-enforcement agencies across the world I have not encountered such a requirement. Nevertheless, adding a conversion process into the library would be fine with me.

A question: how is the charset encoding conveyed between the NIST file producer and the NIST file consumer? Is there a special field in the NIST file? Or is it just by convention? That would somewhat influence the further discussion.

I don't want to tie node-nist to any specific conversion library (such as iconv-lite, node-iconv...). I am thinking of adding a new property (a callback function) to the NistCodecOptions interface and then using that instead of Buffer.toString and Buffer.write. If you are willing to prepare a PR, please do it in several steps:

  1. Propose the updated interface and example usage
  2. Once the interface is discussed/reviewed/agreed, flesh out the implementation, unit tests and documentation.

Doreen-Schwartz commented 2 months ago

As far as I have seen, there is no specific field that the consumer takes into consideration. We have an existing producer that creates NIST files which the consumer successfully reads, and I didn't notice any field carrying this information, neither in the file nor in the consumer's documentation. So it is just a convention for this specific consumer.

Just to give some context: when I tried uploading a NIST file produced with the node-nist library to our consumer, any Hebrew characters (the main reason we need non-ASCII characters) appeared as gibberish. Speaking to someone from the team that works on the consumer, they said it only accepts WIN-1255 encoding for Hebrew characters.
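To illustrate what the consumer sees (a rough sketch; the exact garbled output depends on the consumer's rendering):

```typescript
// UTF-8 encodes each Hebrew letter as two bytes, so a consumer that
// interprets those bytes as WIN-1255 renders two unrelated characters
// for every intended letter.
import * as iconv from 'iconv-lite';

const utf8Bytes = Buffer.from('שלום', 'utf8');      // 8 bytes for 4 letters
const misread = iconv.decode(utf8Bytes, 'win1255'); // 8 garbled characters
```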

As for the potential changes, I was thinking I could add callbacks to the NistFieldEncodeOptions and NistFieldDecodeOptions interfaces: one for writing fields into the buffer and one for decoding fields from the buffer.

Something like this, maybe:

```typescript
/** Encoding options for a single NIST Field. */
interface NistFieldEncodeOptions extends NistFieldCodecOptions {
  formatter?: (field: NistField, nist: NistFile) => NistFieldValue;
  informationWriter?: (informationItem: NistInformationItem, data: EncodeTracking) => EncodeTracking;
}
```

This would then be used in place of data.buf.write in encodeNistInformationItem, with a parallel change for decode, of course. Does this make sense?

Doreen-Schwartz commented 2 months ago

Another thing I noted is that Buffer.byteLength is used in a few places. I imagine this would also require some changes.
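For example, the byte length of a Hebrew string differs between the two encodings (a sketch assuming iconv-lite):

```typescript
// The length pass must count bytes in the target encoding:
// WIN-1255 stores Hebrew at one byte per character, UTF-8 at two.
import * as iconv from 'iconv-lite';

const value = 'שלום';
const utf8Length = Buffer.byteLength(value, 'utf8');         // 8
const win1255Length = iconv.encode(value, 'win1255').length; // 4
```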

ivosh commented 2 months ago

For the decoding logic, the proposed interface enhancement looks good, for example:

```typescript
/** Decoding options for a single NIST Field. */
export interface NistFieldDecodeOptions extends NistFieldCodecOptions {
  parser?: (field: NistField, nist: NistFile) => Result<NistFieldValue, NistParseError>;
  informationDecoder?: (buffer: Buffer, startOffset: number, endOffset: number) => string;
}
```
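An implementation of that hook could then be as small as (a sketch assuming iconv-lite and an exclusive endOffset):

```typescript
// Sketch of a WIN-1255 decoder matching the signature above
// (endOffset is assumed exclusive, as with Buffer.subarray).
import * as iconv from 'iconv-lite';

const win1255Decoder = (buffer: Buffer, startOffset: number, endOffset: number): string =>
  iconv.decode(buffer.subarray(startOffset, endOffset), 'win1255');
```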

For encoding logic, consider that the NistFile tree gets visited a couple of times: first to get the lengths of all the fields, records, the target buffer etc. right, and a second time to actually write the data into the buffer. So either you need two new properties (one for determining the length and another one for writing the converted data) or the new property should rather be along the generic lines of informationWriter?: (informationItem: NistInformationItem) => Buffer.
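With the generic shape, a single callback can serve both passes (a sketch; it assumes NistInformationItem is exported by the library and string-like):

```typescript
// Sketch of the generic Buffer-returning variant: the length pass can use
// result.length and the write pass can copy the same bytes, so the charset
// conversion lives in exactly one place.
import * as iconv from 'iconv-lite';
import { NistInformationItem } from 'node-nist'; // assumed export

const win1255Writer = (informationItem: NistInformationItem): Buffer =>
  iconv.encode(String(informationItem ?? ''), 'win1255');
```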

Performance-wise, I'd say a charset conversion is computationally comparable to Buffer.byteLength of an ASCII string. This means it's probably fine to perform the character conversion twice: first for determining the length and then for the serialization. In any case, you should do at least some rudimentary performance comparison between plain UTF-8 and WIN-1255-enabled workloads.
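Something along these lines would do for a first measurement (illustrative only; numbers are machine-dependent):

```typescript
// Rudimentary timing sketch for the comparison suggested above.
import * as iconv from 'iconv-lite';

const sample = 'שלום עולם '.repeat(1_000);
const runs = 10_000;

console.time('Buffer.byteLength (utf8)');
for (let i = 0; i < runs; i++) Buffer.byteLength(sample, 'utf8');
console.timeEnd('Buffer.byteLength (utf8)');

console.time('iconv.encode (win1255)');
for (let i = 0; i < runs; i++) iconv.encode(sample, 'win1255');
console.timeEnd('iconv.encode (win1255)');
```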

ivosh commented 1 month ago

@Doreen-Schwartz I've finished updating the documentation. Please check https://github.com/ivosh/node-nist/commit/847395761adede959c509f092aa0d7464ece5480. For consistency, I've renamed informationWriter to informationEncoder. I hope it's fine with you. Once I get a green light from you, I'll publish a new npm version. Thanks again for the PR.

Doreen-Schwartz commented 1 month ago

@ivosh Overall it looks great! The only thing I noticed is that in the JSDoc for informationEncoder, the parameters are the same as for formatter instead of the function's actual parameters.

Otherwise seems to cover things well 👍

ivosh commented 1 month ago

@Doreen-Schwartz Thank you for the review. I am always glad to have another pair of eyeballs ;-) New release 0.10.0 is out.