tdecaluwe / node-edifact

Javascript stream parser for UN/EDIFACT documents.
https://www.npmjs.com/package/edifact
Apache License 2.0

Doesn't support non-ascii symbols #8

Closed Magomogo closed 3 years ago

Magomogo commented 8 years ago

This is because of regular expressions like /[A-Z0-9.,\-()/= ]*/g used in the Validator. Any advice on how to solve this?
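To make the limitation concrete, here is a minimal sketch of how a character class like the one quoted above behaves on non-ASCII input. It is not the library's Validator code; the `isAccepted` helper and the anchored form of the pattern are assumptions made for illustration only.

```javascript
// The character class quoted in the issue, anchored so that it must
// cover the whole value (the anchoring is an assumption for this demo).
const asciiOnly = /^[A-Z0-9.,\-()\/= ]*$/;

// Hypothetical helper: returns true when every character is in the class.
function isAccepted(value) {
  return asciiOnly.test(value);
}

console.log(isAccepted('MUNCHEN')); // true
console.log(isAccepted('MÜNCHEN')); // false, 'Ü' is outside the class
```

Any character outside the listed ASCII ranges fails the match, which is why documents in ISO 8859 or Unicode encodings are affected.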

tdecaluwe commented 8 years ago

I'm aware of this limitation. I'm currently doing a large refactor of the code base, which includes support for different character sets (UNOA and UNOB, among others). You can find this work on the syntax-support branch right now.

tdecaluwe commented 8 years ago

What character sets are you using?

Magomogo commented 8 years ago

I've tried both iso8859 and UTF.

tdecaluwe commented 8 years ago

I merged basic support for this on the development branch. The parser now exposes an encoding(string) method which accepts a UN/EDIFACT encoding string. Right now this can be any of UNOA, UNOB, UNOC and UNOY.

For Latin script (ISO/IEC 8859-1), you'd need to do:

parser.encoding('UNOC');

To use Unicode you need UNOY. The other Latin character sets would also require an encoding translation, because they aren't subsets of Unicode, which is what Node uses internally. As such, I haven't added support for those yet.
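The point about Unicode subsets can be illustrated with a small sketch. ISO/IEC 8859-1 byte values coincide with the first 256 Unicode code points, so decoding is a one-to-one mapping. The sample bytes below are made up for illustration, and the snippet uses only Node's built-in Buffer, not the edifact package.

```javascript
// Three bytes that spell 'M', 'Ü', 'N' in ISO/IEC 8859-1 (illustrative data).
const bytes = Buffer.from([0x4d, 0xdc, 0x4e]);

// Node's 'latin1' decoding maps each byte straight to the code point
// with the same numeric value, so no translation table is needed.
const text = bytes.toString('latin1');

console.log(text); // MÜN
console.log(text.charCodeAt(1) === 0xdc); // true

// By contrast, byte 0xA1 means 'Ą' (U+0104) in ISO/IEC 8859-2, so that
// character set would need an explicit byte-to-code-point translation.
```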

All this is included in the latest npm package. Feel free to test!

Magomogo commented 8 years ago

Great work, thanks!

timkuijsten commented 8 years ago

Shouldn't this be automatically inferred from the UNB segment? (genuine question, I'm new to EDIFACT).

tdecaluwe commented 8 years ago

Yes, it should. I was working on this, which is why I haven't closed the issue yet. The parser starts in a special 'start of message' state which extracts the separators from the UNA segment if there is one. This is where this functionality should be added.

RovoMe commented 4 years ago

What's the status of this issue? Currently, version 1.2.8, fetched via npm install edifact, does not respect the UNOC encoding declared in the UNB segment, whereas calling parser.encoding("UNOC"); before parsing the document works. Unfortunately, this requires knowing up front which encoding the document declares. That could be done by looking up the value with a regular expression or by other means, but it might slow down the whole processing a bit.
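The up-front lookup described above could be sketched roughly as follows. The `detectSyntaxIdentifier` helper is hypothetical and not part of the edifact package. It assumes the document starts with an optional nine-character UNA segment followed by UNB, and that the syntax identifier is a four-character value of the form UNOx.

```javascript
// Hypothetical helper, not part of the edifact package: peeks at the
// start of a document and returns the syntax identifier (e.g. 'UNOC'),
// or null when it cannot be found.
function detectSyntaxIdentifier(document) {
  let elementSeparator = '+'; // default data element separator
  let offset = 0;

  // An optional UNA segment is 'UNA' plus six service characters; the
  // second service character is the data element separator.
  if (document.startsWith('UNA')) {
    elementSeparator = document[4];
    offset = 9;
  }

  // Skip any line breaks between UNA and UNB.
  const rest = document.slice(offset).trimStart();
  if (!rest.startsWith('UNB' + elementSeparator)) {
    return null;
  }

  // The syntax identifier is the first component of the first element.
  const identifier = rest.slice(4, 8);
  return /^UNO[A-Z]$/.test(identifier) ? identifier : null;
}

const sample = "UNA:+.? 'UNB+UNOC:3+SENDER+RECEIVER+200101:1200+1'";
console.log(detectSyntaxIdentifier(sample)); // UNOC
```

The returned value could then be passed to parser.encoding() before parsing. Since the helper only inspects the first few characters, the extra cost should be small compared with parsing the whole document.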

tdecaluwe commented 3 years ago

@RovoMe I'm going to close this issue as of 356b69533d13355b6fbf2a3ebbeb07f8d8bf837a. Automatic detection of the encoding from the UNB segment is now supported through the Reader class:

let reader = new Reader({ autoDetectEncoding: true });
let result = reader.parse(document);

The parse() method returns an array of segments. Each segment is an object containing a segment name and the list of elements as an array of arrays with the actual component data.
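As a rough illustration of that result shape, the sketch below builds a hand-written array and reads one component from it. The property names `name` and `elements`, the sample values and the `firstComponent` helper are all assumptions made for illustration; check the actual output of parse() before relying on them.

```javascript
// Hand-written stand-in for a parse() result. The property names
// `name` and `elements` are assumed, not confirmed by the thread.
const result = [
  { name: 'UNB', elements: [['UNOC', '3'], ['SENDER'], ['RECEIVER'], ['200101', '1200'], ['1']] },
  { name: 'UNZ', elements: [['0'], ['1']] }
];

// Hypothetical helper: first component of a given element in the
// first segment with the given name, or undefined when missing.
function firstComponent(segments, segmentName, elementIndex) {
  const segment = segments.find((s) => s.name === segmentName);
  if (!segment || !segment.elements[elementIndex]) {
    return undefined;
  }
  return segment.elements[elementIndex][0];
}

console.log(firstComponent(result, 'UNB', 0)); // UNOC
```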
