uwol / proleap-cobol-parser

ProLeap ANTLR4-based parser for COBOL
MIT License

Specifying different codepages for Cobolfiles in CobolPreprocessorImpl #23

Closed Reinhard-Prehofer closed 6 years ago

Reinhard-Prehofer commented 7 years ago

The current implementation of the CobolPreprocessor seems to support only the default (utf-8) character encoding for COBOL files. COBOL source files typically use single-byte character sets (SBCS) such as EBCDIC (ibm-1441), iso-8859-1(5) or win-1252, and would therefore need a code page conversion before running through the parser.

I would appreciate being able to parameterize the code page for COBOL sources, for example through an additional Charset parameter passed to the InputStreamReader:

    public InputStreamReader(InputStream in, Charset cs)

referring to your method:

    public String process(final File cobolFile, final List<File> copyFiles, final CobolSourceFormatEnum format,
            final CobolDialect dialect) throws IOException {
        LOG.info("Preprocessing file {}.", cobolFile.getName());

        final InputStream inputStream = new FileInputStream(cobolFile);
        final InputStreamReader inputStreamReader = new InputStreamReader(inputStream);
        final BufferedReader bufferedInputStreamReader = new BufferedReader(inputStreamReader);
        final StringBuffer outputBuffer = new StringBuffer();
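The request above could be sketched roughly as follows: the reader is constructed with an explicit Charset instead of relying on the default. This is only an illustration and not the project's code; the class name `CharsetAwareReader` and the method `readAll` are hypothetical, and the sample bytes are made up for the demo.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of proleap-cobol-parser.
final class CharsetAwareReader {

    // Reads the whole file, decoding its bytes with the given charset.
    // Line endings are normalized to '\n'.
    static String readAll(Path cobolFile, Charset charset) throws IOException {
        StringBuilder out = new StringBuilder();
        try (InputStream in = Files.newInputStream(cobolFile);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write a small ISO-8859-1 encoded sample, then read it back with the same charset.
        Path tmp = Files.createTempFile("sample", ".cbl");
        String source = "01 NAME PIC X(30) VALUE \"G\u00fcnter\".";
        Files.write(tmp, source.getBytes(StandardCharsets.ISO_8859_1));
        System.out.print(readAll(tmp, StandardCharsets.ISO_8859_1));
        Files.delete(tmp);
    }
}
```

Passing the charset in from the caller keeps the decision with whoever knows how the sources were stored, which is the point of the request.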

Changing only the code page of the COBOL sources, say from iso-8859 to utf-8, will lead to processing errors sooner or later. Just think of 01 Name Pic x(30) value "Günter Mörgän" and statements like if Name(2:1) = "ü". These work only in SBCS, not in DBCS. On the other hand, not converting the code page, and thus letting the parser interpret the characters as utf-8, will lead to the well-known garbled, misinterpreted characters.
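To make the Name(2:1) problem concrete, here is a small sketch that compares the byte layout of the same value in ISO-8859-1 and UTF-8. The helper `refMod` is hypothetical and only imitates COBOL reference modification on raw bytes; it is not taken from any COBOL runtime or from the parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
final class ReferenceModificationDemo {

    // Imitates NAME(start:length) on the raw bytes of a field; start is 1-based.
    static byte[] refMod(byte[] field, int start, int length) {
        return Arrays.copyOfRange(field, start - 1, start - 1 + length);
    }

    public static void main(String[] args) {
        // "Günter Mörgän", written with escapes so the demo does not depend on the source file encoding.
        String name = "G\u00fcnter M\u00f6rg\u00e4n";
        byte[] latin1 = name.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);

        System.out.println(latin1.length); // 13: one byte per character
        System.out.println(utf8.length);   // 16: each umlaut takes two bytes

        byte[] umlautLatin1 = "\u00fc".getBytes(StandardCharsets.ISO_8859_1);
        byte[] umlautUtf8 = "\u00fc".getBytes(StandardCharsets.UTF_8);

        // In ISO-8859-1 the second byte is the complete "ü".
        System.out.println(Arrays.equals(refMod(latin1, 2, 1), umlautLatin1)); // true
        // In UTF-8 the second byte is only the first half of "ü".
        System.out.println(Arrays.equals(refMod(utf8, 2, 1), umlautUtf8));     // false
    }
}
```

The field also no longer fits its declared size once the bytes are counted in UTF-8, so byte offsets behind the first umlaut shift as well.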

uwol commented 6 years ago

Correct, currently the parser reads COBOL files as UTF-8.

As of commit 1d63891d3c3b311a774a05b59a524d8a6d30d27c, a charset can be provided in the parameter object. I did not add a unit test, as I currently do not have access to EBCDIC-encoded files.

The fix assumes that all charsets you mentioned are provided by Java 8 according to the Charsets specification. If relevant charsets are missing, I am not sure whether additional libraries are required.
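Whether a given charset is available can be checked on the target JVM with `Charset.isSupported`. The following is a minimal sketch; the class name `CharsetAvailability` is hypothetical, and the names in `main` are only guesses at what the charsets discussed here might be called in Java, so the results depend on the JVM in use.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class CharsetAvailability {

    // Maps each requested charset name to whether this JVM supports it.
    // Syntactically illegal names are reported as unsupported.
    static Map<String, Boolean> check(String... names) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : names) {
            boolean supported;
            try {
                supported = Charset.isSupported(name);
            } catch (IllegalArgumentException e) {
                // IllegalCharsetNameException is a subclass of IllegalArgumentException.
                supported = false;
            }
            result.put(name, supported);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed names; adjust them to the code pages actually used by the sources.
        Map<String, Boolean> report = check("ISO-8859-1", "ISO-8859-15", "windows-1252", "IBM01141");
        report.forEach((name, ok) -> System.out.println(name + " supported: " + ok));
    }
}
```

Running this once on the deployment JVM shows which of the needed code pages can be passed to the parser without extra libraries.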

If the fix does not work or relevant charsets are missing, please reopen the issue. Thanks!