schorfES / node-lintspaces

A validator for checking different kinds of whitespaces in your files.
https://npmjs.com/package/lintspaces
MIT License
30 stars 17 forks source link

[EditorConfig] Unrecognized values in charset property throw exceptions. #48

Open EricDunsworth opened 6 years ago

EricDunsworth commented 6 years ago

Lintspaces doesn't claim to support .editorconfig's charset property, but appears to be reading it and attempting to use it in a way that can lead to exceptions.

An exception can occur if the charset property's value is unrecognized. Out of EditorConfig's 5 supported charsets, lintspaces seems to understand latin1, utf-8 and utf-16le, but errors out with utf-16be and utf-8-bom.

The same is also true for unset/invalid values. Btw I didn't cover the charset property in #47 since its exception would've prevented other warnings from appearing.

Would it be possible to either completely ignore the charset property or improve lintspace's handling of it?

Code/Output samples... The samples below demonstrate how lintspaces currently reacts to a charset = utf-8-bom property.

.editorconfig

root = true

[*]
charset = utf-8-bom

test.txt (BOM's presence doesn't matter)

test

check.js

var Validator = require('lintspaces');

var validator = new Validator({editorconfig: '.editorconfig'});
validator.validate('test.txt');

var results = validator.getInvalidFiles();

//Derived from https://stackoverflow.com/a/10729284
const util = require('util');

console.log(util.inspect(results, {showHidden: false, depth: null}));

Output when running node check.js in Windows:

internal/fs.js:21
    throw new Error(`Unknown encoding: ${encoding}`);
    ^

Error: Unknown encoding: utf-8-bom
    at assertEncoding (internal/fs.js:21:11)
    at getOptions (fs.js:80:5)
    at Object.fs.readFileSync (fs.js:549:13)
    at Validator._loadFile (C:\lintspaces-testing\node_modules\lintspaces\lib\Validator.js:123:18)
    at Validator.validate (C:\lintspaces-testing\node_modules\lintspaces\lib\Validator.js:66:8)
    at Object.<anonymous> (C:\lintspaces-testing\check.js:4:11)
    at Module._compile (module.js:652:30)
    at Object.Module._extensions..js (module.js:663:10)
    at Module.load (module.js:565:32)
    at tryModuleLoad (module.js:505:12)

Expected output of node check.js:

{}
schorfES commented 6 years ago

Hi @EricDunsworth, thank you for this detailed issue. This seems to be related to a unsupported encoding setting which is passed from editorconfig to readFileSync. I think lintspaces needs to check if the received encoding is a supported value for node's readFileSync(). I'm not clear about what will happen when lintspaces provides a fallback (utf-8-bom to utf-8) if the encoding isn't supported. I need to dive a bit deeper into it and will fix this problem in a future release. However, if you need a faster fix, I would be very happy to accept a PR to solve this.

Bartheleway commented 1 week ago

It seems Node itself doesn't support the "BOM" version of encodings.

Also looking at the code, it seems that this package try to read the file with the given encoding without trying to verify that the file is in this encoding. This could be achieve with ced for example. It use some C++ locally compiled source which feels bad when trying to reduce the number of dependencies to install (require node-gyp which requires some C++ compiler).

I assume having a fallback to make all the other checks is good as "BOM" flag doesn't change meaning of EOL, spaces, ending line.

According to editorconfig website, it supports : latin1, utf-8, utf-8-bom, utf-16be and utf-16le And Node seems to support : latin1, utf-8 and utf-16le

I am really not an expert of encoding. It seems safe to fallback from utf-8-bom to utf-8 and from utf-16be to utf-16le. What do you think ?

schorfES commented 1 week ago

We could consider using chardet instead. This library has no dependencies and does not require native code or bindings (see here https://www.npmjs.com/package/chardet#chardet). However, @Bartheleway makes a good point that we might need to either narrow down the list of supported encodings or attempt to open the file based on the encoding setting. Here’s an example approach:

const chardet = require('chardet');
const fs = require('fs');

// Path to the file you want to check
const filePath = 'path/to/your/file.txt';

// List of supported encodings
const supportedEncodings = ['latin1', 'utf-8', 'utf-16le'];

// Detect the file encoding
const encoding = chardet.detectFileSync(filePath);

// Check if the file encoding is supported
if (!supportedEncodings.includes(encoding)) {
  throw new Error('Unsupported encoding');
}

// Warn if the encoding does not match the settings
if (typeof validator.settings.encoding === 'string' && validator.settings.encoding !== encoding) {
  console.warn('Encoding mismatch: setting does not match file encoding');
}

// Read the file with the detected encoding
fs.readFileSync(filePath, encoding);
Bartheleway commented 1 week ago

I looked into chardet and it doesn't seems to detect BOM ... also I think we should report encoding error like package is doing for lines but I didn't find how to add an error to the whole file.

Bartheleway commented 1 week ago

Maybe we can use jschardet

schorfES commented 1 week ago

I'm not sure if we need to detect it (we could skip detection), as Node.js doesn't support it anyway. If chardet or jschardet detects an unsupported encoding, we could attempt to read the file with a warning (hint). If this fails, we would then report an error.

I didn't find how to add an error to the whole file.

This is currently not possible given the structure of our reporting. However, we could attach the hint/warning to the first line of the file as a workaround.

{
  "/path/to/file.ext": {
    "1": [
      {
        "line": 1,
        "code": "ENCODING_UNSUPPORTED",
        "type": "hint",
        "message": "Unsupported encoding of file."
      },
      {
        "line": 1,
        "code": "ENCODING_ERROR",
        "type": "warning",
        "message": "Can not read file."
      },
    ],
  }
}

Note: Lintspaces currently supports two types of messages: hint (info) and warning. An error level is not currently supported but can be introduced. The warning level currently is the highest available and may be considered equivalent to an error in other linting tools, such as ESLint.