Linter doesn't correctly determine charset when reading YAML.

garretwilson commented 3 years ago

Our polyglot project needs a Java library to validate YAML files to ensure they can be parsed by SnakeYAML and PyYAML downstream. This is the first Java YAML linting library I found, and I was about to try it out.

But before I can even use it, I'm disappointed to see that just loading a YAML file is essentially broken. One of the main Linter entry points incorrectly creates a string from bytes like this:

return run(new String(Files.readAllBytes(file.toPath())), conf, file);

The String(byte[] bytes) constructor should hardly if ever (I almost say "never") be called, because the charset uses the default system charset, which could be anything!

Both YAML 1.1 and YAML 1.2 say that the default charset if no BOM is present must be UTF-8. But the code above just uses the default system charset, which on Linux might (or might not) be UTF-8, but on Windows would likely not be UTF-8, meaning the same linting would fail on Windows even if it worked on Linux for a YAML file encoded in UTF-8.

Moreover the code above makes no allowance whatsoever for any Byte Order Mark (BOM). This means that a BOM for UTF-16 would probably just break on all systems. (This is arguably not as bad as the first problem of default UTF-8 working on some systems but not on others.) The code should be using one of the many libraries that checks to see if a BOM is present and determines the correct charset.

And lastly, why are all these operations performed in terms of File and String? A more general and useful input would be InputStream. (If you need to read all the bytes and convert to a String internally, you can, but the main entry point should be flexible.) A Java program might want to lint a file being streamed in from a site via an external URI, or loaded from the resources in the application (or Maven-based test suite). Sure, we can write extra code to read it all into a String, but the common expectation is that processors work from InputStream. (Just take a look at most XML parsers, for example.)

As I said I haven't tried it yet, and maybe the rest is 100% OK. I hope so! But at least the first two problems I've described are pretty egregious, and must be fixed. A linter isn't of much use if it isn't even parsing correctly.

sbaudoin commented 3 years ago

Hello,

Thanks for reporting. I'm not a Java guru so I may have done things wrong. The linter should be able to handle both files and strings, so my approach was first to load files into string then process the string. But I agree, this is something that can be improved. I'll try to do that.

garretwilson commented 3 years ago

@sbaudoin if you focused on the part about reading this into a string first, you missed the most important thing of all of what I was saying.

The bigger problem, here is that you're decoding the bytes incorrectly. This is a big bug.

I later looked at your unit tests, and saw that you have some unit tests for some non-ASCII characters that do not tests international characters at all, because you simply convert a string to bytes and back before it even gets to your library. Your handling of UTF-8 is broken. It may run on one system and break on another.

garretwilson commented 3 years ago

If you want to wait a few hours, I can either either file a pull request for this or at least give you some sample code to help out later in the day when I work on #20.

sbaudoin commented 3 years ago

I'll do the reverse: the MR is almost ready, I'll let you review it if you agree.

sbaudoin commented 3 years ago

@garretwilson you have the reference to the commit above. I rely on SnakeYAML's org.yaml.snakeyaml.reader.UnicodeReader class to read the (file) stream and handle the BOM.

garretwilson commented 3 years ago

That sounds exciting, @sbaudoin ! Give me about an hour or so from now and I will go start reviewing it.

sbaudoin / yamllint

Linter doesn't correctly determine charset when reading YAML. #19