Closed garretwilson closed 3 years ago
Hello,
Thanks for reporting. I'm not a Java guru so I may have done things wrong. The linter should be able to handle both files and strings, so my approach was first to load files into a string and then process the string. But I agree, this is something that can be improved. I'll try to do that.
@sbaudoin if you focused on the part about reading this into a string first, you missed the most important thing of all of what I was saying.
The bigger problem here is that you're decoding the bytes incorrectly. This is a big bug.
I later looked at your unit tests, and saw that you have some unit tests for non-ASCII characters that do not test international characters at all, because you simply convert a string to bytes and back before it even gets to your library. Your handling of UTF-8 is broken. It may run on one system and break on another.
If you want to wait a few hours, I can either file a pull request for this or at least give you some sample code to help out later in the day when I work on #20.
I'll do the reverse: the MR is almost ready, I'll let you review it if you agree.
@garretwilson you have the reference to the commit above. I rely on SnakeYAML's org.yaml.snakeyaml.reader.UnicodeReader class to read the (file) stream and handle the BOM.
That sounds exciting, @sbaudoin ! Give me about an hour or so from now and I will go start reviewing it.
Our polyglot project needs a Java library to validate YAML files to ensure they can be parsed by SnakeYAML and PyYAML downstream. This is the first Java YAML linting library I found, and I was about to try it out.
But before I can even use it, I'm disappointed to see that just loading a YAML file is essentially broken. One of the main Linter entry points incorrectly creates a string from bytes like this:

The String(byte[] bytes) constructor should hardly if ever (I almost say "never") be called, because it decodes using the default system charset, which could be anything!

Both YAML 1.1 and YAML 1.2 say that the default charset if no BOM is present must be UTF-8. But the code above just uses the default system charset, which on Linux might (or might not) be UTF-8, but on Windows would likely not be UTF-8, meaning the same linting would fail on Windows even if it worked on Linux for a YAML file encoded in UTF-8.
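
To make the charset point concrete, here is a minimal sketch of decoding with an explicit charset instead of the platform default. The class and method names (Utf8Decoding, decodeLenient, decodeStrict) are made up for illustration and are not part of the linter; only the JDK calls are real.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the linter's API.
final class Utf8Decoding {

    // Always decodes as UTF-8, regardless of the system default charset.
    // Malformed byte sequences are silently replaced with U+FFFD.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Also decodes as UTF-8, but reports malformed input instead of
    // replacing it, which is probably what a linter wants.
    static String decodeStrict(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // CharacterCodingException is an IOException; rethrown unchecked
            // only to keep this sketch short.
            throw new UncheckedIOException(e);
        }
    }
}
```

Either variant gives the same result on Linux and Windows, which new String(bytes) alone does not guarantee.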
Moreover, the code above makes no allowance whatsoever for any Byte Order Mark (BOM). This means that a file starting with a UTF-16 BOM would probably just break on all systems. (This is arguably not as bad as the first problem of default UTF-8 working on some systems but not on others.) The code should be using one of the many libraries that check whether a BOM is present and determine the correct charset.
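
For illustration only, this is roughly what BOM sniffing involves; an existing library would normally be preferable to hand-rolled code. BomSniffer and its members are hypothetical names. The sketch falls back to UTF-8 when no BOM is found and does not attempt to detect BOM-less UTF-16 or UTF-32 input.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the linter's API.
final class BomSniffer {

    // The detected charset plus the number of BOM bytes to skip.
    static final class Detected {
        final Charset charset;
        final int bomLength;

        Detected(Charset charset, int bomLength) {
            this.charset = charset;
            this.bomLength = bomLength;
        }
    }

    static Detected detect(byte[] b) {
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) {
            return new Detected(StandardCharsets.UTF_8, 3);
        }
        // UTF-32 is checked before UTF-16 because the UTF-32LE BOM begins
        // with the same two bytes as the UTF-16LE BOM. UTF-32 is not among
        // the charsets every JVM must support, so it is looked up by name.
        if (startsWith(b, 0x00, 0x00, 0xFE, 0xFF)) {
            return new Detected(Charset.forName("UTF-32BE"), 4);
        }
        if (startsWith(b, 0xFF, 0xFE, 0x00, 0x00)) {
            return new Detected(Charset.forName("UTF-32LE"), 4);
        }
        if (startsWith(b, 0xFE, 0xFF)) {
            return new Detected(StandardCharsets.UTF_16BE, 2);
        }
        if (startsWith(b, 0xFF, 0xFE)) {
            return new Detected(StandardCharsets.UTF_16LE, 2);
        }
        // No BOM: use UTF-8 as the default.
        return new Detected(StandardCharsets.UTF_8, 0);
    }

    // Decodes the bytes after the BOM (if any) with the detected charset.
    static String decode(byte[] b) {
        Detected d = detect(b);
        return new String(b, d.bomLength, b.length - d.bomLength, d.charset);
    }

    private static boolean startsWith(byte[] b, int... prefix) {
        if (b.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((b[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```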
And lastly, why are all these operations performed in terms of File and String? A more general and useful input would be InputStream. (If you need to read all the bytes and convert to a String internally, you can, but the main entry point should be flexible.) A Java program might want to lint a file being streamed in from a site via an external URI, or loaded from the resources in the application (or Maven-based test suite). Sure, we can write extra code to read it all into a String, but the common expectation is that processors work from InputStream. (Just take a look at most XML parsers, for example.)

As I said I haven't tried it yet, and maybe the rest is 100% OK. I hope so! But at least the first two problems I've described are pretty egregious, and must be fixed. A linter isn't of much use if it isn't even parsing correctly.
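
On the InputStream point, a rough sketch of the shape I have in mind: one core method that takes an InputStream, with the file variant as a thin convenience wrapper. YamlSourceSketch and load are hypothetical names, not the library's API, and this version assumes UTF-8 with no BOM handling to keep it short.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical entry-point shape, not the linter's API.
final class YamlSourceSketch {

    // Core method: works for files, classpath resources, URL streams, etc.
    // The caller owns the stream and is responsible for closing it.
    static String load(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        // Assumption for this sketch: UTF-8, no BOM detection.
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // Convenience overload that delegates to the stream-based method.
    static String load(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return load(in);
        }
    }
}
```

A test could then pass getClass().getResourceAsStream(...) straight to the same method without first copying the resource into a file or a String.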