(HC-82) Enable UTF-8 with BOM parsing

Iristyle commented 8 years ago

Previously Hocon::ConfigFactory.parse_file did not specify an encoding, and didn't allow for files with UTF-8 BOMs on Windows.

In reality, HOCON config files should be detected by their BOM and treated as UTF-8, UTF-16LE, UTF-16BE, UTF-32LE or UTF-32BE based on the presence of the BOM, with a fallback to UTF-8 when one is not present, based on RFC 4627 at https://www.ietf.org/rfc/rfc4627.txt

This fix is a bit naive as it may improperly load HOCON config files on Windows which are UCS-2 (a precursor to UTF-16LE). Its recommended that this be addressed later in a better File parsing scheme that peeks at the first few bytes of the file to determine the encoding correctly.

Iristyle commented 8 years ago

@DavidS here you go... note that I think longer-term we need a better fix, but this should work for now.

/cc @LarissaLane

cprice404 commented 8 years ago

@Iristyle should we go ahead and un-skip the other bom tests if they are passing anyway? While we're in here?

cprice404 commented 8 years ago

Also can you give us any info on the sense of urgency for getting a release out with these changes?

Iristyle commented 8 years ago

@cprice404 I was wrong about those tests - they are still failing. However, I question the usefulness of 1 of the skipped tests that are remaining, so if you want to yank it in a follow-up commit:

acceptBOMStartOfStringConfig

it "acceptBOMStartOfStringConfig" do
    skip("BOM not parsing properly yet; not fixing this now because it most likely only affects windows") do
      # BOM at start of file is just whitespace, so ignored
      conf = Hocon::ConfigFactory.parse_string("\uFEFFfoo=bar")
      expect(conf.get_string("foo")).to eq("bar")
    end
  end

I think it's safe to assume that BOM artifacts of files have been handled by the time an encoded string has been passed to parse_string, so I think this one can be deleted.

acceptBOMWhitespace

  it "acceptBOMWhitespace" do
    skip("BOM not parsing properly yet; not fixing this now because it most likely only affects windows") do
      # BOM here should be treated like other whitespace (ignored, since no quotes)
      conf = Hocon::ConfigFactory.parse_string("foo= \uFEFFbar\uFEFF")
      expect(conf.get_string("foo")).to eq("bar")
    end
  end

I thought this one was perhaps unnecessary as well... but... Ruby's JSON.parse in 2.1.9 deals with that just fine:

[1] pry(main)> json = '{ "foo": "\uFEFFbar\uFEFF" }'
=> "{ \"foo\": \"\\uFEFFbar\\uFEFF\" }"
[2] pry(main)> require 'json'
=> true
[3] pry(main)> JSON.parse(json)
=> {"foo"=>"bar"}

So probably still want to leave that one in, even though I'd think it's fairly edge case.

cprice404 commented 8 years ago

@Iristyle ok, thanks.

Is this blocking things? Should we prioritize getting a new release out urgently?

Iristyle commented 8 years ago

I don't believe this is blocking anything - it's more of a nuisance issue. But @DavidS / @hunner can probably chime in with more specifics since they're working on the Azure module. I hopped in just because I knew right away how to handle the problem. Also, to be fair, aside from YAML files, I don't believe we handle BOMs that well in Puppet itself either.

I added https://tickets.puppetlabs.com/browse/PUP-6447 a bit ago to increase our tolerance on Windows for BOMs so that Windows users don't have to jump through hoops to remove them -- but the discovery and actual work still need to be done.

cprice404 commented 8 years ago

I created HC-83 to capture the work of review/merge/release. We will bring that into our next sprint unless it is needed more urgently.

Iristyle commented 8 years ago

I added some additional UTF-8 / UTF-16 specs - will probably file some tickets to cover those missing pieces (though I don't think implementation work is urgent)

cprice404 commented 8 years ago

:see_no_evil:

jpinsonault commented 8 years ago

I'm 👍 asides from the comments on the expected failure test

Thanks for doing this, I had no idea windows was affected by issues like this. Even if it doesn't totally fix everything, it seems like a significant improvement

Iristyle commented 8 years ago

All updated @cprice404 / @jpinsonault

jpinsonault commented 8 years ago

👍 looks good

jpinsonault commented 8 years ago

This has gone out in hocon 1.1.2

puppetlabs / ruby-hocon

(HC-82) Enable UTF-8 with BOM parsing #82

acceptBOMStartOfStringConfig

acceptBOMWhitespace