Open windmueller opened 6 years ago
Okay, I found the cause of this issue. QuickBlobCharsetDetector uses only the first 8,000 bytes, which is not enough for some of our classes. If the first 8,000 bytes are ASCII but an ISO-8859-1 character appears somewhere in the rest of the file, the detector will return UTF-8.
I "fixed" this by changing the charset order in the detector, but that is definitely not useful for other repositories.
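To illustrate the failure mode, here is a minimal Java sketch of prefix-based detection. It is not BFG's actual QuickBlobCharsetDetector code. The class name, the candidate order, and the strict-decode strategy are all assumptions made for this example. It shows how a file whose only non-ASCII byte lies beyond the inspected prefix is classified as UTF-8, and one plausible way the bytes could then change on a decode/encode round trip.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; not taken from the BFG code base.
class PrefixCharsetDetection {
    // Candidate charsets in priority order (an assumption for this sketch).
    static final Charset[] CANDIDATES = {
        StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1
    };

    /** Returns the first candidate that strictly decodes the first `limit` bytes. */
    static Charset detect(byte[] data, int limit) {
        int n = Math.min(limit, data.length);
        for (Charset cs : CANDIDATES) {
            try {
                cs.newDecoder()
                  .onMalformedInput(CodingErrorAction.REPORT)
                  .onUnmappableCharacter(CodingErrorAction.REPORT)
                  .decode(ByteBuffer.wrap(data, 0, n));
                return cs; // prefix decoded without errors
            } catch (CharacterCodingException e) {
                // prefix is not valid in this charset; try the next candidate
            }
        }
        return StandardCharsets.ISO_8859_1; // fallback: accepts any byte
    }

    /** 9,000 ASCII bytes followed by one ISO-8859-1 byte (0xE4, 'ä'). */
    static byte[] sample() {
        byte[] data = new byte[9001];
        Arrays.fill(data, (byte) 'a');
        data[9000] = (byte) 0xE4;
        return data;
    }

    /** Decodes and re-encodes with the same charset, as a text rewrite would. */
    static byte[] roundTrip(byte[] data, Charset cs) {
        return new String(data, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        byte[] data = sample();
        Charset fromPrefix = detect(data, 8000);
        Charset fromWhole = detect(data, data.length);
        System.out.println("prefix only: " + fromPrefix);
        System.out.println("whole file:  " + fromWhole);
        System.out.println("bytes preserved with prefix result: "
            + Arrays.equals(data, roundTrip(data, fromPrefix)));
        System.out.println("bytes preserved with whole-file result: "
            + Arrays.equals(data, roundTrip(data, fromWhole)));
    }
}
```

In this sketch, inspecting the whole file (or trying ISO-8859-1 first) avoids the misdetection, but a different candidate order would misclassify genuine UTF-8 files, which matches the concern that reordering is not a general fix.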
BTW, sbt assembly reports failed tests on the unchanged master, is this normal?
> BTW, sbt assembly reports failed tests on the unchanged master, is this normal?
@stovocor Nope, master build is passing for me and on Travis.
Which commit are you building, using which command and what is the test-failure output?
> Which commit are you building,
Commit 94225fdc6dfea01bc1de7517f5edc4c7ac81d7fd (Tag v1.12.16)
> using which command
% sbt assembly
> and what is the test-failure output?
[info] Size formatter
[info] - should correctly format FAILED
[info] "1[,]0 KB" was not equal to "1[.]0 KB" (ByteSizeSpecs.scala:55)
I attached the complete output.
It looks like the test case depends on the current locale (German in my case).
Tests don't like your locale. Try again with LANG=en_US.UTF-8.
That works as expected. However, in my opinion the test case should be agnostic to the locale environment.
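For illustration, the locale dependence can be reproduced in a few lines of Java. This is a sketch only: the helper name formatKb and the format pattern are assumptions, and the real size formatter in ByteSizeSpecs.scala may be implemented differently. The point is that passing an explicit locale makes the output independent of the LANG setting.

```java
import java.util.Locale;

// Hypothetical helper; not the formatter used by the BFG test suite.
class LocaleAgnosticFormat {
    /** Formats a size in kilobytes using the decimal separator of `locale`. */
    static String formatKb(double kb, Locale locale) {
        return String.format(locale, "%.1f KB", kb);
    }

    public static void main(String[] args) {
        // A German locale uses a decimal comma, as in the failing test output.
        System.out.println(formatKb(1.0, Locale.GERMANY)); // 1,0 KB
        // A fixed, neutral locale gives the same result on every machine.
        System.out.println(formatKb(1.0, Locale.ROOT));    // 1.0 KB
    }
}
```

Either the formatter or the test could pin the locale this way; which one is appropriate depends on whether localized output is intended for end users.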
For those who struggle with CRLF / LF processing, you might want to use my build of BFG: https://github.com/vlsi/bfg-repo-cleaner/releases/tag/v1.14.0-vlsi
I have implemented CRLF / LF normalization that works for me and does not require a single encoding across all files in the repository.
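One way to make line-ending normalization independent of the text encoding is to operate on raw bytes and never decode the content. The following Java sketch shows that idea. It is not the code from the linked build, and the class and method names are made up for this example. It assumes an ASCII-compatible encoding such as ISO-8859-1 or UTF-8, where the bytes 0x0D and 0x0A only ever mean CR and LF. It would not be correct for UTF-16.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration of byte-level CRLF -> LF normalization.
class CrlfNormalizer {
    /** Drops every CR that is immediately followed by LF; all other bytes are copied unchanged. */
    static byte[] crlfToLf(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        for (int i = 0; i < in.length; i++) {
            boolean crBeforeLf =
                in[i] == '\r' && i + 1 < in.length && in[i + 1] == '\n';
            if (!crBeforeLf) {
                out.write(in[i]);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 'a', ISO-8859-1 'ä' (0xE4), CRLF, 'b', LF
        byte[] input = {'a', (byte) 0xE4, '\r', '\n', 'b', '\n'};
        byte[] output = crlfToLf(input);
        System.out.println("input length:  " + input.length);
        System.out.println("output length: " + output.length);
    }
}
```

Because no charset is involved, non-ASCII bytes such as 0xE4 pass through untouched, and a lone CR that is not followed by LF is kept as-is.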
We are in the process of shrinking a large repository while also replacing CRLF line endings with LF. For this, we use the text replacement option:
$ java -jar bfg-1.12.16.jar --no-blob-protection --replace-text replacements.txt repo
The replacements file contains a single line:
regex:\r(\n)==>$1
However, this has the side effect that the encoding of some files changes. For example, a file in the original repository is reported as
ISO-8859 text, with CRLF line terminators
by the Unix file command, while the "cleaned" repository contains the file as
UTF-8 Unicode text
A hexdump confirms that the encoding has indeed changed. Providing the option
-Dfile.encoding=ISO-8859-1
does not change anything.