Replacing text changes file encoding

rtyley / bfg-repo-cleaner

Removes large or troublesome blobs like git-filter-branch does, but faster. And written in Scala

https://rtyley.github.io/bfg-repo-cleaner/

GNU General Public License v3.0


Replacing text changes file encoding #251

Open windmueller opened 6 years ago

windmueller commented 6 years ago

We are in the process of shrinking a large repository while also converting CRLF line endings to LF. For this, we use the text replacement option:

$ java -jar bfg-1.12.16.jar --no-blob-protection --replace-text replacements.txt repo

The replacements file contains a single line:

regex:\r(\n)==>$1

However, this has the side effect of changing the encoding of some files. For example, a file in the original repository is reported as

ISO-8859 text, with CRLF line terminators

by the Unix file command, while the "cleaned" repository contains the file as

UTF-8 Unicode text

A hexdump confirms that the encoding has indeed changed. Providing the option -Dfile.encoding=ISO-8859-1 does not change anything.

windmueller commented 6 years ago

Okay, I found the cause of this issue. QuickBlobCharsetDetector only examines the first 8,000 bytes, which is not enough for some of our classes. If the first 8,000 bytes are ASCII but an ISO-8859-1 character appears somewhere in the rest of the file, the detector returns UTF-8.
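The failure mode described here can be reproduced with a small Python sketch. This is not BFG's Scala code: the function `detect_charset`, the candidate order, and the assumption that undecodable bytes are replaced with U+FFFD during decoding are all illustrative guesses. Only the 8,000-byte window and the regex come from this thread.

```python
import re

PREFIX_LIMIT = 8000  # window size as described in the comment above
DEFAULT_CANDIDATES = ("utf-8", "iso-8859-1")  # assumption: UTF-8 is tried first


def detect_charset(blob, limit=PREFIX_LIMIT, candidates=DEFAULT_CANDIDATES):
    """Return the first candidate that decodes the first `limit` bytes cleanly."""
    prefix = blob[:limit]
    for name in candidates:
        try:
            prefix.decode(name)  # strict decoding: raises on invalid bytes
            return name
        except UnicodeDecodeError:
            continue
    return None


# 8,000 ASCII bytes followed by ISO-8859-1 text with a CRLF line ending.
blob = b"x" * 8000 + "Gr\u00fc\u00dfe\r\n".encode("iso-8859-1")

charset = detect_charset(blob)  # only the ASCII prefix is inspected
# Assumption: bytes that are invalid in the detected charset are replaced
# rather than rejected when the whole blob is decoded.
text = blob.decode(charset, errors="replace")
cleaned = re.sub(r"\r(\n)", r"\1", text).encode(charset)
```

In this sketch the detector picks UTF-8, the two ISO-8859-1 bytes are turned into U+FFFD, and the rewritten blob is valid UTF-8 that no longer contains the original characters. Putting ISO-8859-1 first makes the detector return it for every input, because every byte sequence is valid ISO-8859-1, so the order of candidates only moves the problem to repositories with genuine UTF-8 files.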

I "fixed" this by changing the charset order in the detector, but that is definitely not useful for other repositories.

BTW, sbt assembly reports failed tests on the unchanged master, is this normal?

javabrett commented 6 years ago

BTW, sbt assembly reports failed tests on the unchanged master, is this normal?

@stovocor Nope, master build is passing for me and on Travis.

Which commit are you building, using which command and what is the test-failure output?

windmueller commented 6 years ago

Which commit are you building,

Commit 94225fdc6dfea01bc1de7517f5edc4c7ac81d7fd (Tag v1.12.16)

using which command

% sbt assembly

and what is the test-failure output?

[info] Size formatter
[info] - should correctly format FAILED
[info] "1[,]0 KB" was not equal to "1[.]0 KB" (ByteSizeSpecs.scala:55)

I attached the complete output.

sbt-assembly-94225fd.txt

It looks like the test case depends on the current locale (German in my case).

javabrett commented 6 years ago

Tests don't like your locale. Try again with LANG=en_US.UTF-8.

windmueller commented 6 years ago

That works as expected. However, in my opinion the test case should not depend on the locale of the environment.
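To illustrate what a locale-independent size formatter could look like, here is a minimal Python sketch. The function `format_size` and its unit thresholds are hypothetical and are not taken from BFG's ByteSizeSpecs. The point is only that the formatting call used does not consult the process locale, so the decimal separator stays a dot.

```python
import locale


def format_size(num_bytes):
    """Hypothetical formatter; the 'f' format type ignores the process locale."""
    units = ["B", "KB", "MB", "GB"]
    value = float(num_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


# Try to switch to a German numeric locale; it may not be installed,
# in which case the current locale is kept.
try:
    locale.setlocale(locale.LC_NUMERIC, "de_DE.UTF-8")
except locale.Error:
    pass

size = format_size(1024)  # uses "." regardless of the locale set above
```

A locale-aware call such as `locale.format_string` would follow the active locale's decimal separator instead, which is the kind of dependency that makes a test pass or fail depending on the machine.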

vlsi commented 5 years ago

For those who struggle with CRLF / LF processing, you might want to use my build of BFG: https://github.com/vlsi/bfg-repo-cleaner/releases/tag/v1.14.0-vlsi

I have implemented CRLF / LF normalization, which works for me and does not require a single encoding across all files in the repository.
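One way to normalize line endings without caring about the text encoding is to work on raw bytes. The Python sketch below shows that idea only; it is not the implementation in the build linked above, and the function name `normalize_crlf` is made up for illustration. It assumes an ASCII-compatible encoding such as ISO-8859-1 or UTF-8 and would corrupt UTF-16 content.

```python
def normalize_crlf(blob):
    """Replace CRLF with LF on raw bytes, leaving all other bytes untouched.

    Assumption: the content uses an ASCII-compatible encoding, so the byte
    pair 0x0D 0x0A always represents a CRLF line ending.
    """
    return blob.replace(b"\r\n", b"\n")


latin1_line = "Gr\u00fc\u00dfe\r\n".encode("iso-8859-1")
utf8_line = "Gr\u00fc\u00dfe\r\n".encode("utf-8")

latin1_clean = normalize_crlf(latin1_line)  # still ISO-8859-1 bytes
utf8_clean = normalize_crlf(utf8_line)      # still UTF-8 bytes
```

Because nothing is decoded, no charset detection is needed and files with different encodings can be processed in the same run.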