git-for-windows / git

A fork of Git containing Windows-specific patches.
http://gitforwindows.org/

Diff for *.doc files did not work as expected #355

Closed by asdqwezx 9 years ago

asdqwezx commented 9 years ago

Git for Windows 1.9.5 has a nice feature: it compares the text content of .doc files when a diff is calculated. Git 2.5.x treats .doc files like ordinary binary files.

rimrul commented 9 years ago

This behaviour (and similar ones for pdf, rtf and docx) seems to be caused by the commits that edited /etc/gitattributes. This file and its supporting scripts apparently did not make it into the 2.x versions. Could you test this after the following steps? Add these lines to your gitattributes file:

*.doc   diff=astextplain
*.DOC   diff=astextplain
*.pdf   diff=astextplain
*.PDF   diff=astextplain

Download the old script 'astextplain' to your Git bin directory.

Add these lines to your gitconfig:

[diff "astextplain"]
    textconv = astextplain

I can't test this currently since I have no git for windows 2.5 installed.
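
The steps above can also be tried without touching the system-wide files. The following is a minimal sketch, assuming a scratch repository created with mktemp (the path and the file name report.doc are made up for illustration). It writes the attribute lines to the repository's own .gitattributes, sets the driver in the repository config, and then asks Git which diff driver it would pick. An actual diff still needs the astextplain script and its converters on the PATH.

```shell
#!/bin/sh
# Scratch repository; the location is arbitrary and only used for this check.
repo=$(mktemp -d)
git init -q "$repo"

# Repository-level copy of the attribute lines from the comment above.
printf '%s\n' \
    '*.doc   diff=astextplain' \
    '*.DOC   diff=astextplain' \
    '*.pdf   diff=astextplain' \
    '*.PDF   diff=astextplain' > "$repo/.gitattributes"

# Repository-level equivalent of the [diff "astextplain"] section.
git -C "$repo" config diff.astextplain.textconv astextplain

# report.doc is a hypothetical file name; it does not need to exist.
attr=$(git -C "$repo" check-attr diff -- report.doc)
echo "$attr"
git -C "$repo" config --get diff.astextplain.textconv
```

If check-attr does not report astextplain for the file, the attributes file is not being read from where Git expects it.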

Akkuzin commented 9 years ago

Thanks for clarifying! Your method works. But simply adding the scripts is not enough: some underlying utilities are needed (antiword, pdftotext, docx2txt), and installing the tools manually is very inconvenient! Plain-text comparison should work out of the box; it is kind of a dealbreaker for working with document archives.
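
A quick way to see which of these utilities are absent is to probe the PATH. This is a minimal sketch, assuming the command names are exactly the ones mentioned in this thread; missing_tools is a hypothetical helper, not something shipped with Git.

```shell
#!/bin/sh
# Print every command from the argument list that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tool names as mentioned in the thread.
missing_tools antiword pdftotext docx2txt
```

An empty result means all three converters were found.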

rimrul commented 9 years ago

It was not meant as a final solution, but as a check whether it still works. Documents are basically binary files, so conversion to text is not exactly an out-of-the-box feature. Git's main purpose is source code versioning, which means it's optimized for plain text. Since both scripts (astextplain and docx2txt) plus antiword add up to roughly 250 kB here, I don't think @dscho would have much of a problem if we added them back in. I'll take a look at making an installer and comparing the sizes when I find time for that. I'll notify you about any pull requests. EDIT: I can't find pdftotext in github.com/msys/* so maybe this feature will stay missing.

dscho commented 9 years ago

> I can't find pdftotext in github.com/msys/* so maybe this feature will stay missing.

You will find it in the poppler package. You can install it via pacman -S mingw-w64-x86_64-poppler in a 64-bit Git for Windows SDK, but it will take a while and download ~5MB.

Just for fun, I contributed a package definition for Xpdf which also provides a pdftotext.exe, but this still needs the libstdc++ DLL, so I am not sure just how much of a size penalty we would incur.

rimrul commented 9 years ago

My Git 1.9.4 and 2.5.1 currently display the same message when I'm not inside a repository: [screenshot: mingw32, 2015-09-10] [screenshot: mingw64, 2015-09-10]. I will have another look at this inside a repository on Monday. The gitattributes and gitconfig files install fine, though. git config -l returns diff.astextplain.textconv=astextplain as intended. The doc/docx converters add roughly 94 KiB to the installer, but I haven't included pdftotext yet since I don't want to add the whole package. [screenshot: sizes]

EDIT: Just noticed that my Git 2.5.1 looks for ~/.gitattributes instead of /etc/gitattributes. I should probably look at the regular Linux behaviour for this.

rimrul commented 9 years ago

Next Update: Git 2.x searches /$(prefix)/etc/gitattributes instead of /etc/gitattributes. I guess Git 1.x was just built without the prefix. I apparently also forgot to include the antiword mapping files. I'm getting closer.
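
One way to check the mapping independently of the prefix Git was built with is to name an attributes file explicitly through the core.attributesFile setting. This is a minimal sketch, assuming scratch paths created with mktemp (hypothetical, for illustration only). It is not what the installer does; it only confirms that the attribute line itself resolves to the expected driver.

```shell
#!/bin/sh
# Scratch locations; both paths exist only for this demo.
work=$(mktemp -d)
attrs="$work/gitattributes"
printf '%s\n' '*.doc diff=astextplain' > "$attrs"

git init -q "$work/repo"

# Name the attributes file explicitly instead of relying on the
# prefix-dependent lookup. a.doc does not need to exist.
resolved=$(git -C "$work/repo" -c core.attributesFile="$attrs" check-attr diff -- a.doc)
echo "$resolved"
```

If this resolves correctly while a plain check-attr does not, the file is simply not in the location the installed Git searches.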

rimrul commented 9 years ago

I'm still having slight differences in the doc conversion, but I got it running. I've also gotten docx2txt running as intended. I should probably use the unzip package instead of adding unzip to the git-extra package. I guess I should also introduce a separate package for antiword and its 30-ish mapping files.

This is my current Git 2.5.1 result for doc to text conversion:

$ git diff --cached
diff --git a/a.doc b/a.doc
new file mode 100644
index 0000000..f345d74
--- /dev/null
+++ b/a.doc
@@ -0,0 +1,3 @@
+^M
+a.doc^M
+^M

This is the intended result that Git 1.9.4 produces:

$ git diff --cached
WARNING: terminal is not fully functional
diff --git a/a.doc b/a.doc
new file mode 100644
index 0000000..f345d74
--- /dev/null
+++ b/a.doc
@@ -0,0 +1,3 @@
+
+a.doc
+

I would assume it's a conversion issue between CRLF and LF, but I don't know why it would be converted in Git 1.9.4 and not in Git 2.5.1. Any ideas, @dscho? Maybe we could fix this by running the doc file through dos2unix before feeding it to antiword? No results with pdf conversion yet. The current installer for this issue is 171 KiB bigger than the 2.5.1 installer I've built. That size might be wrong, since I think I didn't rebuild git-extra and instead added the required files manually to the appropriate locations.
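
If the stray ^M characters do come from CRLF line endings in the converter output, one alternative to running dos2unix on the input is to strip carriage returns from the text after conversion. This is a minimal sketch, assuming a hypothetical helper named strip_cr; it is not part of astextplain, and whether it reproduces the Git 1.9.4 output exactly is untested.

```shell
#!/bin/sh
# Hypothetical helper: remove carriage returns from text on stdin.
strip_cr() {
    tr -d '\r'
}

# Simulated converter output with CRLF line endings, standing in for antiword.
printf '\r\na.doc\r\n\r\n' | strip_cr
```

In a textconv wrapper this would sit at the end of the pipeline, for example `antiword "$1" | strip_cr`, so the binary input file is never modified.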

TODO:

dscho commented 9 years ago

> I should probably use the unzip package instead of adding unzip to the git-extra package

I would add it here: https://github.com/git-for-windows/build-extra/blob/b1ada75d730cd6b5e74ac202841b46e28a5399f0/make-file-list.sh#L89 (and here: https://github.com/git-for-windows/build-extra/blob/b1ada75d730cd6b5e74ac202841b46e28a5399f0/make-file-list.sh#L73).

In any case, would you have some code to show? If you do, please open a Pull Request (with the prefix "DO NOT MERGE YET:").

rimrul commented 9 years ago

For those who follow this thread, but have not had a look at git-for-windows/build-extra#75: I've opened a pull request for the first version of these changes, but they aren't ready to be merged yet. @dscho and I have created packages for antiword and docx2txt and created the pull requests Alexpux/MSYS2-packages#345 and Alexpux/MINGW-packages#781. Both pull requests have been merged and I'm currently working on the package integration, the CRLF issue and PDF support.