msys2 / MSYS2-packages

Package scripts for MSYS2.
https://packages.msys2.org
BSD 3-Clause "New" or "Revised" License

Environment variable to preserve carriage returns with sed & awk #2315

Open whiteinge opened 3 years ago

whiteinge commented 3 years ago

Describe the issue

I would like to discuss a portable method to preserve carriage returns when using sed and awk.

The use-case is: I'm working on a script to process a file that may have either UNIX line-endings or Windows line-endings. Once the file has been processed, the original line-endings should still be intact.

Background

I have been reading the history behind Cygwin and MSYS2 automatically converting \r\n to \n. I realize this is a contentious topic and I sincerely don't wish to reignite any flame wars. I am hoping we can calmly discuss pragmatic solutions. :sunglasses:

For the sake of completeness and also to make sure I correctly understand the history, here is a short recap of my understanding of discussions that span several years. (Corrections welcome.)

Cygwin once shipped the auto-CR conversion [1], which some people enjoyed. It was eventually identified as non-portable behavior and then fixed [2]; light flame wars ensued but eventually died out. Later, as people noticed it had been removed, the auto-CR conversion was re-added to MSYS2 [3], and more flame wars ensued. However, the argument isn't quite the same for Cygwin and MSYS2. Cygwin attempts to be a faithful Linux environment inside Windows, whereas MSYS2 tries to use Windows APIs and to build native Windows software that works and feels good on that platform [4]. In addition, the auto-CR conversion had been in place long enough that people were relying on it, and a change would break existing scripts.

Possible Approaches

Below are two possible solutions. Other suggestions are very welcome.

  1. Detect the script is running inside MSYS2 and optionally include the -b flag to sed.

    One of my goals is to write a portable script and sed's -b flag is not POSIX so I can't simply include it. However, I could try to detect that the script is running inside MSYS2 and then optionally include the flag.

    E.g.: printf 'foo\r\nbar\r\n' | sed ${MSYS2:+-b} -e '/bar/d' | cat -A

    Questions:

    • What is the most reliable way to detect MSYS2? I see MSYSTEM=MINGW64 and other MinGW variables but nothing specifically for MSYS2.
    • How common are older MSYS2 environments from before this change was added?
  2. Add an environment variable to MSYS2 to toggle the auto-CR conversion behavior.

    E.g.: printf 'foo\r\nbar\r\n' | MSYS2_NO_AUTOCR=1 sed -e '/bar/d' | cat -A

    In the previous discussions linked above there was a suggestion to add an environment variable that could toggle this behavior on or off. It was deemed unnecessary because there are other ways to match carriage returns; however, those are not as applicable to preserving them. While I do like this idea, it's not materially different from the previous suggestion if there is a bullet-proof way to identify affected MSYS2 environments. Edit: on reflection, it is materially different, since it could be done once for all invocations as opposed to needing platform-specific flags for each invocation.

Steps to Reproduce the Problem

Have:

$ printf 'foo\r\nbar\r\n' | sed -e '/bar/d' | cat -A
foo$

Want:

$ printf 'foo\r\nbar\r\n' | sed -e '/bar/d' | cat -A
foo^M$
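
The Have/Want pair above can also be turned into a runtime probe, so a script checks what its sed actually does instead of guessing from the platform. This is only a sketch: `sed_strips_cr` is a hypothetical helper name, and it assumes `od` on the target system reads the pipe without altering carriage returns.

```shell
# Hypothetical helper: succeeds (exit 0) when this sed drops the CR from a
# CRLF-terminated line read on stdin, and fails (exit 1) when the CR survives.
sed_strips_cr() {
    # od -c prints a carriage return as the two characters "\r",
    # so grep looks for a literal backslash followed by "r".
    if printf 'foo\r\n' | sed -e 's/foo/foo/' | od -An -c | grep -q '\\r'; then
        return 1    # CR survived the trip through sed
    fi
    return 0        # CR was removed
}

if sed_strips_cr; then
    echo "sed strips carriage returns here"
else
    echo "sed preserves carriage returns here"
fi
```

A probe like this sidesteps the question of how old the MSYS2 installation is, at the cost of one extra sed invocation at startup.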

Additional Context: Operating System, Screenshots

Thanks for your time!

1480c1 commented 3 years ago

> What is the most reliable way to detect MSYS2? I see MSYSTEM=MINGW64 and other MinGW variables but nothing specifically for MSYS2.

a lot of scripts match uname's output against MINGW* or MSYS* to detect a mingw{64,32} "shell" or an msys2 shell
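
A minimal sketch of that kind of check, using only POSIX `case` pattern matching. `classify_uname` is a hypothetical name, and the sample strings in the comments are assumptions about what `uname -s` prints in each shell rather than verified output.

```shell
# Hypothetical helper: classify a `uname -s` string as "msys", "mingw", or "other".
# The string is passed as an argument so the function can be tried on any system.
classify_uname() {
    case "$1" in
        MSYS*)  echo msys ;;    # assumed form: MSYS_NT-10.0-19042
        MINGW*) echo mingw ;;   # assumed form: MINGW64_NT-10.0-19042
        *)      echo other ;;
    esac
}

platform=$(classify_uname "$(uname -s)")
echo "detected platform: $platform"
```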

whiteinge commented 3 years ago

Thanks for the suggestion. I'm not sure why I didn't think to check uname. You're right: on my system Msys is included in the output (with the -a flag):

MINGW64_NT-10.0-19042 <hostname> 3.1.7-340.x86_64 2020-10-23 13:08 UTC x86_64 Msys

After chewing on this for a bit I realized that if my goal is portability then optionally including per-invocation flags depending on platform doesn't quite meet that goal (description edited to reflect that) since that approach could get fairly messy fairly quickly. Whereas exporting an environment variable once to enable more portable behavior is closer. What are everyone's thoughts on that approach?

1480c1 commented 3 years ago

Btw, I forgot about this one

> What is the most reliable way to detect MSYS2? I see MSYSTEM=MINGW64 and other MinGW variables but nothing specifically for MSYS2.

msys2 sets MSYSTEM to MSYS, at least on my system
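
If MSYSTEM is used as the signal, the first approach from the issue could be sketched roughly like this. `sed_binary_flag` and `SED_BINARY` are hypothetical names, and treating any non-empty MSYSTEM as "running under MSYS2" is an assumption; other environments may set the variable too.

```shell
# Hypothetical helper: print "-b" when the given MSYSTEM value is non-empty,
# and print nothing otherwise.
sed_binary_flag() {
    if [ -n "${1:-}" ]; then
        printf '%s\n' "-b"
    fi
}

# Compute the flag once, near the top of the script.
SED_BINARY=$(sed_binary_flag "${MSYSTEM:-}")

# Left unquoted on purpose: an empty value must expand to no argument at all.
printf 'foo\r\nbar\r\n' | sed $SED_BINARY -e '/bar/d' | od -c
```

Computing the flag once keeps the platform check in one place, but every sed call still has to carry `$SED_BINARY`, which is the clutter the environment-variable proposal is meant to avoid.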

btw, perhaps this might be better suited for the discussions page instead of issues so we can do threading as well /cc @lazka

1480c1 commented 3 years ago

as a note, while trying to figure out where the \r\n is being "replaced" with \n, I found out that \r itself is not being completely removed when running

 printf '\r' | sed --debug 's/z/z/g' | cat -A
SED PROGRAM:$
  s/z/z/g$
INPUT:   'STDIN' line 1$
PATTERN: \r$
COMMAND: s/z/z/g$
PATTERN: \r$
END-OF-CYCLE:$
^M

1480c1 commented 3 years ago

You might want to see if https://github.com/msys2/MSYS2-packages/blob/master/sed/sed-4.4-msys-use-text-mode.patch is causing your issue