go-gitea / gitea

Git with a cup of tea! Painless self-hosted all-in-one software development service, including Git hosting, code review, team collaboration, package registry and CI/CD
https://gitea.com
MIT License
45.27k stars 5.51k forks source link

Gitea file view showing wrong amount of lines for both Unix and Windows/DOS line breaks but differently #27377

Open ell1e opened 1 year ago

ell1e commented 1 year ago

Description

(I was occasionally abrasive in some comments, I apologize. :cherry_blossom: )

Gitea shows a wrong amount of lines for both Unix and Windows/DOS line breaks in the web UI in both edit and code view, and in a very different and inconsistently wrong manner.

:warning: Please note most Unix code editors get Windows/DOS line breaks wrong, including vim and the wc -l command, which counts line breaks characters and not lines, so it doesn't give correct line counts for Windows/DOX files. Get the correct lines displayed reliably: always check Windows/DOS \r\n line break files with notepad.exe on Windows, and always check Unix \n line break files with vim on Linux to show the correct amount of lines there. Or check with a hex editor what's really going on per file.

Gitea Version

1.22.0+dev-74-g4ab597f47

Can you reproduce the bug on the Gitea demo site?

Yes

Log Gist

No response

Screenshots

No response

Git Version

whatever is used on try.gitea.io

Operating System

whatever is used on try.gitea.io

How are you running Gitea?

it's on try.gitea.io

Database

None

wxiaoguang commented 1 year ago

The behavior has been documented in code:

image

And see:

ell1e commented 1 year ago

Well all I can say is that it's wrong on 1.22.0+dev-74-g4ab597f47, at the exact locations and with the exact files I described and also linked above. I suggest to click the links and see for yourself, that should show it very well.

(Note added later: I was occasionally abrasive in some comments, I apologize. :cherry_blossom: )

wxiaoguang commented 1 year ago
~$ curl -s https://try.gitea.io/blablablablab/BlaTest/raw/branch/main/unix.c | xxd
00000000: 6165 6a69 6165 6c69 7261 2074 6869 7320  aejiaelira this
00000010: 6669 6c65 2069 7320 6120 7369 6e67 6c65  file is a single
00000020: 206c 696e 650a                            line.

According to the definition, it has two lines, the second line is an empty line.

image

~$ curl -s https://try.gitea.io/blablablablab/BlaTest/raw/branch/main/windows.c | xxd
00000000: 0d0a 6162 6320 7468 6973 2066 696c 6520  ..abc this file
00000010: 6973 2033 206c 696e 6573 0d0a            is 3 lines..

According to the definition, it has three lines, the first line is empty, the third line is empty.

image


Everything is as expectation.

wxiaoguang commented 1 year ago

The reason for using 2 lines but not 1 line for a\n:

ell1e commented 1 year ago

According to the definition, it has two lines, the second line is an empty line.

The hex dump you gave for the first file ends in 0x65 0x0A as you can see yourself, which is equivalent to an ending of "e\n". There is no other line break in that hex dump. Therefore no, it's one line, ending in a Unix line terminator which is how Unix line terminators work. (I didn't make the rules.) So this file is 1 line. Edit: sorry, had the wrong number here at first.

In your second hex dump, you can see two instances of 0x0d 0xDA, which is equivalent to "\r\n". Two line separators, which splits the file into three lines, according to how Windows line separators work. So this file is 3 lines.

It's how these files natively work on the respective systems. Therefore, either the code view, or the edit view, is wrong on both of these files in some circumstances, and the line count you posted for the first file is wrong as well. Also in your own screenshot, the third file shows as "3 lines", yet only 2 are shown in the listing below, which seems incorrect.

wxiaoguang commented 1 year ago

\n in POSIX is equivalent to \r\n in Windows, they are all EOL.

So if you can accept that the windows.c has 3 lines, I think you could also accept that the unix.c has 2 lines.

The rules are open. You could propose your rules and PRs to improve the behavior, if most people like, and the content a/a\n could be distinguished clearly.

ell1e commented 1 year ago

This is the only way to distinguish content a (1 line) and a\n (2 lines) at the moment.

According to Unix file semantics, I'm pretty sure these are both considered 1 line. The first one would be an incorrectly unterminated single line, but nevertheless, these are both just 1 line. I think GitHub shows some sort of icon to the side if the terminator for Unix files is incorrectly missing.

(Note added later: I was occasionally abrasive in some comments, I apologize. :cherry_blossom: )

wxiaoguang commented 1 year ago

. I think GitHub shows some sort of icon to the side if the terminator for Unix files is incorrectly missing.

That's also what I have proposed in https://github.com/go-gitea/gitea/pull/19967#issuecomment-1184115968 .

ell1e commented 1 year ago

So if you can accept that the windows.c has 3 lines, I think you could also accept that the unix.c has 2 lines.

I showed the screenshot. \n on Linux is considered a line terminator, so no new line is considered to start after it, you can see this in vim.

On Windows, \r\n is considered a line separator, see in notepad.exe.

It's just what some people making these systems decided on, so it's really not my fault.

ell1e commented 1 year ago

Screenshot from 2023-10-01 16-56-25

Maybe this helps, in GNOME's gedit text editor on Linux it's easier to see that it also considers this file just 1 line. (The file ending in \n as a Unix line terminator.) This is for unix.c.

And Notepad.exe should be considered authoritative and hopefully easy to see on Windows as well that it has 3 lines for the windows.c file, here it is again:

wxiaoguang commented 1 year ago

I do not understand how "vim" / "gedit" / "notepad" could be considered as "standard". They just have their quirks.

Have you tried VSCode? As a developer, I like VSCode's behavior more:

image

ell1e commented 1 year ago

VS Code is incorrect, which is why I showed the native, established editors for each respective platform. Most cross-platform editors (like VS Code) handle one of the two platforms incorrectly.

wxiaoguang commented 1 year ago

Gitea is also designed for cross-platform. And personally I consider VSCode is right, vim is right, notepad is right, no one is incorrect, they just have their quirks.

Anyway, the problem is open, while personally I do think current approach is good enough. Feel free to propose new rules and PRs.

ell1e commented 1 year ago

I did some quick mockups to show what I think the correct behavior would look like.

File view unix.c fixed:

Screenshot from 2023-10-01 17-03-18

Edit view unix.c fixed:

Screenshot from 2023-10-01 17-06-05

File view windows.c fixed:

Screenshot from 2023-10-01 17-00-46

Edit view windows.c fixed (was already correct):

Screenshot from 2023-10-01 17-05-23

ell1e commented 1 year ago

Well, one last comment but in my opinion it's pretty telling: VS Code was created by Microsoft who primarily deal with windows, and surprise (or rather, not surprise) it shows the \n non-Windows files wrongly. VIM was created on Unix, and shows \r\n files wrong. In my opinion, the only rule that makes sense is to make it display what the native established editors and tools do, as found on the platform native to the line ending type.

There is no other way to make it more consistent. You'll always find some special editors, especially the cross-platform ones that originated on a different platform, that will show some platform's file endings slightly incorrect. Simply because this is a pretty common bug.

I'm just trying to help out getting it fixed, I understand that a lot of programs get this wrong.

wxiaoguang commented 1 year ago

VS Code was created by Microsoft who primarily deal with windows, and surprise (or rather, not surprise) it shows the \n non-Windows files wrongly.

That's not true. All modern editors have consistent behavior: treating "\n" or "\r\n" as EOL. They are the same.

VIM was created on Unix, and shows \r\n files wrong.

VIM just has too many technical debts, it just doesn't want to change its behavior for handling "\r"

the only rule that makes sense is to make it display what the native established editors and tools do, as found on the platform native to the line ending type

Have you considered about a file which contains both \n and \r\n?

There is no other way to make it more consistent. You'll always find some special editors, especially the cross-platform ones that originated on a different platform, that will show some platform's file endings slightly incorrect. Simply because this is a pretty common bug.

It is consistent for modern editors treating "\r\n" and "\n" equivalently.

And have you tried the Linux world famous editor emacs?

image

eeyrjmr commented 1 year ago

That standard behaviour when using git on windows is during the install is to select the option with the most compatibility

"Checkout Windows-Style, Commit Unix-Style line endings"

The moment you have cross-OS users committing to the same repository this is the PRIMARY option that windows users MUST select to keep the repository in a sane consistent state between users.

Can you confirm you have this set ( core.autocrlf = true )

ell1e commented 1 year ago

All modern editors have consistent behavior: treating "\n" or "\r\n" as EOL. They are the same.

You're not saying what EOL means, I suspect you mean line separator, so actually not EOL/end of line?

Beyond vim and GNOME Text editor, it's all the basic command line tools like cat, wc -l, and so on that also all think that \n is a line terminator, not a separator, for Unix/Linux files. I feel like maybe the basic core tools have more weight. Emacs seems to have been based on a cross-platform one as well rather than one written originally for Unix primarily, so I feel like it doesn't count. Otherwise at least make it consistent, gitea file view violates that. Edit: shortened for brevity.

@eeyrjmr this isn't really related, I think. Open the files in a hex editor for the byte contents.

ell1e commented 1 year ago

After some more thinking, I guess I can see a point for going with a modern same interpretation.

wxiaoguang commented 1 year ago

I feel like maybe the basic core tools have more weight with being correct than all these modern cross-platform editors that may have cared less about getting such details right?

There is no standard in "basic core" either. You see:

$ echo -n a > a.txt
$ wc -l a.txt
0 a.txt
# while "vim a.txt" still shows 1 line

There is no standard, no official rule.

Emacs seems to have been based on a cross-platform one as well rather than one written originally for Unix primarily, so I feel like it doesn't count.

According to various public information, emacs was initially designed for POSIX, the native tool.

gitea violates this on the file view

Since there is no standard, then no violation.

eeyrjmr commented 1 year ago

@ell1e it is related though as this is the correct mitigation. The bare git repository (irrespective of the forge) must be consistent and that is why on windows gitforwin recommends (checkout windows, checkin unix) as this way the bare host, linux checkout and windows checkout all appear consistent.

There is no standard except at the OS level and thus it must be mitigated. Sure some editors will fluff this to appear "consistent" but it doesn't change what is at byte level

ell1e commented 1 year ago

Why are you using echo -n? That explicitly strips the last \n so that'll give you a wrong result.

emacs was initially designed for POSIX, the native tool.

emacs was apparently based on the TECO editor which wasn't for primarily POSIX or Unix, if i'm not reading it wrong. Maybe it kept the different handling for compatibility, I wouldn't know.

ell1e commented 1 year ago

After some searching I found that POSIX actually seems to define it, so there even is a standard:

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206

3.206 Line

A sequence of zero or more non- characters plus a terminating character.

That makes "aejiaelira this file is a single line\n" 1 line. Does this help? I hope that windows.c is 3 is undisputed.

ell1e commented 1 year ago

(Edit: sorry that was a too strong wording by me, and a misunderstanding on my side anyway. I was occasionally abrasive in some comments, I apologize. :cherry_blossom: )

Since there is no standard, then no violation.

I don't understand that. Gitea agreeing with itself how many files a line has is a too high bar? I'm confused.

wxiaoguang commented 1 year ago

Why are you using echo -n? That explicitly strips the last \n so that'll give you a wrong result.

I do not think it is "wrong". It's just a normal file without last trailing \n. I was just showing the real line-counting problem for "incomplete line". You can also have files like hello world\nnice day or def test():\n print('test'), which do not have the last trailing EOL.

And see below.

After some searching I found that POSIX actually seems to define it, so there even is a standard:

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206

3.206 Line A sequence of zero or more non- characters plus a terminating character.

That makes "aejiaelira this file is a single line\n" 1 line. Does this help? I hope that windows.c is 3 is undisputed.

I wouldn't say this is a feasible standard for "line-counting display".

For example: for a file content which contains an "Incomplete Line", like this: a\nb, the first line is a\n, the second incomplete line is b. Do you want to see "1 line" or "2 lines"?

I hope that windows.c is 3 is undisputed.

The behavior for POSIX and Windows should be the same. Otherwise I do not see it useful for cross-platform, for example:

What if you have a file which contains both \n and \r\n? Do you want to see it as a POSIX file or a Windows file.

I believe they should have the same behavior: only consider the \n or \r\n as line-separator. If people don't like the last empty line, then do not render it (see below).

Since there is no standard, then no violation.

I don't understand that. Gitea agreeing with itself how many files a line has is a too high bar? I'm confused.

There is no "too high bar", I have explained above: so far so good, good enough, for most cases.

The Gitea's definition in code is: empty: 0 lines; "a": one line; "a\n": two lines; "a\nb": two lines. So Gitea only considers the \n / \r\n (EOL) as line-separator: there are a line before it and a line after it. Then, the last \n starts an empty new line. That's Gitea's definition (and that's also most modern editors' definition), it's clear enough.

And why Gitea doesn't show the last empty line in UI: Gitea intentionally hides the last empty line when rendering the content because it makes most users happy and makes the UI better. So it won't render the last empty line for Windows files either.

Gitea only reports the existence of the last empty new line by "xxx lines" (one more line then rendered) at the moment, so users could still know whether there is an empty last line.

And yes, there are still some improvements to do, eg: use an icon mark to indicate the last "incomplete line", then Gitea could report the rendered lines as line-counter, then it will be more consistent.

ell1e commented 1 year ago

For example: for a file content which contains an "Incomplete Line", like this: a\nb, the first line is a\n, the second incomplete line is b. Do you want to see "1 line" or "2 lines"?

This is where github shows the "missing line terminator" icon, I think (and both lines).

For a mixed file you'd probably also add an indicator of some kind, at the top of the file or something.

As for Gitea right now, it shows different line counts and lines, and even different lines in file view and editor, for the same file. No matter how many lines a file should have I think that's the worst of all worlds, personally.

ell1e commented 1 year ago

If you want to see "2 lines" (that's also what vim shows), then your referred standard is not applicable.

After re-reading, I'm confused by the remark. POSIX basically says a\n is 1, and a\nb\n is 2. It just doesn't say what a\nb is, and as you say yourself, most editors seem to agree it's 2. Where's the problem? (And Windows' Notepad.exe says a\r\nb\r\n is 3, and a\r\nb is 2, also pretty clear.)

silverwind commented 8 months ago

I think this is a easy fix: Show the number of lines that are rendered, which also matches GitHub's line count. Trailing newlines are never counted, regardless whether they are LF or CRLF. So in JS code it would be str.replace(/\r?\n$/g, '').split(/\r?\n/).length.