zlepper opened 3 years ago
I am seeing a similar behaviour. I did some digging in the code: promtail uses the hpcloud/tail project for tailing files, which has not received any updates in the last 3 years and is considered dead.
The project was continued as nxadm/tail, which is still maintained. I created a local build replacing the hpcloud implementation with nxadm (plus the needed code adjustments), but that library suffers from the same locking issue.
So the problem is the Go library used to emulate tail -f on Windows.
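For anyone curious, the swap itself is essentially an import change, since nxadm/tail keeps the hpcloud API. A minimal sketch (the file name is hypothetical, and this is not the actual promtail integration):

package main

import (
    "fmt"

    "github.com/nxadm/tail" // maintained continuation of github.com/hpcloud/tail
)

func main() {
    // Follow the file and re-open it if it gets rotated; Poll uses polling
    // instead of filesystem notifications.
    t, err := tail.TailFile("app.log", tail.Config{Follow: true, ReOpen: true, Poll: true})
    if err != nil {
        panic(err)
    }
    for line := range t.Lines {
        fmt.Println(line.Text)
    }
}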
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
Ping.
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
Ping.
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
Ping
Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.
We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.
Stalebots are also emotionless and cruel and can close issues which are still very relevant.
If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.
We regularly sort for closed issues which have a stale label sorted by thumbs up.
We may also:
- Mark an issue as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
- Add a keepalive label to silence the stalebot if the issue is very common/popular/important.
We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.
Ping
Same bug here: promtail needs to read a file that also has to be written to, but promtail blocks the file, so my process can't write further log entries into it. I'm on Windows 10 20H2.
I have the same problem: promtail is locking the file, so it cannot be rotated or deleted.
Similar here: pods get stuck in "Terminating" status because the .log file cannot be removed.
C:\>.\promtail-windows-amd64.exe --version
promtail, version 2.4.2 (branch: HEAD, revision: 525040a32)
build user: root@5d9e7a4c92e6
build date: 2022-01-12T16:48:53Z
go version: go1.16.2
platform: windows/amd64
I did a simple test with:
package main

import (
    "fmt"

    /*"github.com/nxadm/tail"*/
    "github.com/hpcloud/tail"
)

func main() {
    // Tail the file from the beginning, polling for changes and re-opening it on rotation.
    t, err := tail.TailFile("kermit.log", tail.Config{
        Follow:    true,
        Poll:      true,
        ReOpen:    true,
        MustExist: true,
        Location:  &tail.SeekInfo{Offset: 0, Whence: 0},
    })
    if err != nil {
        panic(err)
    }
    for line := range t.Lines {
        fmt.Println(line.Text)
    }
}
I'm not able to reproduce the locking of the files with that.
Apologies for never leaving any specific comments on this issue, but that's largely because I don't really have great news here. The problem is in how Windows filesystems work.
On most Linux filesystems, when you open a file you get a "handle" to that file which still allows operations like rename/delete to be performed on it. The underlying file is kept available as long as there are processes holding handles to it. Thus you can delete or rename a file that is in use by a process without issue on most (maybe all) *nix filesystems.
Windows, on the other hand, is not this flexible: when a file is opened for reading it basically gets an exclusive lock, which then blocks other operations like deleting or renaming.
"Fixing" this would require either coordinating a file rotation with a restart of promtail, or I guess some kind of periodic close/re-open of files within promtail?
I'm not sure if anyone has more experience with Windows and what other solutions are available, but I don't think there is a simple fix to this, as the issue stems from the filesystem behavior on Windows not being nearly as flexible as other filesystems...
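To make the Linux behaviour concrete, here is a small sketch (the scratch file name is made up) that deletes a file while another descriptor still has it open and then keeps reading from it:

package main

import (
    "fmt"
    "os"
)

func main() {
    // Create a scratch file for the demonstration.
    if err := os.WriteFile("demo.log", []byte("hello from the old file\n"), 0o644); err != nil {
        panic(err)
    }

    f, err := os.Open("demo.log")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // Removing the file while f is open succeeds on Linux; the directory entry
    // disappears but the data stays readable until the last descriptor is closed.
    if err := os.Remove("demo.log"); err != nil {
        panic(err)
    }

    buf := make([]byte, 64)
    n, _ := f.Read(buf)
    fmt.Printf("still readable after delete: %q\n", buf[:n])
}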
That is not entirely correct. When you open a file on Windows you can specify the sharing mode, i.e. what other processes should be allowed to do with the file while you hold it open, and one of those modes does allow deletion without breaking the current process.
Ooooh, interesting, I wonder now how/if we can do this from Go
@zlepper Looks like the library is already doing this: https://github.com/grafana/loki/blob/5e134f3b6d33b2f9b485cde7a70f160704eae5ba/vendor/github.com/hpcloud/tail/winfile/winfile.go#L39
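For reference, the key detail in that shim is passing FILE_SHARE_DELETE when the file is opened. A standalone sketch of the same idea using Go's standard syscall package (Windows-only; the path is hypothetical and the openShared helper is mine, not the vendored source):

package main

import (
    "fmt"
    "os"
    "syscall"
)

// openShared opens path for reading while allowing other processes to read,
// write, delete, or rename the file. FILE_SHARE_DELETE is the flag that lets
// another process delete or rename the file while we hold a handle to it.
func openShared(path string) (*os.File, error) {
    p, err := syscall.UTF16PtrFromString(path)
    if err != nil {
        return nil, err
    }
    h, err := syscall.CreateFile(p,
        syscall.GENERIC_READ,
        syscall.FILE_SHARE_READ|syscall.FILE_SHARE_WRITE|syscall.FILE_SHARE_DELETE,
        nil,
        syscall.OPEN_EXISTING,
        syscall.FILE_ATTRIBUTE_NORMAL,
        0)
    if err != nil {
        return nil, err
    }
    return os.NewFile(uintptr(h), path), nil
}

func main() {
    f, err := openShared(`C:\logs\app.log`)
    if err != nil {
        panic(err)
    }
    defer f.Close()
    fmt.Println("opened with FILE_SHARE_DELETE; the writer can rotate or delete the file")
}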
This has historically been a known issue with Go, and we've verified that our tailing library implements all the necessary workarounds to not lock files. I've tested across versions of promtail and Go.
I've attempted to reproduce this on a Windows machine, without luck, in the following ways:
- I got a locking error when I tailed a directory and continuously wrote to a log file in it. When I tried to delete the file, Windows complained that promtail held a lock even though another process was holding it open. I don't think this is our exact issue, however, because the file was never in the DeletePending state.
- I am unable to get a file stuck in the DeletePending state.
I think we have an experiment we can do where promtail checks for the DeletePending state when tailing, but we'll need help to test it. @zlepper @perosb any chance you or anyone else may be able to test?
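For anyone who wants to poke at this by hand, here is a rough sketch of how the DeletePending flag can be read from Go via GetFileInformationByHandleEx. This is only an illustration, not the experimental change itself; the deletePending helper, the fileStandardInfo constant, the struct below, and the file path are all defined locally for the sketch, and it requires golang.org/x/sys/windows:

package main

import (
    "fmt"
    "os"
    "unsafe"

    "golang.org/x/sys/windows"
)

const fileStandardInfo = 1 // FileStandardInfo information class

// fileStandardInformation mirrors the Win32 FILE_STANDARD_INFO layout.
type fileStandardInformation struct {
    AllocationSize int64
    EndOfFile      int64
    NumberOfLinks  uint32
    DeletePending  uint8
    Directory      uint8
    _              [2]byte // padding
}

// deletePending reports whether a delete has been requested on f's file.
func deletePending(f *os.File) (bool, error) {
    var info fileStandardInformation
    err := windows.GetFileInformationByHandleEx(
        windows.Handle(f.Fd()),
        fileStandardInfo,
        (*byte)(unsafe.Pointer(&info)),
        uint32(unsafe.Sizeof(info)),
    )
    if err != nil {
        return false, err
    }
    return info.DeletePending != 0, nil
}

func main() {
    f, err := os.Open(`C:\logs\app.log`)
    if err != nil {
        panic(err)
    }
    defer f.Close()
    pending, err := deletePending(f)
    if err != nil {
        panic(err)
    }
    fmt.Println("DeletePending:", pending)
    if pending {
        // Closing the handle lets Windows complete the pending delete.
        f.Close()
    }
}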
@MasslessParticle I'm pretty sure I can still reproduce the issue on one of our internal servers, so yes, I would definitely be up to help testing.
@zlepper quick question, how does your file rotation work on the affected servers? Is it the process that generates the files that does the rotation or is there an external process which does it?
@slim-bean it's the same process. Specifically, it creates the file, writes some stuff, and at some point discovers that the file is over a configured threshold, at which point it closes the file and moves on to a new one. After it has moved to the new file, it lists the files in the active log directory and then attempts to delete older log files until the file count is below a configured limit.
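To help anyone trying to reproduce, here is a sketch that imitates that rotation scheme (the directory, thresholds, and file names are made up; this is not our actual application):

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "sort"
    "time"
)

const (
    logDir       = `C:\logs`
    maxFileBytes = 1 << 20 // rotate once a file passes ~1 MiB
    maxFiles     = 5       // then delete the oldest files until at most 5 remain
)

func main() {
    for i := 0; i < 20; i++ {
        name := filepath.Join(logDir, fmt.Sprintf("app-%06d.log", i))
        f, err := os.Create(name)
        if err != nil {
            panic(err)
        }
        // Write lines until the file crosses the size threshold, then close it
        // and move on to the next file.
        var written int64
        for written < maxFileBytes {
            n, err := fmt.Fprintf(f, "%s some log line\n", time.Now().Format(time.RFC3339Nano))
            if err != nil {
                panic(err)
            }
            written += int64(n)
        }
        f.Close()

        // List the directory and delete older files until under the count limit.
        files, err := filepath.Glob(filepath.Join(logDir, "app-*.log"))
        if err != nil {
            panic(err)
        }
        sort.Strings(files)
        for len(files) > maxFiles {
            // On the affected servers this is the delete that promtail's handle
            // blocks, leaving the file stuck in the DeletePending state.
            if err := os.Remove(files[0]); err != nil {
                fmt.Println("delete failed:", err)
            }
            files = files[1:]
        }
    }
}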
@zlepper I've got an experimental branch with a change to tail that should detect a file in the DeletePending state and close it. I've tested it on my Windows machine to verify that it doesn't break anything, but I can't reproduce the issue, so I don't know if it actually fixes anything.
https://github.com/grafana/loki/compare/main...tpatterson/promtail-windows-experiment
Are you set up to build Go projects?
Yes, I can set up and build locally; I'll try your branch and see what happens.
@MasslessParticle Of course I can't reproduce the original locking issue at all right now :/ Just wanted to verify before I tested the branch you provided. I will keep trying during the week, just so you know I haven't forgotten.
@zlepper Thank you!
@zlepper / @MasslessParticle We are currently experiencing this problem in production, causing logs to back up on servers. We can also deploy the build and test it out. It is a continuous problem, so it should reproduce pretty quickly.
@tiger5226 Any testing would be much appreciated!
Hi Team,
The issue persists with the branch, @MasslessParticle.
We have had it running in production for a few days now. This morning we noticed that the size of the log folder was increasing from the starting point, so we tested a restart, and sure enough the log files were removed upon restart. So it is still holding on to them even with the changes.
As a temporary mitigation we are going to see if we can run it from the Task Scheduler and basically restart it daily, to see if this reduces the build-up of log files. The chief concern here is that the files build up so much that the server runs out of disk space. This has already happened to us in production, which is why I found this issue thread.
Any other suggestions? Happy to help out here, for example if you come up with a debug build, etc.
@tiger5226 Interesting, I wonder what's unique about your environment. Can you check whether the file is in the DeletePending state when it's stuck?
OK, before putting the auto-restart in, we will check for the DeletePending state.
@MasslessParticle We let it run for a couple of days and checked the state of the files with procmon. They are indeed in the DeletePending state. When we restart the binary, they get deleted.
Another interesting note: I do believe the environment may have an impact here. Even without this PR, there are machines that do not have this problem; it is not universal. Before you ask, we tried to find a commonality, but the machines are nearly identical and the software producing the log files is the same as well. It's a bit of a needle in a haystack trying to find a meaningful difference. If you have any clues about this specific issue, I am happy to investigate further.
Describe the bug: See #2743 and #2209, both of which were closed because they are "stale"; however, they are very much valid bug reports and are currently killing servers for us...
To Reproduce: Steps to reproduce the behavior:
Expected behavior: I expect to be able to delete the log files without promtail holding a lock on them.
Environment: