grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Promtail keeps files locked #3668

Open zlepper opened 3 years ago

zlepper commented 3 years ago

Describe the bug: See #2743 and #2209, both of which were closed as "stale" even though they are very much valid bug reports, and this is currently killing servers for us...

To Reproduce: Steps to reproduce the behavior:

  1. Started Loki 2.2.1
  2. Started Promtail 2.2.1 to tail our log directory recursively: 'G:\LogFiles\Digizuite\dam.dev.digizuite.com\*\.log'
  3. Create a lot of logs
  4. Attempt to delete said logs manually

Expected behavior: I expect to be able to delete the log files without Promtail holding a lock on them.

Environment:

mateinavidad commented 3 years ago

I am seeing similar behaviour. I did some digging in the code: Promtail uses the hpcloud/tail project for tailing files, which has not received any updates in the last 3 years and is considered dead.

The project was revamped as nxadm/tail, which is still maintained. I created a local build replacing the hpcloud implementation with nxadm (plus the needed code adjustments), but that library suffers from the same locking issue.

So the problem is in the Go library used to emulate tail -f on Windows.
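
For reference, a minimal sketch of what tailing via the nxadm/tail fork looks like (assuming its drop-in, hpcloud/tail-compatible API; this is illustrative, not the actual local build described above):

package main

import (
    "fmt"

    // Maintained fork of the unmaintained hpcloud/tail.
    "github.com/nxadm/tail"
)

func main() {
    // The API mirrors hpcloud/tail, so the swap is mostly an import change;
    // the Windows locking behaviour is unchanged.
    t, err := tail.TailFile("app.log", tail.Config{Follow: true, ReOpen: true})
    if err != nil {
        panic(err)
    }
    for line := range t.Lines {
        fmt.Println(line.Text)
    }
}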

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

zlepper commented 3 years ago

Ping.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

zlepper commented 3 years ago

Ping.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

zlepper commented 3 years ago

Ping

stale[bot] commented 3 years ago

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.

zlepper commented 3 years ago

Ping

vittoriopelle commented 2 years ago

Same bug here: I have a file that still needs to be written to, but Promtail locks it, so my process can't write any more log lines into it. I'm on Windows 10 20H2.

naveenkumarsp commented 2 years ago

I have the same problem: Promtail locks the file, so it cannot be rotated or deleted.

perosb commented 2 years ago

Similar here: pods get stuck in "Terminating" status because the .log file cannot be removed.

C:\>.\promtail-windows-amd64.exe --version
promtail, version 2.4.2 (branch: HEAD, revision: 525040a32)
  build user:       root@5d9e7a4c92e6
  build date:       2022-01-12T16:48:53Z
  go version:       go1.16.2
  platform:         windows/amd64

perosb commented 2 years ago

I did a simple test with:

package main

import (
    "fmt"

    // "github.com/nxadm/tail" // maintained fork with the same API
    "github.com/hpcloud/tail"
)

func main() {
    // Follow the file like `tail -f`, polling for changes, reopening on
    // rotation, and starting from the beginning of the file.
    t, err := tail.TailFile("kermit.log", tail.Config{
        Follow:    true,
        Poll:      true,
        ReOpen:    true,
        MustExist: true,
        Location:  &tail.SeekInfo{Offset: 0, Whence: 0},
    })
    if err != nil {
        panic(err)
    }
    for line := range t.Lines {
        fmt.Println(line.Text)
    }
}

I'm not able to reproduce the locking of the files with that.

slim-bean commented 2 years ago

Apologies for never leaving any specific comments on this issue, but that's largely because I don't really have great news here. The problem is in how Windows filesystems work.

On most Linux filesystems, when you open a file you get a "handle" to that file, which still allows operations like rename/delete to be performed on it. The underlying file is kept available as long as there are processes holding handles to it. Thus, on most (maybe all) *nix filesystems, you can delete or rename a file that is in use by a process without issue.

Windows, on the other hand, is not this flexible: when a file is opened for reading it basically gets an exclusive lock, which then blocks other operations like deleting or renaming.

"Fixing" this would require either coordinating a file rotation with a restart of Promtail, or I guess some kind of periodic close/re-open of files within Promtail?

I'm not sure if anyone has more experience with Windows and what other solutions are available, but I don't think there is a simple fix, as the issue stems from the filesystem behavior on Windows not being nearly as flexible as on other filesystems...
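
To illustrate the Unix behaviour described above, here is a minimal standalone sketch (not Promtail code) showing that a file can be unlinked while a handle to it is still open and readable:

//go:build !windows

package main

import (
    "fmt"
    "os"
)

func main() {
    f, err := os.CreateTemp("", "tail-demo-*.log")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    if _, err := f.WriteString("written before the unlink\n"); err != nil {
        panic(err)
    }

    // Remove the file while it is still open. On Linux this succeeds and the
    // data stays reachable through the existing handle until it is closed.
    if err := os.Remove(f.Name()); err != nil {
        panic(err)
    }

    if _, err := f.Seek(0, 0); err != nil {
        panic(err)
    }
    buf := make([]byte, 64)
    n, _ := f.Read(buf)
    fmt.Printf("still readable after unlink: %q\n", buf[:n])
}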

zlepper commented 2 years ago

> Apologies for never leaving any specific comments on this issue, but that's largely because I don't really have great news here. The problem is in how Windows filesystems work.
>
> On most Linux filesystems, when you open a file you get a "handle" to that file, which still allows operations like rename/delete to be performed on it. The underlying file is kept available as long as there are processes holding handles to it. Thus, on most (maybe all) *nix filesystems, you can delete or rename a file that is in use by a process without issue.
>
> Windows, on the other hand, is not this flexible: when a file is opened for reading it basically gets an exclusive lock, which then blocks other operations like deleting or renaming.
>
> "Fixing" this would require either coordinating a file rotation with a restart of Promtail, or I guess some kind of periodic close/re-open of files within Promtail?
>
> I'm not sure if anyone has more experience with Windows and what other solutions are available, but I don't think there is a simple fix, as the issue stems from the filesystem behavior on Windows not being nearly as flexible as on other filesystems...

That is not entirely correct. When you open a file on Windows you can specify the sharing mode, i.e. what other processes are allowed to do with the file, and one of these modes does allow deletion without breaking the current process.
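
As a hedged illustration of the sharing-mode point (a sketch using golang.org/x/sys/windows, not Promtail's actual code; openShared and the path are made up for the example), opening a file with FILE_SHARE_DELETE lets other processes rename or delete it while it is being read:

//go:build windows

package main

import (
    "fmt"
    "os"

    "golang.org/x/sys/windows"
)

// openShared opens path for reading while still allowing other processes to
// read, write, rename, or delete it; FILE_SHARE_DELETE is the important flag.
func openShared(path string) (*os.File, error) {
    p, err := windows.UTF16PtrFromString(path)
    if err != nil {
        return nil, err
    }
    h, err := windows.CreateFile(
        p,
        windows.GENERIC_READ,
        windows.FILE_SHARE_READ|windows.FILE_SHARE_WRITE|windows.FILE_SHARE_DELETE,
        nil,
        windows.OPEN_EXISTING,
        windows.FILE_ATTRIBUTE_NORMAL,
        0,
    )
    if err != nil {
        return nil, err
    }
    return os.NewFile(uintptr(h), path), nil
}

func main() {
    f, err := openShared(`C:\logs\app.log`) // hypothetical path
    if err != nil {
        panic(err)
    }
    defer f.Close()
    fmt.Println("opened with a sharing mode that permits deletion")
}

Note that even with FILE_SHARE_DELETE, a deleted file only enters a DeletePending state and is not actually removed until the last open handle is closed, which is relevant to the discussion below.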

slim-bean commented 2 years ago

> That is not entirely correct. When you open a file on Windows you can specify the sharing mode, i.e. what other processes are allowed to do with the file, and one of these modes does allow deletion without breaking the current process.

Ooooh, interesting, I wonder now how/if we can do this from Go

MasslessParticle commented 2 years ago

@zlepper Looks like the library is already doing this: https://github.com/grafana/loki/blob/5e134f3b6d33b2f9b485cde7a70f160704eae5ba/vendor/github.com/hpcloud/tail/winfile/winfile.go#L39

MasslessParticle commented 2 years ago

This has historically been a known issue with Go, and we've verified that our tailing library implements all the necessary workarounds to avoid locking files.

I've tested across versions of Promtail and Go.

I've attempted to reproduce this on a Windows machine, without luck, in the following ways:

I got a locking error when I tailed a directory and continuously wrote to a log file in it. When I tried to delete the file, Windows complained that Promtail held a lock even though another process was holding it open. I don't think this is our exact issue, however, because the file was never in the DeletePending state.

I am unable to get a file stuck in the DeletePending state.

I think we have an experiment we can do where Promtail checks for DeletePending while tailing, but we'll need help to test it. @zlepper @perosb any chance you or anyone else could test?
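
For context, checking whether an open file is in the DeletePending state can be done with the Win32 FILE_STANDARD_INFO query. Here is a minimal sketch (assuming golang.org/x/sys/windows; the struct and information-class constant are defined locally, and isDeletePending is a made-up helper name, not code from the experimental branch):

//go:build windows

package main

import (
    "fmt"
    "os"
    "unsafe"

    "golang.org/x/sys/windows"
)

// fileStandardInfo mirrors the Win32 FILE_STANDARD_INFO structure.
type fileStandardInfo struct {
    AllocationSize int64
    EndOfFile      int64
    NumberOfLinks  uint32
    DeletePending  uint8
    Directory      uint8
}

// fileStandardInfoClass is the FILE_INFO_BY_HANDLE_CLASS value for FileStandardInfo.
const fileStandardInfoClass = 1

// isDeletePending reports whether a delete has been queued on the open file.
func isDeletePending(f *os.File) (bool, error) {
    var info fileStandardInfo
    err := windows.GetFileInformationByHandleEx(
        windows.Handle(f.Fd()),
        fileStandardInfoClass,
        (*byte)(unsafe.Pointer(&info)),
        uint32(unsafe.Sizeof(info)),
    )
    if err != nil {
        return false, err
    }
    return info.DeletePending != 0, nil
}

func main() {
    f, err := os.Open(`C:\logs\app.log`) // hypothetical path
    if err != nil {
        panic(err)
    }
    defer f.Close()

    pending, err := isDeletePending(f)
    if err != nil {
        panic(err)
    }
    fmt.Println("DeletePending:", pending)
}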

zlepper commented 2 years ago

@MasslessParticle I'm pretty sure I can still reproduce the issue on one of our internal servers, so yes, I would definitely be up to help testing.

slim-bean commented 2 years ago

@zlepper quick question, how does your file rotation work on the affected servers? Is it the process that generates the files that does the rotation or is there an external process which does it?

zlepper commented 2 years ago

@slim-bean it's the same process. Specifically, it creates the file, writes some stuff, and at some point discovers that the file is over a configured size threshold, at which point it closes the file and moves on to a new one. After it has moved to the new file it lists the active log directory and then attempts to delete older log files until the file count is below a configured limit.
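
As a rough sketch of the deletion half of that rotation scheme (hypothetical directory, retention count, and pruneOldLogs helper, purely to illustrate where the failure occurs):

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "sort"
)

// pruneOldLogs deletes the oldest *.log files until at most keep remain,
// mirroring the rotation described above. The os.Remove calls are what fail
// (or leave the file in DeletePending) when another process holds a
// non-sharing handle on the file.
func pruneOldLogs(dir string, keep int) error {
    names, err := filepath.Glob(filepath.Join(dir, "*.log"))
    if err != nil {
        return err
    }
    sort.Strings(names) // assumes timestamped names sort oldest first
    for len(names) > keep {
        if err := os.Remove(names[0]); err != nil {
            return fmt.Errorf("deleting %s: %w", names[0], err)
        }
        names = names[1:]
    }
    return nil
}

func main() {
    if err := pruneOldLogs(`C:\logs`, 5); err != nil {
        fmt.Println(err)
    }
}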

MasslessParticle commented 2 years ago

@zlepper I've got an experimental branch with a change to tail that should detect a file in a DeletePending state and close it. I've tested it on my Windows machine to verify that it doesn't break anything, but I can't reproduce the issue, so I don't know if it actually fixes anything.

https://github.com/grafana/loki/compare/main...tpatterson/promtail-windows-experiment

Are you set up to build Go projects?

zlepper commented 2 years ago

> @zlepper I've got an experimental branch with a change to tail that should detect a file in a DeletePending state and close it. I've tested it on my Windows machine to verify that it doesn't break anything, but I can't reproduce the issue, so I don't know if it actually fixes anything.
>
> https://github.com/grafana/loki/compare/main...tpatterson/promtail-windows-experiment
>
> Are you set up to build Go projects?

Yes, I can set up and build locally; I will try your branch and see what happens.

zlepper commented 2 years ago

@MasslessParticle Of course I can't reproduce the original locking issue at all right now :/ Just wanted to verify before I tested the branch you provided. I will keep trying during the week, just so you know I haven't forgotten.

MasslessParticle commented 2 years ago

@zlepper Thank you!

tiger5226 commented 2 years ago

@zlepper / @MasslessParticle We are currently experiencing this problem in production, causing logs to back up on servers. We can deploy the build and test it out as well. It is a continuous problem, so it should reproduce pretty quickly.

MasslessParticle commented 2 years ago

@tiger5226 Any testing would be much appreciated!

tiger5226 commented 2 years ago

Hi Team,

The issue persists with the branch, @MasslessParticle.

We have had it running in production for a few days now. This morning we noticed that the log folder size was increasing from its starting point. So we tested a restart, and sure enough, upon restart the log files were removed. So it is still holding on to them, even with the changes.

As a temporary mitigation we are going to see if we can run Promtail under the Task Scheduler and basically restart it daily, to see if this reduces the build-up of log files. The chief concern is that the files build up so much that the server runs out of disk space. This has already happened to us in production, which is why I found this issue thread.

Any other suggestions? Happy to help out here, e.g. if you come up with a debug build.

MasslessParticle commented 2 years ago

@tiger5226 Interesting, I wonder what's unique about your environment. Can you check whether the file is in a DeletePending state when it's stuck?

tiger5226 commented 2 years ago

OK, before putting the auto-restart in, we will check for the DeletePending state.

tiger5226 commented 2 years ago

@MasslessParticle We let it run for a couple of days and checked the state of the files with Procmon. They are indeed in the DeletePending state. When we restart the binary, they get deleted.

tiger5226 commented 2 years ago

Another interesting note: I do believe the environment may have an impact here. Even without this PR, there are machines that do not have this problem; it is not universal. Before you ask, we tried to find a commonality, but the machines are nearly identical and the software producing the log files is the same as well. It's a bit of a needle-in-a-haystack search for a meaningful difference. If you have any clues around this specific issue, I am happy to investigate further.