krzysztofslusarski / continuous-async-profiler

Spring boot library for continuous profiling with async-profiler
Apache License 2.0
30 stars · 5 forks

Safe compress/delete in multi process environment #30

Open michaldo opened 11 months ago

michaldo commented 11 months ago

Consider case when application is running on Kubernetes. Number of pods may vary

With output file prefix equal to pod id, pod output files will not collide. However, it is possible that 2 pods start a compression or deletion at the same time and they may modify the same files. I think some kind of locks is required.

The File API has locking, but I have heard its behavior is OS specific. I'm also afraid that a File API lock may not work on cloud volumes. Do you agree? Another option is to use a file like /.lock as a semaphore. Can you share a best practice for synchronizing Java processes over a semaphore file?
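For reference, a minimal sketch of the File API locking mentioned above, using `java.nio.channels.FileLock` via `tryLock`. The class and method names here are illustrative, not part of the library; and as noted in the thread, `FileLock` semantics are OS specific and may not hold on network/cloud volumes:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MaintenanceLock {
    // Hypothetical helper: run compression/deletion only if this process
    // can grab an exclusive OS-level lock on a shared lock file.
    public static boolean runExclusively(Path lockFile, Runnable maintenance) throws IOException {
        try (FileChannel channel = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            // tryLock returns null when another process already holds the lock
            try (FileLock lock = channel.tryLock()) {
                if (lock == null) {
                    return false; // another pod is compressing/deleting right now
                }
                maintenance.run();
                return true;
            }
        }
        // The lock is released automatically when the channel closes,
        // including when the holding process dies.
    }
}
```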

krzysztofslusarski commented 11 months ago

The easiest way to avoid collision is to set:

To different paths. The path can contain the pod id, and that solves the problem, I believe.
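As an illustration of the per-pod path idea, something like the following could work. The property key below is an assumption, not the library's documented configuration (check the project's README for the actual key); the `HOSTNAME` part is standard, since Kubernetes sets `HOSTNAME` to the pod name by default:

```
# Hypothetical configuration; the exact property key used by
# continuous-async-profiler may differ.
async-profiler.continuous.output-dir=/profiler/${HOSTNAME}
```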

Any locks on a filesystem are hard to manage. You need to find a solution for "the node that held the lock was killed" and so on. If you really need the functionality to save files to the same directory from multiple JVMs, then I believe this is a better solution:

The library is aware of files that are generated:

        return String.format(
                "jfr,event=%s%s,file=%s/%s-%s.jfr",
                event,
                additionalParameters,
                notManageableProperties.getContinuousOutputDir(),
                event,
                date
        );

We can add an in-memory (concurrent) collection that stores the files generated by a single JVM. On that basis we can compress/delete/move them to the archive.
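A minimal sketch of that idea, assuming a thread-safe set keyed by file path; the class and method names are illustrative, not the library's actual API. Each JVM compresses or deletes only the files it registered, so pods sharing a directory never touch each other's output:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class GeneratedFileRegistry {
    // Thread-safe set of files written by this JVM only
    private final Set<Path> generatedFiles = ConcurrentHashMap.newKeySet();

    // Called whenever the profiler writes a new output file
    public void register(Path file) {
        generatedFiles.add(file);
    }

    // Delete only the files this JVM created; the iterator of a
    // ConcurrentHashMap key set is weakly consistent, so removing
    // entries while iterating is safe.
    public void deleteOwnFiles() throws IOException {
        for (Path file : generatedFiles) {
            Files.deleteIfExists(file);
            generatedFiles.remove(file);
        }
    }
}
```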

michaldo commented 11 months ago

The easiest way to avoid collision is to set (output dirs) to different paths. The path can contain the pod id, and that solves the problem, I believe.

It is a wrong idea to have a directory per (temporary by nature) pod. There will be plenty of directories, hard to manage and hard to select by time. When a pod is deleted, nobody will clean up its output files, because its directory will no longer be assigned to any live pod.

You need to find a solution for "node that held the lock was killed" and so on.

That should not be so hard; the lock may have a time limit. I bet this problem is already solved, but I could not find a solution.
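The time-limited lock mentioned above could be sketched as a lease: a lock file whose age bounds how long a crashed holder can block others. This is a hedged illustration with hypothetical names, not the library's code, and it relies on `Files.createFile` being atomic, which may not hold on some network filesystems:

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

public class LeaseLock {
    // Try to take the lock; a lock older than ttl is treated as stale
    // (its holder was probably killed) and is broken.
    public static boolean tryAcquire(Path lockFile, Duration ttl) throws IOException {
        try {
            Files.createFile(lockFile); // atomic: fails if the file already exists
            return true;
        } catch (FileAlreadyExistsException e) {
            Instant modified = Files.getLastModifiedTime(lockFile).toInstant();
            if (Instant.now().isAfter(modified.plus(ttl))) {
                // Lease expired: break the stale lock and retry once.
                // If several pods race here, createFile still admits only one.
                Files.deleteIfExists(lockFile);
                return tryAcquire(lockFile, ttl);
            }
            return false; // lock held by a live holder
        }
    }

    public static void release(Path lockFile) throws IOException {
        Files.deleteIfExists(lockFile);
    }
}
```

The ttl must be longer than the slowest expected compression/deletion run, otherwise a slow but healthy holder gets its lock stolen.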

Anyway, for now I think the best option is to leave compression and deletion unsafe. In the worst case some profiler output files will be broken, which is acceptable.