Run cleaner concurrently also inside one archive

pmoravec commented 1 year ago

sos report --cleaner cleans the generated report sequentially using one thread, what prolongs its already "naturally slow" execution. We should run it concurrently, similarly like sos report executes plugins.

I see two approaches here:

run individual parsers (IP, MAC, username,..) concurrently, with a lock on each file they update. This is required to prevent concurrent modification of the same file by multiple parsers.
or for each parser running sequentially, clean files in parallel

Currently, cleaner iterates over all files and runs individual parsers sequentially for one file after another. That is quite away of either above approach.

Also, sos clean --jobs=4 declares concurrency on individual archives that can be cleaned in parallel. This additional concurrency (with assumed 4 workers as default) should be aligned with that one - we should not end up with 4*4 concurrent workers running in parallel.

Despite there are the above obstacles to the proposal, I think the justification for the improvement is solid: the use cases when cleaner is run on a sequential archive dominate and the improvement would speed them up significantly.

TurboTurtle commented 1 year ago

I took a stab at moving our concurrency to process-based a while back and the direction I took there was to try and assign each archive to a specific process. There were a few challenges there - in no particular order:

We need to keep the parsers sync'd across processes, so that obfuscations are kept consistent across both archives and files.
Because of this, we need to use shared memory objects for holding those obfuscations. For multiple archives this is fine, but in the case of single archive cleaning it actually significantly harms performance.
Logging is performed via the archive abstractions - but logging streams are not able to be serialized and so the pickling that occurs when forking out to a new process fails. There was/is some ambiguity here as newer versions of python don't seem to have the same problem (the new process initializes without issue) but core python maintainers have stated that shouldn't be happening.

That being said, I'm not of the opinion that we have to have parallelism at the archive level. If we instead move to file-level parallelism, that's fine - we just need to be able to report when we've finished the last file in an archive and when we've started a new archive's file list.

This also doesn't have to be with a specific library, but it does have to be available for python-3.6 for our downstreams that use that version.

NikhilKakade-1 commented 1 year ago

@TurboTurtle @pmoravec , any new developments to fix it?

pmoravec commented 1 year ago

@TurboTurtle @pmoravec , any new developments to fix it?

Nothing I am aware of..

NikhilKakade-1 commented 10 months ago

@TurboTurtle @pmoravec , any update? Is 3262 some how related to it ?

TurboTurtle commented 10 months ago

@NikhilKakade-1 the short update is - no update.

I've since left RH and my time to dedicate to sos is as such much reduced. Before I left however, there was some discussion on moving to a queue-based systems for obfuscations. The basic idea is that we'd have multiple processes (as opposed to threads) spawned which trawl through an archive/file looking for matches via the parsers. On a match, an item is added to a queue running in the main thread which handles the obfuscation, and then returns the obfuscated match to the process waiting for it.

While mostly straight forward in theory, this is a major change to implement and test. I am not sure when (or if) I'd have the time available to really dedicate to this effort myself in the near future. We (as in the sos project) would likely need resource(s) from our commercial sponsors (namely RH/Canonical) to be able to get this kind of work hammered out in a reasonable timeframe.

NikhilKakade-1 commented 9 months ago

OK, Thanks !!

@pmoravec, @arif-ali anything from your side on the above comment?

pmoravec commented 9 months ago

OK, Thanks !!

@pmoravec, @arif-ali anything from your side on the above comment?

No promise here. We might have some more resource for sos development in general, but esp. for such major change, I dont want to raise expectations we dont need to meet.

arif-ali commented 4 months ago

OK, Thanks !!

@pmoravec, @arif-ali anything from your side on the above comment?

I would love to see better parallelism and concurrency, but I don't have bandwidth to take up on anything like that. But, as it has been raised this would require significant change

ijajmulani commented 3 months ago

any update on this issue. I'm also facing is issue.

pmoravec commented 3 months ago

any update on this issue. I'm also facing is issue.

Apart of some sporadic discussion at https://github.com/orgs/sosreport/discussions/3476 , no update. No final consensus on the way of implementation (Jake's original idea is followed by two other ones), no work force for it.

NikhilKakade-1 commented 2 months ago

@TurboTurtle / @pmoravec , To address this, I've been contemplating the implementation of a wrapper script.

The script starts by retrieving a list of enabled plugins using the sos report -l command. It then proceeds to execute each plugin individually, utilizing the -o flag along with the --clean option, allowing for concurrent execution. This approach facilitates direct cleaning of the files collected by each plugin, potentially streamlining the data collection and obfuscation process. With this setup, there's no need to wait for the completion of each plugin before initiating the cleaning process. Additionally, dividing the process into multiple plugins enables parallelism on a batch basis, further enhancing efficiency. Afterward, the resulting multiple tarballs can be consolidated into a single tarball. However, progress is currently hindered by an issue identified as #3433.

I'm looking forward to receiving your inputs and suggestions

Thanks

TurboTurtle commented 2 months ago

That wouldn't be a design that gets picked up by us. To many moving parts, and the reassembly of potentially dozens of archives will, by nature, always be fragile.

Also, as you note, the obfuscation of an archive is based on the contents of the archive. If we want to obfuscate hostnames, we need to know the hostname of the system so we can identify other hostnames we would also want to obfuscate based on the FQDN - and that requires either the information to be present in the archive, to be passed by the user at runtime, or for that information to already be recorded in an obfuscation map in /etc/sos/cleaner/default (or similar saved mapping location). This doesn't lend itself well to wrappers that attempt to fragment the workflow.

I still believe that the best path forward is what I describe, admittedly at a high level, in the discussion on cleaner performance - we should leverage the native concurrency features of python across multiple processes instead of threads.

To be blunt, this work needs to be focused on by someone who is very familiar with process-based concurrency in python and who is able to commit full-time development to it for a decent stretch of time. I myself have not been able to do so since leaving Red Hat.

NikhilKakade-1 commented 2 months ago

I think we can have the require file/cmd output at runtime with --clean, https://github.com/sosreport/sos/issues/3433#issuecomment-2048305532

pmoravec commented 2 months ago

I still believe that the best path forward is what I describe, admittedly at a high level, in the discussion on cleaner performance - we should leverage the native concurrency features of python across multiple processes instead of threads.

To be blunt, this work needs to be focused on by someone who is very familiar with process-based concurrency in python and who is able to commit full-time development to it for a decent stretch of time. I myself have not been able to do so since leaving Red Hat.

What about the alternative approach I suggested in the discussion? The processes-based concurrency approach might be most elegant solution but we strive finding a volunteer for it for approx one year, unsuccessfully. Implementing my approach should be much easier (please correct me if my estimation is wrong) and final solution can be still acceptable as a long term solution. Well, with two potential gotchas:

1) some UX regression with moving from obfuscated strings like host7 to strings like host-4B9F4B0B that is less human readable 2) Using salt in clustered environment

I don't insist on this idea. I am in for whatever is feasible from main angles (sufficient solution, robust enough, somebody who will implement it, reasonably user-friendly). I am just requesting some assessment or discussion.

NikhilKakade-1 commented 2 months ago

https://github.com/sosreport/sos/blob/main/sos/cleaner/parsers/__init__.py#L97

The time complexity of the provided function becomes:

Worst Case: O(k n m) Best Case: O(k * m)

In my opinion, optimizing the data masking process is essential. In terms of computational complexity, the worst-case scenario is O(k n m), where k is the number of lines, n is the number of compiled regexes in self.mapping.compiled_regexes, and m is the average length of the lines. Conversely, the best case is O(k * m).

I've been mulling over a few questions.

Should we consider periodic cleanup of mapping files as the number of compiled regexes increases over time, potentially impacting total processing time?
What factors should be considered when deciding between processing find and replace operations line by line versus in batches?

pmoravec commented 2 months ago

The complexity theory approach reminds me my university studies :)

We dont have any cleanup of mapping files. It can have sense, but how/when to decide cleanup should be done? Every month? When mapping is bigger than X records? Remove selective records when they have not been found&replaced in past Y cleaner runs? Each heuristic has its pros and negs. Anyway that is independent topic (that I am definitely not against it) to the limited cleaner concurrency.

Anyway, the number of compiled regexes is usually the much much smallest number in that equation. We might better try preventing cleaners to touch files we know as "safe" (https://github.com/sosreport/sos/blob/main/sos/cleaner/__init__.py#L775-L780), and optionally skip processing files if no parser is applicable. As e.g. can /var/lib/selinux contain any sensitive info? Or /sys/fs/cgroup, /proc/sys or /sys/kernel? I expect some of the dirs can contain some sensitive info, I am not an expert on either, and we should be rather conservative than risk we skip obfuscating anything here, but maybe this is also a room for improvement..? Again, topic independent on the original "make cleaner (more) concurrent".

TrevorBenson commented 1 week ago

I've been mulling over a few questions.

Should we consider periodic cleanup of mapping files as the number of compiled regexes increases over time, potentially impacting total processing time?

Yes? However, retaining the same obfuscation is very useful in practice.

Removing selective records not seen in a long time as @pmoravec mentions is one option that could be viable. This presents separate issues, however, like tracking the "temperature" or last-seen timestamp for each obfuscation. This should only be added as a non-default option, allowing the choice of when to enable a cleanup.

Separately

I like the deterministically computed hash approach @pmoravec discusses in #3476_comment-8501581, especially if a solution to the problem of using separate salts per host in a single collection is solved. As mentioned, without solving this it presents a problem for a general analysis of separate archives in an sos collection.

The difficulty can also increase exponentially when analyzing cluster configurations, especially for clusters that continue to grow the number of total members. A different salt between separate collections would also make it quite difficult to compare collections from different times against one another, which can also be useful in practice.

pmoravec commented 1 week ago

using separate salts per host in a single collection is solved.

Assuming salt is somehow stored on each host (and just not shared in sosreport, of course), sos collect --clean can read the salt from each host before (or after?) collection of the sosreport archive.

This still does not solve use cases like:

"generate sosreport on one host, then run cleaner on another", where salt from another host is used to sosreport from the first host
sos collect --clean fetches a sosreport from a host that has not generated salt yet as no cleaner has been run there, so a local (new?) salt is used for that sosreport; later on the host runs sos clean that generates a new different salt..? (this has solution: when collector detects a host does not have a salt, it generates one and saves it for the host)

There can be other scenarios I haven't imagined, which makes the idea bit fragile. Thats the cons, compared to the Queue() idea described in https://github.com/orgs/sosreport/discussions/3476#discussioncomment-8225987 . Pros is much simplier implementation.

I would agree with either approach (if we are certain it has no gotchas and have somebody who can dedicate time to implement it). I dont need my idea to be implemented, I am very pragmatical - let implement anything we can and agree on.

sosreport / sos

Run cleaner concurrently also inside one archive #3097

Separately