Refactor file execution modules

smarsching commented 7 years ago

The file execution module has different implementations for Windows and non-Windows systems. In the following, the term POSIX will be used for all non-Windows systems, even though some of them might not strictly adhere to the POSIX specification.

Status quo

At the moment, the file module provides the functions for POSIX systems and the win_file module for Windows systems. On Windows systems, file is actually an alias for win_file. However, the win_file module actually uses functions from the file module. As some of these functions depend on other functions from the file module, but these functions are actually overridden by the win_file module, the win_file module does not simply call the functions in the file module. Instead, it selectively imports some of the functions, wrapping them with a new context so that they effectively run in the context of the win_file module.

While this concept works, there are a few drawbacks:

The win_file module has to be aware of internal implementation details of the file module. For example, when a function from the file module internally calls another function that is private to the file module, the win_file module still has to import this function (possibly also wrapping it in a new context).
There is no clear separation between shared and POSIX-specific code because both reside in the same module.
The code in the win_file module is not very clean and only readable for experts that understand the logic behind binding the functions from the file module to a new context.

These drawbacks are not just of a theoretical nature, but have lead to actual bugs (e.g. #39883).

Proposal

The logic for the file module should actually be broken into three parts:

A posix_file module that provides file on POSIX systems.
A win_file module that provides file on Windows systems.
A library module that provides functionality that is shared by both modules (e.g. logic for comparing and updating file contents).

For implementing the shared libary, providing an abstract class seems like the most reasonable approach. The platform-specific modules can then both have more specific classes that implement the platform-specific method and - if needed - override methods in order to adjust them to platform specifics.

This new structure achieves several goals:

Shared and platform-specific code is clearly separated.
We get rid of the nasty context-binding in the win_file module.
Internals are hidden and there are clear extension points for using platform-specific features from shared code.

lorengordon commented 7 years ago

Ch3LL commented 7 years ago

@twangboy seems this is a feature request but I would like to get your opinion on this. Is this something we would like to do? This would change the way we approach the setup of this module.

twangboy commented 7 years ago

@Ch3LL In the issue he mentions in the above feature request I suggested putting shared functions in a salt util. That might be a good way to clean this up.

damon-atkins commented 7 years ago

The structure has been talked about before in the renaming the modules thread. Something along the lines of <module>_share or <module>_all <module>_win <module>_posix <module>_macos <module>_shareis used by the other modules and contains common code.

I see two ways of doing this (others may see more ways): 1 _share or _all which is the core code, which loads platform specific functions e.g. salt_setacl form . And every _ must implement salt_setacl even if it does noughthing. 2 loads _share

In terms of file in general It should use mmap to read any file, and write any file line by line. This would allow salt to support large files. Parts of the current code read files into memory, no matter what the size is.

"File" should not be directly changed, they should be written out to a tmp file and then renamed, this avoids issues like disk space running out and having half files. Before the file is replaced it should be backed up/versions in a salt backup directory.

"File" should perform all operations for "a" target file in one hit. i.e. accumulated the required changes from various adjustments in different state files.

"File" should support ordering of operations against a target file

"File" needs to track files it manages. No file operations should happen from any state/execution module with out "File" knowing about it. e.g. pick up one state wants "\r\n" and another wants "\n" or that a module wants to manage the file and report an error. Maybe allow modules in virtual() to notify "File" that they manage a file and indicate if its exclusive or open relationship.

Their are two end of line markers "\r\n" and "\n" . If their is a source file e.g. template it should determine the line feed. But it should also be possible to select platform default or force it or use existing with a select fail back.

Determine the file default end of line maker should be seeking to the end of the file -2 and reading the last two characters in the file.

smarsching commented 7 years ago

@damon-atkins Thanks for your comments. It really makes sense to have a common scheme for dealing with platform-specifics. So we should try to use the same style for the file module that is also used for other modules. Do you happen to have a link to the "renaming the modules thread" for future reference?

Regarding line endings, I think that #14135 gives some good ideas. In particular, I agree with the idea that it should be possible to use the same templates for Linux and Windows minions and do the conversion in the file module. Another thing that maybe should be considered at the same time is conversion between different encodings. I think that most systems use UTF-8 nowadays, but I would not be surprised to find legacy applications that might still use something like ISO 8859-1 or CP-1252. Of course, it must also be possible to disable all conversions so that binary files can be managed in a reliable way.

Regarding mmap vs regular I/O, I still maintain my opinion that for the tasks handled by the file module, mmap does not give us any benefits. The current implementation using file.readlines() (or similar code) in a number of places obviously is a bad idea because the file has to be completely loaded into memory. However, reading chunks of data (e.g. something like file.read(4096)) and processing these chunks should be fine.

In #39882 you wrote that "Python is not multi-threaded". I answer here because I want to keep the discussion in #39882 specific to the PR. Python in fact is multi-threaded (please refer to the documentation of the threading module for details). It is true that due to the global interpreter lock (GIL), it is not possible for two threads to execute Python code at the same time. However, this limitation only applies to specific implementations of Python. CPython and PyPy have a GIL while IronPython and Jython do not. Even with an implementation that has a GIL, using threads has advantages because code implemented in C can actually release the GIL when waiting, thus enabling another thread to run. This is particularly useful when dealing with I/O because in such a scenario the total time needed for a certain task to finish is often dominated by the I/O subsystem and not by the CPU and thus being able to continue some tasks while waiting on the I/O for others can improve performance significantly. For this reason, I think that thread-safety should always be kept in mind, even when writing Python code.

It might be that Python's file abstraction layer already deals correctly with I/O errors when using mmap. I just know that mmap is a PITA when using it in C/C++ because of the way how errors are handled. When using the regular read / write operations, an I/O error will simply result in a corresponding return code (an exception in Python). With mmap however, an I/O error results in the kernel sending a SIGBUS which normally results in the process being killed. So to handle such an error gracefully, one has to install a custom signal handler and use sigsetjmp / siglongjmp for recovery. This is extra fun when dealing with threads because signal handlers are per process and not per thread, so you if you do I/O in library code, you cannot assume that your code is the only one wanting to catch SIGBUS...

What I am trying to say is that in my experience using mmap usually is not worth the trouble. I learned about all these nasty details when I had to use mmap for communicating with a device with a kernel driver that only allowed access with mmap. As there is no benefit in performance (refer to these benchmarks for details), avoiding mmap is a good choice in my experience.

lorengordon commented 7 years ago

Salt currently uses mmap in the file.replace function. (I implemented that change over a year ago, I believe.) Other than handling of zero-byte files, I'm not aware of any issues that have cropped in as a result that required special-casing.

smarsching commented 7 years ago

@lorengordon Thanks for your input. If mmap-based IO works reliably in Python, I guess there is no reason to not use it.

Do you happen to know how Python’s mmap module handles large files (too large to fit into the virtual address space) on 32-bit systems? Unfortunately, the documentation does not say anything about this and naively I would expect that it cannot handle such files because they cannot be mapped into the virtual address space.

lorengordon commented 7 years ago

@smarsching seems likely that mmap would still be subject to 32-bit limitations. Hopefully if people are working with files over 1GB or so, they are able to use 64-bit python on a 64-bit system. This writeup is pretty decent:

http://archive.oreilly.com/oreillyschool/courses/Python4/Python4-15.html

The tradeoffs seem worth it to me to use mmap, at least for file.replace, as multi-line search/replace functionality is a real pain when using chunking and the pattern crosses a chunk boundary.

I can see how it might be sub-ideal for file.managed, which is more often dealing with the transferring of large binary files. Perhaps it would make sense to chunk files for transfers (possibly nice for parallelizing the transfer too), but use mmap when reading a file for templating (which is likely to be smaller files?). Or something along those lines.

damon-atkins commented 7 years ago

If your just copying a file then yes 1MB reads and writes are best. If you need to be able to edit large files mmap pages bits of the file into memory and does larger size read/writes than stdlib read/write . e.g. its like swap

Currently lot of the code reason a complete file into python ram, and does not even check the size of it first.

mmap is better if your reading a file, and need to read behind the current file position. I was talking about just using it on the read side. Should not need to use it on the write side. Salt should not edit files in place. i.e. it should write a new file, set perms on the new file and move it over the top of the old file. (unless the file does not exist in the first place)

32bit Python has 32bit limits

smarsching commented 7 years ago

I agree with you that implementing search and replace is much simpler when using mmap-based I/O. I like @lorengordon’s idea of using block-based I/O for simple copy operations (e.g. file.managed) and using mmap-based I/O for more complex tasks (e.g. in file.line). For these operations, the ~2GB limit on 32-bit systems should not be a problem because those functions are only useful on text files anyway.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

damon-atkins commented 6 years ago

.

stale[bot] commented 6 years ago

Thank you for updating this issue. It is no longer marked as stale.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

mchugh19 commented 4 years ago

Not stale

stale[bot] commented 4 years ago

Thank you for updating this issue. It is no longer marked as stale.

saltstack / salt

Refactor file execution modules #39960

Status quo

Proposal