microsoft / lisa

LISA is developed and maintained by Microsoft, to empower Linux validation.
https://aka.ms/lisa
MIT License
185 stars 164 forks source link

Proposal on LISA improvements (was The Future of Lisa) #2782

Closed avylove closed 11 months ago

avylove commented 1 year ago

This is derived from a document I wrote, but since that wasn't accessible to all who may find it useful, I figured putting it here made more sense where it can be tracked and actioned.

TL;DR

Due to low code quality, failure to adhere to standards, and architectural issues, LISA requires a great deal of work to reach a minimum level of maintainability. Without this effort, LISA is prone to bugs, regressions, and escalating maintenance costs. Some effort had begun to improve the current version, but getting these changes integrated has been slow due to inadequate test coverage and fear of introducing change to a fragile codebase. At this point those efforts are largely stalled.

While stabilization improvements may be achievable with the current code base, it will be faster to start fresh, incorporating lessons learned from the previous iterations. It is also possible to significantly reduce the amount of code Microsoft maintains by leveraging other Open Source projects, such as Ansible. It is recommended to create a requirements document for LISA and evaluate Open Source options against these requirements before proceeding with a replacement.

Background

This document came about as a result of the author attempting to ramp up to and contribute to the LISA project. Finding limited documentation and low code quality, the author initially thought the issues could be resolved with minimal to medium effort by applying standard linting, code cleanup, best practices and improving documentation. However, early in this effort, it was clear this would be a much larger task due to a lack of test coverage, undocumented code, and design issues. Continuing down this path would take years due to the extremely high level of technical debt taken on by the project.

Business case for this document

A naive observer may look at the LISA project and assume because it is being used it is both functional and adequate. However, IT organizations are concerned about much more than if software is functional at a specific point in time. Two primary concerns for IT organizations are efficiency and responsiveness.

Efficiency not only accounts for the amount of compute resources needed to run software, but also includes the cost of developing the software, the cost of maintaining the software, the cost of learning to use the software, and the human cost of using the software. While the cost of learning to use and actually using LISA are much higher than necessary, we’re primarily focused on the cost of developing and maintaining the software, or CoC (Cost of Change).

Responsiveness is not only the responsiveness of the running software, but the responsiveness of the organization to change. Change may be required due to other technical changes or customer need. Poor design and high technical debt increase the CoC (Cost of Change). This, in turn, leads to diminishing responsiveness. This is illustrated in the sample chart below.

image

The graph shows CoC (Cost of Change) in a hockey stick pattern. This is because the accumulation of technical debt accelerates as maintenance becomes more expensive and gets deferred. Extrapolating this you get to a state where code is not maintained at all for fear of introducing unexpected changes. This is sometimes referred to as a death spiral.

The LISA project may have already entered a death spiral. This is clear from the long lead time for basic maintenance and aging dependencies. There are a few examples later in this document such as Support for Python 3.11 which, despite requiring simple changes, took 9 months to implement and was exacerbated by multiple layers of technical debt. It’s also evident in this PR for updating a year-old CI dependency where, after being stalled for 3 months waiting for review, the primary maintainer of LISA stated “we don't have bandwidth to test regressions” and suggested adding additional technical debt rather addressing the issues.

Essentially, we have reached a point where the current iteration of LISA is not getting basic maintenance and requires a significant increase in resources or risks atrophying to a point where we will no longer be able to address customer needs.

The current state of LISA

An initial review of LISA found a number of issues which led to higher numbers of bugs, security vulnerabilities, reduced performance, a steeper learning curve, increased maintenance costs, and increased development costs.

These issues are outlined in the following sections:

Non-standard and incomplete packaging

Python packages are self-contained archives with metadata describing the software and dependencies. Lisa is partially packaged, but due to an atypical layout, two directories, examples” and “microsoft”, fall outside of a packageable area. The “microsoft” directory, in particular, contains runbooks and code critical for LISA’s operation. To work around this, LISA must run out of a cloned git repo. This requires additional steps of maintaining a git submodule, cloning the repo, checking out the desired branch, installing as an editable package, and using flags to enable deprecated install behavior to ensure the LISA source files are found in the Python path.

In order to resolve the layout issues, code needs to moved out of these extra directories and the runbook path behavior needs to be modified so the runbook files included in the extra directories can be accessible from standard install locations.

Once these change are made, LISA, can be published to PyPI and installation will be as simple as pip install lisa. This also unblocks proper packaging of the internal repo, which can be published to an internal ADO Python package index, reducing installation to adding the index URL and running pip install .... Not only will this simplify the user experience, it will speed up installation, and prevent failures once the deprecated behavior being leveraged is removed from pip.

There were additional issues with packaging which have been largely addressed by lisa #2353 and a corresponding internal change. These included:

Limited test coverage

LISA has limited unit test coverage (26.64%) and fragile code. This requires changes to be manually tested for each PR. Manual testing is error prone, limited in scope, and slow. This not only increases costs, but also the time required to make changes and the likelihood of regressions. The current state reduces confidence in pushing changes, resulting in delays accepting changes. This further increases time and costs because code must be rebased multiple times.

Increasing test coverage is mainly a matter of putting resources to writing tests, but, since LISA was not written with testing in mind, testing some parts of the code may be more challenging than necessary. Writing code with testing in mind involves keeping components independent with well-defined interfaces and reasonable defaults to readily be able to use them with stubs. This is generally good software design independent of testing.

All these issues compound when implementing changes that touch multiple parts of the code. For example, a PR to resolve a security vulnerability has been sitting unmerged for a month for fear it will unexpectedly break something. With adequate test coverage, this would not be an issue.

Code quality

Standard linting was only added to LISA in November 2022 and the majority of the checks needed to be disabled because they were failing. While, in some languages, linting is purely stylistic, since Python is an interpreted language, linting is equivalent to compile-time checks for compiled languages. It is used to find bugs, security vulnerabilities, and programmatic errors. Many of the issues listed below were identified thought standard linting. Others were found through reviewing the code and documentation.

Minor to Medium effort to resolve

The items listed in this section can be resolved without invasive changes to the code.

Some issues listed have been resolved or have an open pull request. These came about as part of the review for this document and have been submitted by the author. They are included here for completeness. For these cases, the pull request is noted.

Use of deprecated modules (Fixed in 2538)

Some logic was imported from a standard library module slated for removal.

Redefinition of build-in objects (Fixed in 2507)

Built-in class and function names were used for generic variables. This can cause unexpected behavior when one attempts to use the built-in after it has been overwritten.

Use of an undefined or conditionally defined variable (Fixed in 2498)

There are cases where a variable is used before it is assigned or may not be assigned in all cases.

Security Vulnerability: Use of eval (Fixed in 2514, submitted Jan 9, merged Feb 28)

The eval() function in Python is used to interpret a string object containing Python code at runtime. Seven cases were identified where eval() was used on unsanitized user input. This allows a user to execute arbitrary Python code from within LISA, with access to any resources available to LISA.

Modification of a list while iterating over it (To be fixed in 2561, submitted Jan 10)

If a list is modified while it is being directly iterated over, unexpected values can be returned from the list

No timeout specified for HTTP requests (To be fixed in 2539, submitted Jan 19, merged Fed 22)

Without a timeout specified, unresponsive HTTP requests can hang until the process is killed externally

No support for Python 3.11 and higher (To be fixed in 2557, submitted Jan 25, merged Apr 4)

Python 3.11 went GA in October 2022 but has been available for testing since October 2021. There was an effort by Anthony Shaw (Python Software Foundation Fellow and Microsoft Python advocate) to add support to LISA in July 2022, but it fizzled out after hitting a non-obvious error. The PR was closed 4 days after 3.11 went GA without merging the required fixes that were already identified.

The non-obvious error was due to an improperly implemented enum which leveraged a glitch corrected in Python 3.11. Mypy would have caught the issue, but the issue was masked due to improper typing.

Python 3.11 is a particularly important release to the community and Microsoft. Microsoft hired the creator of Python, Guido van Rossum, in November 2020 to work on improving the performance of Python. Python 3.11 is the first release of this multi-release effort. Python 3.11 is 10 to 60% faster than 3.10 and has a smaller memory footprint. Real world examples have shown CPU utilization cut in half simply by moving to Python 3.11.

Unnecessary Dependencies

LISA has a large number of external dependencies. Eternal dependencies are unavoidable in this type of program, but there are cases where packages are included for functionality that can be implemented with very little code. Including external packages makes sense for code that is difficult to implement, but unnecessary imports increase maintenance, install time, load time. Some Examples are listed below.

The use of the randmac package to create a random MAC address can be implemented with the single line f"x2:{':'.join(format(random.randint(0, 255), '02x') for _ in range(5))}".

The use of the python-dateutil package for date parsing rather than using the included datetime.datetime.strptime()

The use of the retry package rather than a simple context manager.

Unmaintained Dependencies

Standard engineering principles hold that dependency pinning creates an implicit contract with downstream consumers that the dependencies are monitored and maintained. For this reason, dependency pinning should be pushed as far downstream as possible. However, for various reasons, LISA pins dependencies in the development branch. These dependencies have gone unmaintained with some dating back to November 2021. This prevents LISA from taking advantage of security fixes, bug fixes, and efficiencies added to these packages.

Some recent effort has been made by the author to improve this by adding automatic dependency checking and PR creation. While this has been partially deployed, full deployment has been held up waiting to PRs to be merged. This is mainly due to inadequate repo management and limited test coverage.

Invalid implementation of __hash__()

The __hash__ method in Python is used to indicate uniqueness and should return a unique integer no greater than the size of sys.hash_info.width, typically the machine’s word size. __hash__() implementations in LISA returns strings, that are not always unique between instances and, at times, too long to preserve any uniqueness after being truncated to sys.hash_info.width.

Redefinition of variables in an outer scope

There are at least 21 cases where a variable has been given the same name as a variable in an outer scope. This can lead to bugs because the outer scope variable may be accessed without error before the inner scope variable is defined. Once the inner scope variable is defined, it can unintentionally mask the outer scope variable.

No encoding specified when reading or writing files

There are 26 identified cases where an encoding is not specified when reading or writing files. This can introduce compatibility issues between platforms. For example, documentation building was broken on Windows for this exact reason.

Various syntax and logic improvements

There are 276 additional cases of syntax and logic improvements identified through standard linting. Many more have been identified through additional analytic tools, such as Sourcery, and manual review. Improvements like these simplify logic, reduce computation time, reduce memory, and/or improve readability.

Major effort to resolve or rewrite needed

The items listed in this section require invasive changes to the code, extensive analysis, and/or breaking changes.

Incomplete or invalid class abstractions

There are 212 identified cases where a method in a parent class was abstracted but the subclass did not define the method. This points to invalid interface definitions and can be a source of bugs. Due to the large number of cases, automatic detection becomes ineffective.

Catching broad exceptions

There are 74 cases where a broad exception is caught rather than a specific exception. This is a common source of bugs because errors that would normally have been raised are caught and discarded, often in unrelated code. The best time to determine what exceptions should be caught is when the code is written since the developer will have context. Resolving these at a later point requires a developer to do additional analysis and research.

Cyclical Imports

There are 68 cases where a cyclical import was identified, primarily in the tools and operating systems modules. These are cases where an import depends on another import that depends on a previous import. These types of imports are very fragile and can result in race conditions where an object has not been initialized before it is accessed.

Duplicate Code

There are 60 cases where duplicate code was identified. This indicates the code has not been optimized. There are additional cases where nearly duplicate logic appears in multiple locations, though this is harder to detect across files.

Various memory and CPU inefficiencies

There are various places where memory and CPU are used inefficiently. Many of these cases are from unnecessarily keeping objects in memory longer than necessary or iterating over objects all at once rather than as needed. More complicated cases relate to the program’s architecture and involve creating objects that are larger than necessary or unnecessary at all.

Use of global statements

LISA has 8 incidents where global statements are used. These are cases where a variable in the global namespace is modified from a local namespace. While there are legitimate uses for global statements, they are extremely rare and often indicate an underlying design issue. To have 8 cases in a single program is extremely unusual. For a program of this size and complexity, it’s likely 0 to 1 of the globals are necessary. Further investigation is needed to determine the full scope of the issue.

Mix use of data classes, validation models, and general purpose classes

Data classes were introduced in Python 3.7 as an improvement over named tuples. They are intended as containers for typed data and, generally, have no methods.

Validation models (also known as schemas or data validation classes), are used to validate data received from an untrusted source. The most popular data validation libraries, Pydantic and Marshmallow, implement models as classes. The syntax is pure Python, but the structure is library-specific.

LISA has 142 data classes defined. Many are also used as validation models and general purpose classes. While this is supported, it unnecessarily increases complexity, and the way they are implemented prevents full introspection by linting and typing tools.

Incomplete plugin validation

LISA employs three different mechanisms for extending code: subclassing, abstract base classes, and Pluggy hooks. Not only is the number of methods confusing to downstream developers, but none of these methods perform plugin validation, so implementation errors won’t be presented until they cause a failure.

Compromised type definitions

Python is a dynamically-typed language, but use of static type checkers has become popular as a way of catching and preventing bugs. This requires type annotations to be added to the code.

LISA uses Mypy for type checking. However, when type annotations were added, the emphasis was put on ensuring Mypy passes rather than on correctness. This resulted in errors being broadly suppressed and objects being incorrectly typed.

Implementation like this makes type checking less useful. As an example, an improperly-typed object masked a bug which prevented LISA from running with Python 3.11

Lack of code documentation

The standard for Python inline code documentation is every class, function, or method should have a docstring explaining what it does, any expectations for the arguments, any limitations, and any additional information about return values that might be useful. Additionally, comments should be added to clarify non-obvious code or to break down larger blocks of code for readability. This goes back to the reality that code is read much more often than it is written, so it should be written to be read.

In contrast, LISA has little inline documentation, with some very large files completely devoid of comments and docstrings. This increases the time it takes to make changes, increases maintenance costs, and makes it more difficult for those unfamiliar with the code to contribute.

Limiting design

LISA has multiple areas where an attempt was made to decouple elements of the program, but it ended up creating additional coupling. An example is the concept of tools in LISA.

Ostensibly, tools are intended to abstract dependencies and syntactic differences between different platforms. While this can be worthwhile for specific tools, for general operations it adds complexity. For example, the basic activity of listing files is a method of the Ls class rather than a primitive that abstracts the implementation details. Whether this action is done by root or the logged in user is an argument to the method requiring redundant logic in most tool methods. This could have been implemented as a context manager, greatly simplifying the code.

LISA has a lot of other categories of objects, both internal and external. The purposes for these aren’t always apparent, either because there is overlap, little documentation, or because they use non-standard names (example: combinator vs matrix). It creates a steep learning curve for users and contributors alike.

Platform-specific components are not separated. This can lead to unexpected interdependencies, slows down development, and complicates testing. For example, the code to support Azure, AWS, and Libvirt are included in the main code base. The Azure-specific code, itself, has a large number of dependencies that can be inadvertently included in the main code base. Code like this is often kept in a separate repository, but could also be kept in an independent package within the same repository.

Open Source

Open Source software has many benefits, both for users and contributors. Some of these benefits are intangible, but many lead to reduced costs.

When open sourcing software, companies gain additional resources at no cost. Community members test software, identify bugs, provide pull requests, provide additional documentation, and expand capabilities. But to take advantage of this, a project must build a community. Open source is merit-driven, so communities will naturally form around quality software that provides a new capability or solves an existing problem in a novel way.

The current and previous versions of LISA have been open sourced under an OSI-approved license. However, LISA has not realized the benefits of community involvement, with almost all contributions coming from within Microsoft. This is primarily due to two reasons. The first is LISA does not provide a new capability. The second is LISA has low quality code and documentation. Open Source community members can’t justify investing time in LISA unless they are required to. This results in a small, largely disinterested community and little, if any, external support.

The inverse of this is companies can leverage existing Open Source projects with established communities to reduce their own development costs. LISA has done this to some degree by using community-provided Python modules for some capabilities. However, a large part of the LISA codebase reimplements functionality available in other projects. Not only does this require writing and maintaining new code, It means LISA is not taking advantage of the lessons these other projects have learned or the community contributions that deal with changes, security, performance, and useability.

User experience

Rather than a single configuration file, LISA is configured through both Python files and YAML files. Configuration should be in a single place. One problem preventing this in the current implementation is test suites combine configuration with additional logic. This additional logic could be in generic plugins, allowing all configuration data to be in a single place. The burden of two different types of files providing different parts of the configuration is confusing to users and increases the steepness of the learning curve.

Additionally, the documentation is incomplete, confusing, and has not been reviewed for grammatical inconsistencies. The names used are not standard and conflict with common terms. For example, what would normally be called a "matrix", a way of running tests against multiple input sets is called a "combinator" in LISA, a term usually associated with lambda expressions.

Maintainability

There are currently very few people, perhaps only @squirrelsc and @LiliDeng , who fully understand LISA’s design and implementation. Because it was not written with maintenance in mind, it would be difficult for anyone to be able to take over this work. In the event these resources were not available, the code would likely stagnate or be replaced.

As discussed in the Code quality section, Lisa is largely devoid of code documentation. This problem is compounded by current maintenance practices in which the majority of pull requests do not have descriptions. This means the only context preserved for these changes is the short pull request names and commit comments. Developers looking for context on why something was done a certain way have little information available to them. Not only is this frustrating, but it increases the development effort and can lead to code regression.

Alternatives

At its basic level, LISA is an orchestration framework for Linux system validation which can operate across multiple platforms. What it does is not particularly different from other orchestration frameworks, Ansible being the most popular. Ansible already comes with numerous modules, supporting all popular Linux distribution, cloud providers, and even Windows. It is also supported by Microsoft for use with Azure and used by other teams at Microsoft.

While some may think Ansible is strictly for deployment, use for validation is also common. Since LISA requires both deployment and validation, it seems well-suited to fill that gap. This video (7:05 for a good example) and the associated runbooks show how this might work. Essentially, desired state can be validated primarily using the Ansible’s built-in assert module. A callback plugin can be utilized to format and store the results as needed depending on the use case. Ansible is written in Python, has extensive developer documentation, and broad community support, so it is relatively easy to add additional capabilities.

If Ansible, or a similar framework, is utilized, it would need to be determined if it was a complete replacement or if it would live behind a thin wrapping interface. From a maintainability standpoint, it would be better if it lived on its own, but that may not fit the requirements.

The future of LISA

It’s clear, in its current state, LISA needs a lot of work just to reach a reasonable level of maintainability. At a minimum, it requires the following:

Some of this work has already begun, but is slowed by limited test coverage and fragile code. Changes are tested manually and review is limited to two people. Stabilizing the current code base would likely take years.

It’s unclear if LISA, as a framework, provides additional capabilities than those provided by other Open Source tools, such as Ansible. An investigation to see what gaps exist when using other tools should be carried out. Even if a wrapper is required, the reduced maintenance burden and extensive community make it worth exploring replacing the majority of LISA’s code with Open Source alternatives.

Ultimately, the changes needed to bring the current version of LISA into an expected level of maintainability and useability would likely require multiple breaking changes and take more effort than rewriting from scratch. Utilizing existing Open Source solutions, either as a complete replacement or for the majority of the custom code, will reduce development time substantially and reduce future maintenance costs.

In order to properly evaluate the available options, a requirements document should first be created to ensure all the required and desired capabilities are captured and agreed upon. This can largely be done internally, but it may also be worth engaging close partners to provide input.

Recommended course of action

  1. Create RFC 2119-style requirements documents to define requirements for LISA or its replacement
  2. Review Open Source options which could fully or partially replace LISA
  3. Identify gaps between Open Source solutions and requirements
  4. Depending on the results of the previous step a. If an Open Source solution can fully replace LISA, begin work on implementation plan b. If an Open Source solution cannot fully replace LISA, begin design work for LISA v4
squirrelsc commented 1 year ago

This is personal opinions, not represent LISA maintainers, and future of LISA. Thanks for the feedback, we respect different opinions. We had a lot of internal discussions already, won't be repeated here.

Anyone is welcome to submit PR to fix issues, which are mentioned here, but not fixed. The PR owner is responsible for quality of the PR, including test coverage. The evidence here is not enough to motive us to consider replacements.

andyleejordan commented 12 months ago

@avylove you might be interested in the other LISAv3 I wrote in https://github.com/microsoft/lisa/pull/1107, that was not accepted in favor of this version you're now having to deal with.