trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/
Other
1.22k stars 568 forks source link

General `Trilinos` maintenance: useless spaces are making up to 2 `MB` #12398

Open romintomasetti opened 1 year ago

romintomasetti commented 1 year ago

Enhancement

Trilinos is full of:

On today's develop, running the Python script attached (see below) leads to:

INFO:root:> Parsed 51715 files
INFO:root:> There are 1435701 errors of type Types.TrailingSpace, total weight 1810923
INFO:root:> There are 46797 errors of type Types.OnlySpace, total weight 201760
INFO:root:> There are 480 errors of type Types.ParseError, total weight 480

That is, each time someone makes a shallow clone of Trilinos (thinking of the CI mostly), it gets almost 2 megabytes of...spaces (I don't know what happens for the 480 files for which I got a UnicodeDecodeError and I guess it deserves a new issue on its own).

This:

Actions:

  1. I agree that removing trailing spaces all at once is not desirable as it breaks the history of the modified lines. However, I think it would be a great addition to add something along commonTools/test/utilities/check-mpi-comm-world-usage.py (@JacobDomagala) that ensures no new trailing space is added.
  2. Make a single PR that, for Trilinos packages, removes lines whose content is only spaces.
    • This would not modify the meaning of the files, but would already save somewhere around 201760 characters (I guess a bit less than that, see next bullet point).
    • Of course, files that are snapshots of other repos should not be touched (Google Test, Kokkos and others)
    • I haven't made a thorough study to understand what are the sources of the spaces but sure there are many in files maintained by Trilinos packages so it's worth a try.

To quote someone that phrased it much better then me

Trailing whitespace has nothing to do with the quality of the code because the code is the same with or without it.

Trailing whitespace is just garbage. It makes source files bigger and makes comparing changes more difficult. This is especially annoying when you have to do a code review or merge, but one version of the file has different spaces than the other. Keeping trailing spaces trimmed is therefore considered as part of basic hygiene together with consistent indentation, consistent new-lines and so on.

Last pro: useless tokens clearly have an (slight) impact on the compiler's required time to parse Trilinos source code. Removing the trailing spaces might induce very tiny gains with that respect.

trailing_white_space.py.txt

cwpearson commented 1 year ago

This will replace any line that is only spaces with an empty line:

shopt -s globstar
sed -i -e 's/^[[:space:]]*$//g' packages/PACKAGE/**/*.cpp packages/PACKAGE/**/*.hpp

This will remove any trailing spaces on a line (including a line that is only spaces, note missing ^ in regex)

shopt -s globstar
sed -i -e 's/[[:space:]]*$//g' packages/PACKAGE/**/*.cpp packages/PACKAGE/**/*.hpp

e.g. for Tpetra:

https://github.com/trilinos/Trilinos/compare/develop...cwpearson:Trilinos:cleanup/remove-trailing-spaces

GrahamBenHarper commented 1 year ago

@romintomasetti @cwpearson I have done some work related to this previously in MueLu with #11258 using a sed command inside a find command. I also opened #12058, and the MueLu team decided that it would be an uphill battle to try to maintain standards occasionally without also having some sort of Github action to enforce standards on a PR level.

Developing such a Github action is a work in progress still, but it's only one package. Enforcing it across all of Trilinos is a different story entirely and would require mass adoption from developers and leadership.

romintomasetti commented 1 year ago

Hi @cwpearson and @GrahamBenHarper !

Thanks for sharing your thoughts.

I think there are 2 points w.r.t. trailing spaces:

  1. New commits should not introduce trailing spaces. Trailing spaces is a bad style and I see no valid reason why in 2023 any package would be allowed exemption from this rule. Any decent IDE has capabilities for helping with trailing spaces BTW.
  2. As I said, changing existing lines of code to enforce any sort of style rule is the best way to lose the history of the line. I think it's best applying stylistic changes only when the line is touched by a new development.

Based on these 2 points, I would propose that a Github action is setup for banning new trailing spaces in any Trilinos package. See below for a simple script that does that.

This means:

  1. No package maintainer is forced to make a PR to delete current trailing spaces.
  2. All package developers must comply to the no-more-trailing-spaces rule. No exception to that.

@GrahamBenHarper Enforcing clang-tidy at the Trilinos level would be easy, because you can opt-in for a few basic rules (for instance Trilinos could decide that only llvm-namespace-comment must be enforced globally, see also https://clang.llvm.org/extra/clang-tidy/checks/list.html for a complete list of supported checks). I agree that enforcing clang-format is a different story, because the checks are not opt-in and could lead to significant changes that sometimes aren't pretty. And, as I mentioned, significant stylistic changes of existing code without any new development should not be allowed in my opinion.

Here is a quickly drafted script. I used Python instead of any bash-based solution because of the additional flexibility to filter files you want to ignore for instance. I could run in python:latest Docker image in which you install GitPython. Use the script without any argument to look at all tracked files, or with --tracked-changed-only --commit-start upstream/develop --commit-end HEAD to compare your local branch to whatever upstream branch you want.

import argparse
import logging
import os
import pathlib
import typing

import git
import typeguard

@typeguard.typechecked
def parse_args() -> argparse.Namespace:
    """Parse CLI arguments."""
    parser = argparse.ArgumentParser()

    parser.add_argument("--directory", help = "Directory wherein checks are performed (recursively)", required = False, default = os.getcwd())

    group = parser.add_argument_group(title = 'Tracked changed only', description = "Only look at Git tracked files that changed between 2 commits.")
    group.add_argument("--tracked-changed-only", action = "store_true", help = "Enable looking at tracked changed files only.")
    group.add_argument("--commit-start", type = str, required = False, help = "If only tracked changed files must be considered, this is the start commit.")
    group.add_argument("--commit-end"  , type = str, required = False, help = "If only tracked changed files must be considered, this is the end commit.")

    args = parser.parse_args()

    if args.tracked_changed_only and (args.commit_start is None or args.commit_end is None):
        parser.error("Start and end commits are required.")

    return args

@typeguard.typechecked
def files_to_look_at(args : argparse.Namespace) -> typing.Iterator[pathlib.Path]:
    """
    Generator that either:
        - only keeps files that are tracked and modified between 2 Git commits are kept.
        - keeps all tracked files
    """
    logging.info(f"Listing tracked files to look at based on {args}.")

    repo = git.Repo(path = args.directory)

    if args.tracked_changed_only:

        commit_start = repo.commit(args.commit_start)
        commit_end   = repo.commit(args.commit_end)

        changed_files = repo.git.diff('--name-only', commit_start, commit_end).splitlines()

        logging.debug(
            "Filtering any file matching generator with files changed between "
            + f"'{args.commit_start}' ({commit_start}) and '{args.commit_end}' ({commit_end}):\n"
            + str(changed_files)
        )

        for changed in changed_files:
            yield pathlib.Path(changed).absolute()
    else:
        for tracked in repo.commit().tree.traverse():
            path = args.directory / pathlib.Path(tracked.path)
            if path.is_file():
                yield path

@typeguard.typechecked
def filter_what_you_want_to_ignore(files : typing.Iterable[pathlib.Path]) -> typing.Iterator[pathlib.Path]:
    """
    Add any filtering rule.
    """
    logging.info("Filtering to keep only files we want.")

    # As an example, let's filter any PNG or YAML file.
    for file in files:
        if not any(file.suffix ==  suffix for suffix in ['.png', '.yaml']):
            yield file

@typeguard.typechecked
def check_trailing_space(files : typing.Iterable[pathlib.Path]) -> None:
    """
    Check for any trailing space in provided files.
    """
    logging.info(f"Checking for trailing spaces.")

    has_error = False
    for file in files:
        with open(file, 'r') as f:
            for counter, line in enumerate(f):
                if line.endswith(' \n'):
                    has_error = True
                    logging.error(f"Trailing space at {file}:{counter+1}")

    if has_error:
        raise RuntimeError("At least one trailing space detected.")

if __name__ == "__main__":

    logging.basicConfig(level = logging.INFO)

    args = parse_args()

    check_trailing_space(files = filter_what_you_want_to_ignore(files = files_to_look_at(args = args)))
ccober6 commented 1 year ago

I would like to add this topic to the TUG's Developers Day Open Discussion, and get additional input and see if we can get consensus.

romintomasetti commented 1 year ago

@ccober6 It seems the topic was not addressed during TUG's Open Discussion session... What's the way forward? :smile:

cgcgcg commented 1 year ago

@romintomasetti We will put together a clang-format approach at the MueLu level (github action, pre-commit hook, and some type of bot). Once we feel like we have a reasonable approach other packages can opt in. We will have one massive reformatting commit that can be ignored in git blame.

github-actions[bot] commented 1 week ago

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label. If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE. If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.