scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.85k stars 594 forks source link

Pre-commit (dev workflow) #1563

Closed ivirshup closed 3 years ago

ivirshup commented 3 years ago

I'd like to start using pre-commit with scanpy and anndata. Pre-commit is essentially a tool that manages scripts we'd like to run before each commit, e.g. linting and formatting, so it becomes essentially impossible to forget these.

I think this can allow PRs to progress faster since it gives us a way to codify formatting requirements – so we don't have to remember them – and have these checks happen locally – so we don't have to wait on CI. Of course, having these checks run depends on developers installing pre-commit, so we can also run these checks on CI (example ci script, example run).

There is a question of what things we'd like to add here. For sure: black. I think import checks (e.g. no unused imports) and flake8 would be good too. We can also add custom checks for things like slow imports.

I think this would be a good time to run black over the whole codebase so we don't have exempted files any more.

My questions for the dev team:

@michalk8, I saw you added this to squidpy. How's the experience been there – that is, any major foot guns we should look out for? Also, are there any tools you're using (beyond the basic black, isort, flake8) you'd especially recommend?

LuckyMD commented 3 years ago

I like this idea a lot. I recently learned about automated linting and blacking of code per commit and have started using it for the single cell open problems project upon the suggestion of @scottgigante.

michalk8 commented 3 years ago

@ivirshup

I'd recommend these:

Here's also an exhaustive list from which I picked the ones I use: https://pre-commit.com/hooks.html

As for any problems, some of them came from rstcheck as docs/source/classes.rst:9: (INFO/1) No directive entry for "autoclass" in module "docutils.parsers.rst.languages.en"., that's why .rstcheck.cfg might be necessary. Also fixing types for mypy takes a while, I'd do it as last. Also flake8-comprehensions forces you to abandon dict(foo=..., bar=...) calls, unless you disable it. And [tool.isort] in pyproject.toml should have profile = "black".

I had also planed to use pylint to enforce naming conventions, etc., but it has large overlap with flake8 + the config file can be rather large, see https://gist.github.com/dkatz23238/08d79a76911ddaaa6171dff9cddd5772

Koncopd commented 3 years ago

I really don't want it to be mandatory. It is not a rational objection, i just prefer my code to be committed as i wrote it and then run some formatting scripts separately if needed.

ivirshup commented 3 years ago

@michalk8 thanks for the extensive recommendations!

I think I'd like to keep the number of tools used small. It's the worst when you want to fix a bug, but instead have to learn about configuring a linter. More tools means more configurations people need to be familiar with, and the goal is reducing cognitive load.

Also fixing types for mypy takes a while, I'd do it as last.

Yeah, I figured this would be the case. Does mypy allow partial typing these days? Also, I haven't found the numpy or pandas type stubs to always be great. Have you run into problems around this?

I think this would also need to wait at least until we can drop python 3.6 for anndata, since adding types there currently means circular dependencies.

rstcheck to check the syntax of .rst files

I would particularly like a linter for rst. I noticed you also had doc8, but you'd recommend rstcheck check over this? I'm a little worried, considering its last release was over a year ago.

Spell check for prose in doc-strings could also be great, but I could see this being overzealous (is there a good way to notify about misspelled words, while not being annoying about technical terms?)

I'm a little worried about some custom sphinx extensions we have, and conflicting with this, any experience here?


@Koncopd, I think I agree with your concern, as I said above: it's the worst when you want to fix a bug, but instead have to learn about configuring a linter. I also think it's very easy to add new checks, so someone complaining about new ones is valuable.

Per commit, this should always be an option with git commit --no-verify, though you could also just not install pre-commit.

I would like to keep the required checks limited, ideally formatting tasks that can be automated as opposed "this is poor style" warnings. I also know these tools can be wrong (e.g. black when expression's have many operators, sometimes with chaining) so it would be good to document the escape hatches for this (# fmt: off for black).

That said, we already do require that merged code goes through black before it gets merged, and a benefit of using this would be to not have commit messages like "formatting", "remove unused import", etc. The pre-commit checks would be a part of CI as well, so it would be eventually mandatory – just not on your machine.

Does this address your concerns?

LuckyMD commented 3 years ago

I think I misunderstood... I thought you were suggesting things like autoblack in github actions for commits.

giovp commented 3 years ago

If I can jump on this, talking by personal experience, it would be a very very useful tool for contributors, especially young/inexperienced ones (like me!).

In squidpy @michalk8 put together a very comprehensive check list in pre-commits, and I'm appreciating it more and more as I get familiar with it. yes, there is a lot of cognitive load at the beginning, and yes it can be very (very) painful, but when you get used to it, it soon becomes essential and actually really useful.

Only concern of course is that it highers the bar for contributions in the repo, but honestly I'm seeing it being adopted in other large bio-related oss (e.g. https://github.com/napari/napari ). I think this can be simplified by having an extensive contributors guide, and the explicit mention on how to skip pre-commits and submit the PR anyway (and then otehr scanpy dev can jump in and give suggestions on why precommits failed).

flying-sheep commented 3 years ago

@Koncopd pre-commit doesn’t have to be configured, you can choose not to enable it. Of course tests will fail then.

regarding mypy: I guess it’s possible to make everything return -> t.Any: # TODO fix typing

I like isort!

Also we should get #1527 in before doing any big restructuring: It’s been through too many rounds of delays and I had to resolve conflicts so many times.

ivirshup commented 3 years ago

I think it would be nice to start small, then gradually add to this. I'd nominate black as the first step, then using pre-commit on CI, then adding more tools. I like black for a starting place since we can just run it over both dev and stable branches and be confident in nothing breaking. For anything that requires more manual intervention, I think we'll have to wait until close to a new stable branch being cut so we don't have to worry about conflicts during backports.

I had also thought isort could be a good starting place, but it might actually be some work to turn on due to "partially initialized module" errors (imperative programming strikes again!).

Some initial settings for `isort` ```toml [tool.isort] profile = "black" multi_line_output = 3 skip = "scanpy/__init__.py" ```
giovp commented 3 years ago

can we get at least flake8 also?

scottgigante commented 3 years ago

I just added black, isort, and autopep8 to scprep which worked pretty seamlessly. pre-commit config: https://github.com/KrishnaswamyLab/scprep/blob/dev/.pre-commit-config.yaml precommit github action: https://github.com/KrishnaswamyLab/scprep/blob/dev/.github/workflows/pre-commit.yml

ivirshup commented 3 years ago

I've just merged initial pre-commit stuff via #1684 (just black). Once the doc builds propagate, there will be a section under: "Getting set up" in the dev docs on how to install.

I would like to see a flake8 PR, though it might take a bit of time to hash out configuration. Maybe @giovp or @Zethson would be interested in looking into this?

Zethson commented 3 years ago

Yeah, I can add flake8 and some more stuff

Koncopd commented 3 years ago

I'd prefer to have only black on pre-commit for some time.

Zethson commented 3 years ago

I'd prefer to have only black on pre-commit for some time.

flake8 just checks and does not format. It makes aware of some nasty code and these things should get fixed immediately. I strongly advocate for it.

Zethson commented 3 years ago

I had also thought isort could be a good starting place, but it might actually be some work to turn on due to "partially initialized module" errors (imperative programming strikes again!).

Yeah, I ran into this stuff when creating the flake8 PR (#1689). isort is dangerous with Scanpy and requires good testing and many exceptions.

ivirshup commented 3 years ago

@Koncopd, I would agree, but there have a been a few bugs recently that flake8 definitely would have caught. This is stuff like "parameters are never used".

Rules that I think are useful:

If you run into any rules you don't like, you can just add them to the ignore and we can deal with it during review.

Does that sound reasonable?

flying-sheep commented 3 years ago

Flake8 can indeed be buggy, but the rules are very modular and can be turned off.

Even if we end up with only “unused import” and “unused parameter” being left, it’s a win.

flying-sheep commented 3 years ago

One thing I noticed: The pre-commit GH workflow for a commit I made seems to be queued for quite a while, doesn’t appear in the list of checks, and the PR thinks all checks have run.

This doesn’t seem ideal, as PRs can slip in where it will only run (and then fail) after the PR is merged.

Can we configure it to be mandatory? Will that change anything?

ivirshup commented 3 years ago

Github actions are down. It seems like they have problems at least once a week: https://twitter.com/githubstatus.

I'd be fine with having pre-commit be a required check. I'm a little antsy about having something that goes down frequently be required though.

flying-sheep commented 3 years ago

double-quote-string-fixer could be used to keep single quotes with black!

Zethson commented 3 years ago

double-quote-string-fixer could be used to keep single quotes with black!

While true, I think we should just stick to Black's default (although the 88 line limit is still killing me). I would add:

  1. check-added-large-files
  2. end-of-file-fixer
  3. trailing-whitespace
  4. check-yaml
  5. check-toml

and maaaaybe https://github.com/prettier/pre-commit

ivirshup commented 3 years ago

I'd be up for all of the numbered ones. IIRC, I had some issues with the trailing whitespace/ end of file fixers and some binary files/ csvs in the test suite. I'm a bit worried about false positives with check-large-files, but so long as it's easy to allow certain things (e.g. intentionally added test data) it should be fine.

In terms of breaking these things down into small tasks/ PRs how about: (1), (2, 3), (4, 5)?

prettier looks a bit heavy and like it's targeting a lot of stuff we don't use, so you'd have to make a good case.

ivirshup commented 3 years ago

I think we've done enough to mark this as closed. For any additional checks, please open a new issue.

flying-sheep commented 3 years ago

I think we should just stick to Black's default

we’re not, we skip string normalization entirely: https://github.com/theislab/scanpy/blob/4dd8de9e355ce8d59009c15522670a3d0970462c/pyproject.toml#L133

giovp commented 3 years ago

does anybody knows of a pre-commit check that checks for spelling in docstrings? I think that would be very cool to have.