scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Import performance part 3 #756

Open flying-sheep opened 5 years ago

flying-sheep commented 5 years ago

In #406 we decided to get rid of scanpy.api, which worsened our import time. Thanks to @ivirshup (#703, #704), the main culprits behind the long import times are out of the game, but there’s still room for improvement.

I used profimp to identify the rest. I started with just profimp --html 'import scanpy', identified the external imports that take a while, and created a file in which I imported them before finally importing scanpy:

scanpy-imports.py

# anndata big imports
import numpy
import pandas
import zarr
import h5py

# scanpy big imports
import numba
import sklearn          # preprocessing._simple
#import sklearn.metrics  # neighbors
#import networkx         # diffmap, paga, plotting._utils
import leidenalg
import louvain
import matplotlib.pyplot
import tables           # sim → readwrite

# rest
import scanpy

$ profimp --html "$(cat scanpy-imports.py)" >! profimp-scanpy.htm
Outdated: 1.4s with networkx and sklearn.metrics ![grafik](https://user-images.githubusercontent.com/291575/62112333-d2149d80-b2b2-11e9-92e0-9887b8574c8e.png)

[image: profimp import profile]
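For anyone without profimp at hand, a rough equivalent can be sketched with CPython's built-in `-X importtime` flag (Python 3.7+). This is only an illustration: the helper name `import_times` is made up here, and it reports cumulative microseconds per module rather than profimp's HTML tree.

```python
import re
import subprocess
import sys

# Data lines printed by `python -X importtime` look like:
#   import time:       123 |        456 |   package.module
# The header line has no digits in the first column, so it is skipped.
_LINE = re.compile(r"import time:\s+(\d+)\s+\|\s+(\d+)\s+\|\s+(\S+)")


def import_times(statement):
    """Run `statement` in a fresh interpreter and return
    (module_name, cumulative_microseconds) pairs, slowest first."""
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", statement],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,  # the import trace is written to stderr
        text=True,
        check=True,
    )
    rows = []
    for line in proc.stderr.splitlines():
        match = _LINE.match(line)
        if match:
            rows.append((match.group(3), int(match.group(2))))
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # `json` is a stand-in; replace it with the package you want to profile.
    for name, microseconds in import_times("import json")[:10]:
        print(f"{microseconds:>10} us  {name}")
```

Running each measurement in a fresh interpreter matters, since modules already present in `sys.modules` would otherwise show up as free.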

flying-sheep commented 5 years ago

The biggest ones are, in order:

  1. [ ] numba: Hard to defer. We’d have to create our own jit decorator returning a callable object that numba-compiles and caches the real function on its first invocation
  2. pandas: Used all over the place, not feasible to defer
  3. [x] sklearn.metrics: Easy to defer I think, let’s start with this.
  4. [ ] matplotlib.pyplot: Shouldn’t be used in a library at all. It exists to import the kitchen sink in order to be low-friction for interactive use. Hard to do since we rely on it a lot, but we should do it.
  5. [x] networkx: Used in DPT, paga and plotting. Pretty easy

We use pandas all over the place, and it’s hard to defer loading numba as it works with decorators.
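To make the numba idea from the list concrete, here is a minimal sketch of such a deferring decorator. It is not something scanpy ships: `LazyJit`, `lazy_jit`, and the pluggable `compiler` argument are hypothetical names for this illustration, and the default compiler assumes `numba.njit` is what we would want to call.

```python
import functools


def _numba_njit(func, **options):
    # Deferred import: numba is only loaded when the first compilation happens.
    import numba

    return numba.njit(**options)(func)


class LazyJit:
    """Callable that compiles the wrapped function on its first invocation
    and reuses the compiled version afterwards."""

    def __init__(self, func, compiler=_numba_njit, **options):
        functools.update_wrapper(self, func)
        self._func = func
        self._compiler = compiler
        self._options = options
        self._compiled = None

    def __call__(self, *args, **kwargs):
        if self._compiled is None:
            self._compiled = self._compiler(self._func, **self._options)
        return self._compiled(*args, **kwargs)


def lazy_jit(compiler=_numba_njit, **options):
    """Decorator factory; `options` are passed through to the compiler."""

    def decorator(func):
        return LazyJit(func, compiler=compiler, **options)

    return decorator


# Demonstration with a stand-in compiler so the sketch runs without numba.
compile_log = []


def counting_compiler(func, **options):
    compile_log.append(func.__name__)
    return func


@lazy_jit(compiler=counting_compiler)
def add(a, b):
    return a + b
```

Decorating costs nothing at import time; the compiler (and with the default, the numba import) only runs when `add` is first called. The price is one extra Python-level call per invocation and a wrapper object that other jitted functions cannot call directly, which is part of why this is hard to do well.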

/edit: shaved off another 2/5 in a7729bc61ac569a718075edb4466852b0b4a696a via sklearn.metrics, scipy.stats, and networkx.
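Deferring an import of this kind usually just means moving it from module level into the function that needs it. A minimal sketch, with the stdlib `statistics` module standing in for a heavy dependency like sklearn.metrics (the function names `spread` and `is_loaded` are invented for the example):

```python
import sys


def spread(values):
    """Population standard deviation of `values`."""
    # Function-level import: the module is loaded on the first call,
    # not when the enclosing package is imported. Later calls only pay
    # for a dictionary lookup in sys.modules.
    import statistics

    return statistics.pstdev(values)


def is_loaded(module_name):
    """Return True if `module_name` has already been imported."""
    return module_name in sys.modules
```

After the first call to `spread`, `is_loaded("statistics")` is true; before it, that depends only on whether something else imported the module.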

ivirshup commented 5 years ago

I don't think we should bother with numba, since it'll likely be a pretty core requirement once we can start transitioning to pydata/sparse.

For pyplot, does matplotlib also take a while to import? Management of environment variables is a good reason not to defer that import.

If we're already using h5py, could we drop tables as a requirement?

I think bad import times are only really noticeable for interactive use, since any script using scanpy will likely take longer to run. Do import times change depending on interactive environment? I wouldn't be surprised if different code ran when importing something like matplotlib in a notebook vs in a script.

flying-sheep commented 5 years ago

Matplotlib takes a while but less time. Can you please point me to what you mean by the environment variables?

No idea about tables, @falexwolf wrote the sim module I think and it’s not commonly used …

I don’t think import times change noticeably, but I didn’t measure.

ivirshup commented 5 years ago

I've gotten complaints from Matplotlib about calling mpl.use to set the backend after importing pyplot (see the relevant matplotlib docs). I think it would be unintuitive if packages behaved differently depending on what functions had been called. In general, matplotlib has a lot of state and messing with it has only brought me pain.

flying-sheep commented 5 years ago

That’s exactly backwards: I find it annoying if packages modify state on import.

We already jump through hoops in our testing framework to work around our misbehavior:

https://github.com/theislab/scanpy/blob/681ce93e7e58956cb78ef81bc165558b84d6ebb0/scanpy/tests/conftest.py#L4-L6

import matplotlib.pyplot [as plt] means “I’m an end user who just opened a notebook and I want the kitchen sink, give me everything and configure everything”. Libraries shouldn’t do it, and scanpy is a library.

When we still had scanpy.api there would have been a case for importing pyplot there, as scanpy.api was for interactive use. Now we don’t have any excuses.

ivirshup commented 5 years ago

I agree that it's bad behavior to modify state on import. I think it's worse to modify state after a function is called, save a few cases where it's obvious that will happen. I think it takes less time to figure out why my plot suddenly looks different if the change is tied to imports than if it depends on which functions were called earlier.

I think if we could make all of our plots without importing pyplot that would be great. I'm not sure how feasible this is. Not only do we use pyplot a lot, but libraries we depend on for plots (like seaborn) import pyplot.
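For reference, plotting without pyplot is possible with matplotlib's object-oriented API. The following is a minimal sketch, assuming matplotlib 3.1 or newer (where a bare `Figure` can save itself without an explicitly attached canvas); `render_scatter_png` is a hypothetical helper, not a scanpy function.

```python
import io

from matplotlib.figure import Figure  # does not import matplotlib.pyplot


def render_scatter_png(xs, ys):
    """Draw a scatter plot and return it as PNG bytes, without pyplot."""
    # A Figure created directly is not registered with pyplot's global
    # figure manager, so no global state is touched.
    fig = Figure(figsize=(4, 3))
    ax = fig.subplots()
    ax.scatter(xs, ys)
    ax.set_xlabel("x")
    ax.set_ylabel("y")

    buffer = io.BytesIO()
    fig.savefig(buffer, format="png")
    return buffer.getvalue()
```

The catch is interactive display: figures built this way do not pop up via `plt.show()`, so a library going this route would still need some story for notebook users.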

flying-sheep commented 5 years ago

Ah, I didn’t know. If it was only us, we could easily get rid of it, but seaborn importing pyplot is a real shame :disappointed: FWIW, I filed mwaskom/seaborn#1815

flying-sheep commented 4 years ago

Newest numbers:

[image: updated import profile]

scipy.stats is still a big chunk, but we can’t ignore it easily due to all the sklearn imports.