pepkit / looper

A job submitter for Portable Encapsulated Projects
http://looper.databio.org
BSD 2-Clause "Simplified" License

Speeding up looper CLI use #476

Open nsheff opened 3 months ago

nsheff commented 3 months ago

I'm unsatisfied with how long it takes the looper CLI to run. I guess it's because looper imports a bunch of heavy stuff, like pandas, peppy, sqlalchemy (via pephubclient), etc.

A lot of these aren't necessary.

I suggest we see if it's possible to import some of the heaviest things only as needed, instead of at the top of the file as is typically done.
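A minimal sketch of the deferred-import idea, not looper's actual code: the import moves from module level into the one function that needs it, so code paths that never call that function never pay for it. The function names `summarize` and `version` are hypothetical, and the standard-library `statistics` module stands in for a heavy dependency such as pandas.

```python
import sys


def summarize(values):
    """Return the mean of `values`, importing the dependency only when called."""
    # Deferred import: `statistics` is a stand-in for a heavy package
    # (e.g. pandas). It is loaded on the first call, not at module import time.
    import statistics

    return statistics.mean(values)


def version():
    """A cheap code path that never triggers the deferred import."""
    return "0.0.0"
```

After the first call the module is cached in `sys.modules`, so repeated calls to `summarize` only pay a dictionary lookup. The trade-off is that an import error surfaces at call time instead of at startup.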

You can profile import time like this:

python -X importtime -c 'import looper'

nleroy917 commented 3 weeks ago

You can view the output with a cool tool called tuna. I ended up running the following to profile the import time:

python -X importtime -c "from looper.__main__ import main; main()" 2> looper.log

I just did this over at geniml and remembered this issue, so I figured while I was on a roll... Also, I was struggling with this when running looper recently. Here is the tuna output:

[image: tuna visualization of looper import time]

Seems like pandas (in peppy) is a big issue.

Here is the log output, looper.log, if someone wants to download it and run tuna themselves.
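For a quick look at such a log without a viewer, a small parser can rank the entries by cumulative time. This is a rough sketch, assuming the log contains the standard `-X importtime` lines of the form `import time: <self us> | <cumulative us> | <module>`. The helper names `parse_importtime` and `slowest` are hypothetical.

```python
def parse_importtime(text):
    """Parse `python -X importtime` output into (module, self_us, cumulative_us) rows."""
    rows = []
    prefix = "import time:"
    for line in text.splitlines():
        if not line.startswith(prefix):
            continue  # skip anything the program itself wrote to stderr
        parts = line[len(prefix):].split("|")
        if len(parts) != 3:
            continue
        self_us, cumulative_us, name = (part.strip() for part in parts)
        if not (self_us.isdigit() and cumulative_us.isdigit()):
            continue  # skip the column header line
        rows.append((name, int(self_us), int(cumulative_us)))
    return rows


def slowest(rows, n=10):
    """Return the n rows with the largest cumulative import time."""
    return sorted(rows, key=lambda row: row[2], reverse=True)[:n]
```

Usage would look like `slowest(parse_importtime(open("looper.log").read()))`. Note that cumulative time includes nested imports, so a parent package always ranks at least as high as its slowest child.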

donaldcampbelljr commented 1 week ago

I cannot reproduce those slow import times. I get ~0.4-0.56 seconds during import. I tested a fresh venv as well.

donaldcampbelljr commented 3 days ago

I've begun some work towards replacing pandas with polars and doing performance testing.

peppy_branch: https://github.com/pepkit/peppy/tree/dev_replace_pandas_with_polars

Importing Pandas, Peppy, and Looper 50 times each, we see the following mean and std of import time, in milliseconds:

Using pandas (n=50):

──────────────────────────────────── Pandas ────────────────────────────────────
mean    188.684421
std    3.665686
──────────────────────────────────── Peppy ─────────────────────────────────────
mean    244.675341
std    20.345653
──────────────────────────────────── Looper ────────────────────────────────────
mean    470.185256
std    27.921771

Replacing pandas with polars in Peppy (n=50):

──────────────────────────────────── Polars ────────────────────────────────────    
mean   51.336722
std   11.519378
──────────────────────────────────── Peppy ─────────────────────────────────────    
mean   185.085058
std   42.192459

Note: I have not tested Looper with the polars replacement yet, because I realized that Looper pulls in pandas from Peppy, Ubiquerg, and Pipestat, so it was becoming difficult to pull out pandas completely.
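A measurement of this kind could be reproduced roughly as follows. This is only a sketch and not the script used for the numbers above, which is not shown in this thread. It assumes each import is timed in a fresh interpreter so that cached modules do not skew later runs, and the helper name `time_import` is hypothetical.

```python
import statistics
import subprocess
import sys
import time


def time_import(module, runs=50):
    """Time `import <module>` in a fresh interpreter `runs` times.

    Returns (mean_ms, std_ms). Each sample includes interpreter startup,
    so compare modules against each other rather than reading the
    absolute values as pure import cost.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), statistics.stdev(samples)
```

For example, `time_import("pandas")` and `time_import("polars")` could be compared in the same environment. Timing `python -c "pass"` the same way gives a startup baseline to subtract.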