pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Significant performance degradation in 0.9.1 for SparseDataFrame methods like to_dense() and save() and for arithmetic operations #2273

Closed bluefir closed 12 years ago

bluefir commented 12 years ago

This is what I have in version 0.9.0:

import pandas as pd
pd.__version__

'0.9.0'

barra_industry_exposures

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 253738 entries, (20061229, '00036110') to (20120928, 'Y8564W10')
Data columns:
MINING_METALS                 253738 non-null values
GOLD                          253738 non-null values
FORESTRY_PAPER                253738 non-null values
CHEMICAL                      253738 non-null values
ENERGY_RESERVES               253738 non-null values
OIL_REFINING                  253738 non-null values
OIL_SERVICES                  253738 non-null values
FOOD_BEVERAGES                253738 non-null values
ALCOHOL                       253738 non-null values
TOBACCO                       253738 non-null values
HOME_PRODUCTS                 253738 non-null values
GROCERY_STORES                253738 non-null values
CONSUMER_DURABLES             253738 non-null values
MOTOR_VEHICLES                253738 non-null values
APPAREL_TEXTILES              253738 non-null values
CLOTHING_STORES               253738 non-null values
SPECIALTY_RETAIL              253738 non-null values
DEPARTMENT_STORES             253738 non-null values
CONSTRUCTION                  253738 non-null values
PUBLISHING                    253738 non-null values
MEDIA                         253738 non-null values
HOTELS                        253738 non-null values
RESTAURANTS                   253738 non-null values
ENTERTAINMENT                 253738 non-null values
LEISURE                       253738 non-null values
ENVIRONMENTAL_SERVICES        253738 non-null values
HEAVY_ELECTRICAL_EQUIPMENT    253738 non-null values
HEAVY_MACHINERY               253738 non-null values
INDUSTRIAL_PARTS              253738 non-null values
ELECTRICAL_UTILITY            253738 non-null values
GAS_WATER_UTILITY             253738 non-null values
RAILROADS                     253738 non-null values
AIRLINES                      253738 non-null values
FREIGHT                       253738 non-null values
MEDICAL_SERVICES              253738 non-null values
MEDICAL_PRODUCTS              253738 non-null values
DRUGS                         253738 non-null values
ELECTRONIC_EQUIPMENT          253738 non-null values
SEMICONDUCTORS                253738 non-null values
COMPUTER_HARDWARE             253738 non-null values
COMPUTER_SOFTWARE             253738 non-null values
DEFENCE_AEROSPACE             253738 non-null values
TELEPHONE                     253738 non-null values
WIRELESS                      253738 non-null values
INFORMATION_SERVICES          253738 non-null values
INDUSTRIAL_SERVICES           253738 non-null values
LIFE_HEALTH_INSURANCE         253738 non-null values
PROPERTY_CASUALTY_INSURANCE   253738 non-null values
BANKS                         253738 non-null values
THRIFTS                       253738 non-null values
ASSET_MANAGEMENT              253738 non-null values
FINANCIAL_SERVICES            253738 non-null values
INTERNET                      253738 non-null values
REITS                         253738 non-null values
BIOTECH                       253738 non-null values
dtypes: int64(55)

sparse = barra_industry_exposures.to_sparse(fill_value=0)
sparse

<class 'pandas.sparse.frame.SparseDataFrame'>
MultiIndex: 253738 entries, (20061229, '00036110') to (20120928, 'Y8564W10')
Columns: 55 entries, AIRLINES to WIRELESS
dtypes: float64(55)

%timeit sparse / 100.

100 loops, best of 3: 6.64 ms per loop

%timeit sparse.to_dense()

10 loops, best of 3: 127 ms per loop

%timeit sparse.save('test.pkl')

1 loops, best of 3: 16.9 ms per loop

Now this is what I get in 0.9.1:

import pandas as pd
pd.__version__

'0.9.1'

%timeit sparse / 100.

1 loops, best of 3: 92.2 s per loop

%timeit sparse.to_dense()

1 loops, best of 3: 99.8 s per loop

%timeit sparse.save('test.pkl')

1 loops, best of 3: 100 s per loop

So, in the new version, SparseDataFrame operations that used to take between about 7 ms and 130 ms now take more than 90 s each. Ouch! What happened?
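To put a number on the regression, the slowdown factor for each operation can be computed directly from the `%timeit` results quoted above. This is only a sketch of the arithmetic: the dictionary name `timings` and the layout are illustrative choices, and the figures are the single best-of-3 values from the report, not a fresh measurement.

```python
# Best-of-3 timings copied from the report above, converted to seconds.
# Each entry is (time in 0.9.0, time in 0.9.1).
timings = {
    "sparse / 100.": (6.64e-3, 92.2),
    "sparse.to_dense()": (127e-3, 99.8),
    "sparse.save('test.pkl')": (16.9e-3, 100.0),
}

# Slowdown factor = new time / old time.
slowdowns = {op: new / old for op, (old, new) in timings.items()}

for op, factor in slowdowns.items():
    print(f"{op:<26} {factor:>10,.0f}x slower")
```

By this arithmetic the division is roughly 14,000 times slower, `to_dense()` roughly 800 times slower, and `save()` roughly 6,000 times slower.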

changhiskhan commented 12 years ago

We need more performance benchmarks in the vbench suite. Thanks for the feedback. We'll investigate.
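As a rough illustration of the kind of check a benchmark suite could run, here is a minimal sketch that uses only the standard-library `timeit` module. It is not vbench's API. The helper names `best_of` and `check_budget`, the stand-in `workload` function, and the 5-second budget are all hypothetical; a real benchmark would time the sparse operations from the report instead.

```python
import timeit


def best_of(func, repeat=3, number=1):
    """Return the best per-call wall time in seconds (like %timeit's 'best of 3')."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return min(totals) / number


def check_budget(func, budget_seconds, repeat=3, number=1):
    """Time func and report whether it stayed within the given time budget."""
    elapsed = best_of(func, repeat=repeat, number=number)
    return elapsed <= budget_seconds, elapsed


def workload():
    # Hypothetical stand-in for an operation such as `sparse / 100.`.
    return [x / 100.0 for x in range(10000)]


# The 5-second budget is an arbitrary illustrative threshold.
ok, elapsed = check_budget(workload, budget_seconds=5.0)
print("within budget" if ok else "REGRESSION", f"({elapsed:.6f} s)")
```

Taking the minimum over several repeats, as `%timeit` does, reduces noise from other processes, so a jump from milliseconds to tens of seconds would stand out clearly against any sensible budget.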

ghost commented 12 years ago

This looks like 4a5b75b44b0048, though I'm not sure why take is so expensive. The pending #2253 (3688e53) fixes the problem for me.

Testcase:

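from random import randint

import pandas as pd

num = 250000
# Four columns of random integers; the first two become the MultiIndex levels.
l1 = [randint(0, 1000) for x in range(num)]
l2 = [randint(0, 20000) for x in range(num)]
l3 = [randint(0, 20000) for x in range(num)]
l4 = [randint(0, 20000) for x in range(num)]
a = pd.DataFrame(dict(zip([0, 1, 2, 3], [l1, l2, l3, l4]))).set_index([0, 1])
b = a.to_sparse()
# Time the three operations from the original report (IPython magics).
%timeit b/100
%timeit b.to_dense()
%timeit b.save('test.pkl')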

Edit: but perhaps there's another issue at play. I can't reproduce anything like a 90 s runtime on this data.

wesm commented 12 years ago

Doh, this will teach me to review PRs more carefully; this is theoretically what vbench is for. I will fix it.

wesm commented 12 years ago

Ugh, iteritems for all DataFrames has borked performance. Guess we're going to see 0.9.2 sooner rather than later.