microsoft / pyright

Static Type Checker for Python

Slow performance in lightGBM `get_data(self):` #4940

Closed bschnurr closed 1 year ago

bschnurr commented 1 year ago

Note: if you are reporting a wrong signature of a function or a class in the standard library, then the typeshed tracker is better suited for this report: https://github.com/python/typeshed/issues.

Describe the bug

Pyright takes about 40 seconds to analyze `def get_data(self):` in basic.py in the LightGBM package:

https://github.com/microsoft/LightGBM/blob/ca035b2ee0c2be85832435917b1e0c8301d2e0e0/python-package/lightgbm/basic.py#L2307

To Reproduce

Open the code below after pip-installing the packages listed in requirements.txt, then let pyright analyze it.

Expected behavior: analysis should not take tens of seconds for a single function.

Screenshots or Code

The repro code:

import pandas as pd
import numpy as np
import lightgbm as lgb
#import xgboost as xgb
from scipy.sparse import vstack, csr_matrix, save_npz, load_npz
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
import gc

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
import seaborn as sns
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import median_absolute_error
import scipy as sp
import scipy.special  # needed for sp.special.exp10 below
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_validate
from sklearn.model_selection import RepeatedKFold

gc.enable()

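# Compact column dtypes for pd.read_csv on the train/test CSVs (limits memory usage)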
dtypes = {
        'MachineIdentifier':                                    'category',
        'ProductName':                                          'category',
        'EngineVersion':                                        'category',
        'AppVersion':                                           'category',
        'AvSigVersion':                                         'category',
        'IsBeta':                                               'int8',
        'RtpStateBitfield':                                     'float16',
        'IsSxsPassiveMode':                                     'int8',
        'DefaultBrowsersIdentifier':                            'float16',
        'AVProductStatesIdentifier':                            'float32',
        'AVProductsInstalled':                                  'float16',
        'AVProductsEnabled':                                    'float16',
        'HasTpm':                                               'int8',
        'CountryIdentifier':                                    'int16',
        'CityIdentifier':                                       'float32',
        'OrganizationIdentifier':                               'float16',
        'GeoNameIdentifier':                                    'float16',
        'LocaleEnglishNameIdentifier':                          'int8',
        'Platform':                                             'category',
        'Processor':                                            'category',
        'OsVer':                                                'category',
        'OsBuild':                                              'int16',
        'OsSuite':                                              'int16',
        'OsPlatformSubRelease':                                 'category',
        'OsBuildLab':                                           'category',
        'SkuEdition':                                           'category',
        'IsProtected':                                          'float16',
        'AutoSampleOptIn':                                      'int8',
        'PuaMode':                                              'category',
        'SMode':                                                'float16',
        'IeVerIdentifier':                                      'float16',
        'SmartScreen':                                          'category',
        'Firewall':                                             'float16',
        'UacLuaenable':                                         'float32',
        'Census_MDC2FormFactor':                                'category',
        'Census_DeviceFamily':                                  'category',
        'Census_OEMNameIdentifier':                             'float16',
        'Census_OEMModelIdentifier':                            'float32',
        'Census_ProcessorCoreCount':                            'float16',
        'Census_ProcessorManufacturerIdentifier':               'float16',
        'Census_ProcessorModelIdentifier':                      'float16',
        'Census_ProcessorClass':                                'category',
        'Census_PrimaryDiskTotalCapacity':                      'float32',
        'Census_PrimaryDiskTypeName':                           'category',
        'Census_SystemVolumeTotalCapacity':                     'float32',
        'Census_HasOpticalDiskDrive':                           'int8',
        'Census_TotalPhysicalRAM':                              'float32',
        'Census_ChassisTypeName':                               'category',
        'Census_InternalPrimaryDiagonalDisplaySizeInInches':    'float16',
        'Census_InternalPrimaryDisplayResolutionHorizontal':    'float16',
        'Census_InternalPrimaryDisplayResolutionVertical':      'float16',
        'Census_PowerPlatformRoleName':                         'category',
        'Census_InternalBatteryType':                           'category',
        'Census_InternalBatteryNumberOfCharges':                'float32',
        'Census_OSVersion':                                     'category',
        'Census_OSArchitecture':                                'category',
        'Census_OSBranch':                                      'category',
        'Census_OSBuildNumber':                                 'int16',
        'Census_OSBuildRevision':                               'int32',
        'Census_OSEdition':                                     'category',
        'Census_OSSkuName':                                     'category',
        'Census_OSInstallTypeName':                             'category',
        'Census_OSInstallLanguageIdentifier':                   'float16',
        'Census_OSUILocaleIdentifier':                          'int16',
        'Census_OSWUAutoUpdateOptionsName':                     'category',
        'Census_IsPortableOperatingSystem':                     'int8',
        'Census_GenuineStateName':                              'category',
        'Census_ActivationChannel':                             'category',
        'Census_IsFlightingInternal':                           'float16',
        'Census_IsFlightsDisabled':                             'float16',
        'Census_FlightRing':                                    'category',
        'Census_ThresholdOptIn':                                'float16',
        'Census_FirmwareManufacturerIdentifier':                'float16',
        'Census_FirmwareVersionIdentifier':                     'float32',
        'Census_IsSecureBootEnabled':                           'int8',
        'Census_IsWIMBootEnabled':                              'float16',
        'Census_IsVirtualDevice':                               'float16',
        'Census_IsTouchEnabled':                                'int8',
        'Census_IsPenCapable':                                  'int8',
        'Census_IsAlwaysOnAlwaysConnectedCapable':              'float16',
        'Wdft_IsGamer':                                         'float16',
        'Wdft_RegionIdentifier':                                'float16',
        'HasDetections':                                        'int8'
        }

# scikit-learn examples
survey = fetch_openml(data_id=534, as_frame=True)
X = survey.data[survey.feature_names]
X.describe(include="all")
X.head()
y = survey.target.values.ravel()
survey.target.head()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42
)
train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)
_ = sns.pairplot(train_dataset, kind='reg', diag_kind='kde')
survey.data.info()

categorical_columns = ['RACE', 'OCCUPATION', 'SECTOR',
                       'MARR', 'UNION', 'SEX', 'SOUTH']
numerical_columns = ['EDUCATION', 'EXPERIENCE', 'AGE']

preprocessor = make_column_transformer(
    (OneHotEncoder(drop='if_binary'), categorical_columns),
    remainder='passthrough'
)

model = make_pipeline(
    preprocessor,
    TransformedTargetRegressor(
        regressor=Ridge(alpha=1e-10),
        func=np.log10,
        inverse_func=sp.special.exp10
    )
)
_ = model.fit(X_train, y_train)

y_pred = model.predict(X_train)
mae = median_absolute_error(y_train, y_pred)
string_score = f'MAE on training set: {mae:.2f} $/hour'
y_pred = model.predict(X_test)
mae = median_absolute_error(y_test, y_pred)
string_score += f'\nMAE on testing set: {mae:.2f} $/hour'
fig, ax = plt.subplots(figsize=(5, 5))
plt.scatter(y_test, y_pred)
ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c="red")
plt.text(3, 20, string_score)
plt.title('Ridge model, small regularization')
plt.ylabel('Model predictions')
plt.xlabel('Truths')
plt.xlim([0, 27])
_ = plt.ylim([0, 27])

feature_names = (model.named_steps['columntransformer']
                      .named_transformers_['onehotencoder']
                      .get_feature_names_out(input_features=categorical_columns))
feature_names = np.concatenate(
    [feature_names, numerical_columns])

coefs = pd.DataFrame(
    model.named_steps['transformedtargetregressor'].regressor_.coef_,
    columns=['Coefficients'], index=feature_names
)

coefs.plot(kind='barh', figsize=(9, 7))
plt.title('Ridge model, small regularization')
plt.axvline(x=0, color='.5')
plt.subplots_adjust(left=.3)

X_train_preprocessed = pd.DataFrame(
    model.named_steps['columntransformer'].transform(X_train),
    columns=feature_names
)

X_train_preprocessed.std(axis=0).plot(kind='barh', figsize=(9, 7))
plt.title('Features std. dev.')
plt.subplots_adjust(left=.3)

coefs = pd.DataFrame(
    model.named_steps['transformedtargetregressor'].regressor_.coef_ *
    X_train_preprocessed.std(axis=0),
    columns=['Coefficient importance'], index=feature_names
)
coefs.plot(kind='barh', figsize=(9, 7))
plt.title('Ridge model, small regularization')
plt.axvline(x=0, color='.5')
plt.subplots_adjust(left=.3)

cv_model = cross_validate(
    model, X, y, cv=RepeatedKFold(n_splits=5, n_repeats=5),
    return_estimator=True, n_jobs=-1
)
coefs = pd.DataFrame(
    [est.named_steps['transformedtargetregressor'].regressor_.coef_ *
     X_train_preprocessed.std(axis=0)
     for est in cv_model['estimator']],
    columns=feature_names
)
plt.figure(figsize=(9, 7))
sns.swarmplot(data=coefs, orient='h', color='k', alpha=0.5)
sns.boxplot(data=coefs, orient='h', color='cyan', saturation=0.5)
plt.axvline(x=0, color='.5')
plt.xlabel('Coefficient importance')
plt.title('Coefficient importance and its variability')
plt.subplots_adjust(left=.3)

# end of scikit learn example

print('Download Train and Test Data.\n')
train = pd.read_csv('../input/train.csv', dtype=dtypes, low_memory=True)
train['MachineIdentifier'] = train.index.astype('uint32')
test  = pd.read_csv('../input/test.csv',  dtype=dtypes, low_memory=True)
test['MachineIdentifier']  = test.index.astype('uint32')

gc.collect()

print('Transform all features to category.\n')
for usecol in train.columns.tolist()[1:-1]:

    train[usecol] = train[usecol].astype('str')
    test[usecol] = test[usecol].astype('str')

    #Fit LabelEncoder
    le = LabelEncoder().fit(
            np.unique(train[usecol].unique().tolist()+
                      test[usecol].unique().tolist()))

    #At the end 0 will be used for dropped values
    train[usecol] = le.transform(train[usecol])+1
    test[usecol]  = le.transform(test[usecol])+1

    agg_tr = (train
              .groupby([usecol])
              .aggregate({'MachineIdentifier':'count'})
              .reset_index()
              .rename({'MachineIdentifier':'Train'}, axis=1))
    agg_te = (test
              .groupby([usecol])
              .aggregate({'MachineIdentifier':'count'})
              .reset_index()
              .rename({'MachineIdentifier':'Test'}, axis=1))

    agg = pd.merge(agg_tr, agg_te, on=usecol, how='outer').replace(np.nan, 0)
    #Select values with more than 1000 observations
    agg = agg[(agg['Train'] > 1000)].reset_index(drop=True)
    agg['Total'] = agg['Train'] + agg['Test']
    #Drop unbalanced values
    agg = agg[(agg['Train'] / agg['Total'] > 0.2) & (agg['Train'] / agg['Total'] < 0.8)]
    agg[usecol+'Copy'] = agg[usecol]

    train[usecol] = (pd.merge(train[[usecol]], 
                              agg[[usecol, usecol+'Copy']], 
                              on=usecol, how='left')[usecol+'Copy']
                     .replace(np.nan, 0).astype('int').astype('category'))

    test[usecol]  = (pd.merge(test[[usecol]], 
                              agg[[usecol, usecol+'Copy']], 
                              on=usecol, how='left')[usecol+'Copy']
                     .replace(np.nan, 0).astype('int').astype('category'))

    del le, agg_tr, agg_te, agg, usecol
    gc.collect()

y_train = np.array(train['HasDetections'])
train_ids = train.index
test_ids  = test.index

del train['HasDetections'], train['MachineIdentifier'], test['MachineIdentifier']
gc.collect()

print("If you don't want to use a sparse matrix, choose Kernel Version 2 for a simpler solution.\n")

print('--------------------------------------------------------------------------------------------------------')
print('Transform Data to Sparse Matrix.')
print('A sparse matrix can be used to fit many models, e.g. XGBoost, LightGBM, Random Forest, K-Means, etc.')
print('To concatenate sparse matrices by column, use hstack()')
print('Read more about sparse matrices: https://docs.scipy.org/doc/scipy/reference/sparse.html')
print('Good Luck!')
print('--------------------------------------------------------------------------------------------------------')

#Fit OneHotEncoder
ohe = OneHotEncoder(categories='auto', sparse=True, dtype='uint8').fit(train)

#Transform data using small groups to reduce memory usage
m = 100000
train = vstack([ohe.transform(train[i*m:(i+1)*m]) for i in range(train.shape[0] // m + 1)])
test  = vstack([ohe.transform(test[i*m:(i+1)*m])  for i in range(test.shape[0] // m +  1)])
save_npz('train.npz', train, compressed=True)
save_npz('test.npz',  test,  compressed=True)

del ohe, train, test
gc.collect()

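# 5-fold stratified cross-validation over the saved sparse matrices, training one LightGBM model per fold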
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
skf.get_n_splits(train_ids, y_train)

lgb_test_result  = np.zeros(test_ids.shape[0])
lgb_train_result = np.zeros(train_ids.shape[0])
#xgb_test_result  = np.zeros(test_ids.shape[0])
#xgb_train_result = np.zeros(train_ids.shape[0])
counter = 0

print('\nLightGBM\n')

for train_index, test_index in skf.split(train_ids, y_train):

    print('Fold {}\n'.format(counter + 1))

    train = load_npz('train.npz')
    X_fit = vstack([train[train_index[i*m:(i+1)*m]] for i in range(train_index.shape[0] // m + 1)])
    X_val = vstack([train[test_index[i*m:(i+1)*m]]  for i in range(test_index.shape[0] //  m + 1)])
    X_fit, X_val = csr_matrix(X_fit, dtype='float32'), csr_matrix(X_val, dtype='float32')
    y_fit, y_val = y_train[train_index], y_train[test_index]

    del train
    gc.collect()

    lgb_model = lgb.LGBMClassifier(max_depth=-1,
                                   n_estimators=30000,
                                   learning_rate=0.05,
                                   num_leaves=2**12-1,
                                   colsample_bytree=0.28,
                                   objective='binary', 
                                   n_jobs=-1)

    #xgb_model = xgb.XGBClassifier(max_depth=6,
    #                              n_estimators=30000,
    #                              colsample_bytree=0.2,
    #                              learning_rate=0.1,
    #                              objective='binary:logistic', 
    #                              n_jobs=-1)

    lgb_model.fit(X_fit, y_fit, eval_metric='auc', 
                  eval_set=[(X_val, y_val)], 
                  verbose=100, early_stopping_rounds=100)

    #xgb_model.fit(X_fit, y_fit, eval_metric='auc', 
    #              eval_set=[(X_val, y_val)], 
    #              verbose=1000, early_stopping_rounds=300)

    lgb_train_result[test_index] += lgb_model.predict_proba(X_val)[:,1]
    #xgb_train_result[test_index] += xgb_model.predict_proba(X_val)[:,1]

    del X_fit, X_val, y_fit, y_val, train_index, test_index
    gc.collect()

    test = load_npz('test.npz')
    test = csr_matrix(test, dtype='float32')
    lgb_test_result += lgb_model.predict_proba(test)[:,1]
    #xgb_test_result += xgb_model.predict_proba(test)[:,1]
    counter += 1

    del test
    gc.collect()

    #Stop fitting to prevent time limit error
    #if counter == 3 : break

print('\nLightGBM VAL AUC Score: {}'.format(roc_auc_score(y_train, lgb_train_result)))
#print('\nXGBoost VAL AUC Score: {}'.format(roc_auc_score(y_train, xgb_train_result)))

submission = pd.read_csv('../input/sample_submission.csv')
submission['HasDetections'] = lgb_test_result / counter
submission.to_csv('lgb_submission.csv', index=False)
#submission['HasDetections'] = xgb_test_result / counter
#submission.to_csv('xgb_submission.csv', index=False)
#submission['HasDetections'] = 0.5 * lgb_test_result / counter  + 0.5 * xgb_test_result / counter 
#submission.to_csv('lgb_xgb_submission.csv', index=False)

print('\nDone.')

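# Additional snippets included in the same repro file: a pytz timezone example and a small plotly helper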
import pytz
from datetime import datetime

# assuming now contains a timezone aware datetime
pactz = pytz.timezone('America/Los_Angeles')
loc_dt = pactz.localize(datetime(2019, 10, 27, 6, 0, 0))
utcnow = pytz.utc
print(pytz.all_timezones)
dt = datetime(2019, 10, 31, 23, 30)
print(pactz.utcoffset(dt, is_dst=True))

def do_plotly():
    import plotly.graph_objs as go
    fig = go.Figure()
    fig.add_scatter

requirements.txt

contourpy==1.0.7
cycler==0.11.0
fonttools==4.39.0
importlib-resources==5.12.0
joblib==1.2.0
kiwisolver==1.4.4
lightgbm==3.3.5
matplotlib==3.7.1
numpy==1.24.2
packaging==23.0
pandas==1.5.3
Pillow==9.4.0
plotly==5.13.1
pyparsing==3.0.9
python-dateutil==2.8.2
pytz==2022.7.1
scikit-learn==1.2.2
scipy==1.10.1
seaborn==0.12.2
six==1.16.0
tenacity==8.2.2
threadpoolctl==3.1.0
zipp==3.15.0

If your code relies on symbols that are imported from a third-party library, include the associated import statements and specify which versions of those libraries you have installed.

VS Code extension or command-line

Are you running pyright as a VS Code extension or a command-line tool? Which version? You can find the version of the VS Code extension by clicking on the Pyright icon in the extensions panel.

Additional context

Slow at this line:
Re ["self.data.iloc[self.used_indic <shortened> " (lightgbm.basic) [2323:33]] (10856ms) [f:0, t:1, p:0, i:0, b:0]

(39760) [BG(1)]                                           Re ["concat" (lightgbm.basic) [2470:33]] (3ms) [f:0, t:1, p:0, i:0, b:0]
(39760) [BG(1)]                                         Re ["self.data.getformat" (lightgbm.basic) [2455:33]] (568ms) [f:0, t:1, p:0, i:0, b:0]
(39760) [BG(1)]                                       Re ["self.data.iloc[self.used_indic <shortened> " (lightgbm.basic) [2323:33]] (10856ms) [f:0, t:1, p:0, i:0, b:0]
[Info  - 11:32:00 AM] (39760) [BG(1)] Long operation: Re (10856ms)
(39760) [BG(1)]                                     Re ["self.get_data" (lightgbm.basic) [1805:25]] (10882ms) [f:0, t:1, p:0, i:0, b:0]
[Info  - 11:32:00 AM] (39760) [BG(1)] Long operation: Re (10882ms)
(39760) [BG(1)]                                   Re ["self.set_group" (lightgbm.basic) [1807:25]] (10882ms) [f:0, t:1, p:0, i:0, b:0]
[Info  - 11:32:00 AM] (39760) [BG(1)] Long operation: Re (10882ms)
(39760) [BG(1)]                                 Re ["self.get_label" (lightgbm.basic) [1808:24]] (10884ms) [f:0, t:1, p:0, i:0, b:0]
[Info  - 11:32:00 AM] (39760) [BG(1)] Long operation: Re (10884ms)
(39760) [BG(1)]                               Re ["self.reference._predictor" (lightgbm.basic) [1810:96]] (10884ms) [f:0, t:1, p:0, i:0, b:0]
[Info  - 11:32:00 AM] (39760) [BG(1)] Long operation: Re (10884ms)
(39760) [BG(1)]                             Re ["self.get_data" (lightgbm.basic) [1811:25]] (10884ms) [f:0, t:1, p:0, i:0, b:0]
[Info  - 11:32:00 AM] (39760) [BG(1)] Long operation: Re (10884ms)
(39760) [BG(1)]                           Re ["self._set_init_score_by_predic <shortened> " (lightgbm.basic) [1812:25]] (10884ms) [f:0, t:1, p:0, i:0, b:0]
[Info  - 11:32:00 AM] (39760) [BG(1)] Long operation: Re (10884ms)
(39760) [BG(1)]                         Re ["train_set.construct" (lightgbm.basic) [2605:13]] (11435ms) [f:1, t:1, p:2, i:3, b:1]
[Info  - 11:32:00 AM] (39760) [BG(1)] Long operation: Re (11435ms)
(39760) [BG(1)]                       Re ["params.update" (lightgbm.basic) [2607:13]] (11440ms) [f:1, t:1, p:2, i:3, b:1]
[Info  - 11:32:00 AM] (39760) [BG(1)] Long operation: Re (11440ms)
(39760) [BG(1)]                     Re ["predictor.predict" (lightgbm.basic) [3538:16]] (11630ms) [f:2, t:2, p:2, i:3, b:3]
[Info  - 11:32:00 AM] (39760) [BG(1)] Long operation: Re (11630ms)
(39760) [BG(1)]                   Re ["self._Booster.predict" (lightgbm.sklearn) [803:16]] (11632ms) [f:2, t:2, p:2, i:3, b:3]
[Info  - 11:32:00 AM] (39760) [BG(1)] Long operation: Re (11632ms)
(39760) [BG(1)]                 Re ["super().predict" (lightgbm.sklearn) [997:18]] (11633ms) [f:2, t:2, p:2, i:3, b:3]
[Info  - 11:32:00 AM] (39760) [BG(1)] Long operation: Re (11633ms)
(39760) [BG(1)]               Re ["lgb_model.predict_proba" (detector) [349:37]] (11998ms) [f:5, t:13, p:25, i:5, b:36]
[Info  - 11:32:01 AM] (39760) [BG(1)] Long operation: Re (11998ms)
(39760) [BG(1)]             Re ["gc.collect" (detector) [353:5]] (11999ms) [f:5, t:13, p:25, i:5, b:36]
[Info  - 11:32:01 AM] (39760) [BG(1)] Long operation: Re (11999ms)
(39760) [BG(1)]           Re ["load_npz" (detector) [355:12]] (11999ms) [f:5, t:13, p:25, i:5, b:36]
[Info  - 11:32:01 AM] (39760) [BG(1)] Long operation: Re (11999ms)
(39760) [BG(1)]         Re ["csr_matrix" (detector) [356:12]] (11999ms) [f:5, t:13, p:25, i:5, b:36]
[Info  - 11:32:01 AM] (39760) [BG(1)] Long operation: Re (11999ms)
(39760) [BG(1)]       Re ["lgb_model.predict_proba" (detector) [357:24]] (12088ms) [f:6, t:17, p:31, i:57, b:46]
[Info  - 11:32:01 AM] (39760) [BG(1)] Long operation: Re (12088ms)
(39760) [BG(1)]     Re ["gc.collect" (detector) [362:5]] (12088ms) [f:6, t:17, p:31, i:57, b:46]
[Info  - 11:32:01 AM] (39760) [BG(1)] Long operation: Re (12088ms)
(39760) [BG(1)]   getDeclarationsForNameNode ["format" (detector) [314:23]] (12090ms) [f:6, t:17, p:31, i:57, b:46]
[Info  - 11:32:01 AM] (39760) [BG(1)] Long operation: getDeclarationsForNameNode (12090ms)
(39760) [BG(1)]   getDeclarationsForNameNode ...
erictraut commented 1 year ago

I'm not able to repro the slow performance given the steps above. Are there other steps? Do I need to uncomment something? I don't see any calls to get_data in the code. Can you repro the problem with a more minimal code sample?

bschnurr commented 1 year ago

Oh sorry, you probably need to enable useLibraryCodeForTypes.
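For reference, a minimal pyrightconfig.json that enables it (a sketch assuming an otherwise default configuration):

{
    "useLibraryCodeForTypes": true
}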

And in the code example, it's the line with lgb_model.predict_proba(. I believe that indirectly leads into lightGBM's get_data(.
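A possible minimal sample along those lines (an untested sketch; it assumes the slow path is triggered by analyzing the predict_proba call into lightgbm):

import numpy as np
import lightgbm as lgb
from scipy.sparse import csr_matrix

# tiny sparse dataset, just enough to exercise the call chain
X = csr_matrix(np.random.rand(20, 5).astype('float32'))
y = np.random.randint(0, 2, 20)

model = lgb.LGBMClassifier(objective='binary')
model.fit(X, y)
proba = model.predict_proba(X)[:, 1]  # analysis of this call appears to be the slow part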

erictraut commented 1 year ago

This is a duplicate of https://github.com/microsoft/pyright/issues/4787. I've already spent a bunch of time looking into this one and optimizing it. I don't think there's much more I can do here. The code uses very deep call chains across multiple libraries (lightgbm, sklearn, scipy, numpy), and most of these are untyped libraries, so type inference needs to be used in all of these cases.
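To illustrate the kind of work involved, here is a contrived sketch (hypothetical code, not taken from lightgbm or sklearn): when a library carries no annotations, the return type of each function has to be inferred from its body, and that inference cascades through every call it makes.

# Hypothetical untyped call chain: to determine the type of c(),
# pyright must infer b(), which in turn requires inferring a().
def a(x):
    return x.to_numpy()   # return type depends on what x turns out to be

def b(x):
    return a(x)[0]        # depends on a()'s inferred return type

def c(x):
    return b(x) + 1       # depends on b()'s inferred return type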