plasma-umass / scalene

Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals

Success stories using Scalene's optimization suggestions #554

Open · emeryberger opened this issue 1 year ago

emeryberger commented 1 year ago

This thread is meant to document and collect successful optimizations suggested by Scalene (vs. the more general success stories given here: https://github.com/plasma-umass/scalene/issues/58).

emeryberger commented 1 year ago

2x speedup; memory consumption halved

Issue https://github.com/plasma-umass/scalene/issues/31 includes the following code, which we turn into a function:

```Python
import numpy as np

def main1():
    x = np.array(range(10**7))
    y = np.array(np.random.uniform(0, 100, size=10**8))
```

Scalene (https://github.com/plasma-umass/scalene/commit/2c18f59fb938b6b8ce4b525ca553275eacbf5fd2) suggests the following optimization for this function:

```Python
# Proposed optimization:
# Vectorized operations are used to reduce the time and memory complexity of the code.
def main1():
    # After optimization
    x = np.arange(10**7)  # np.arange is faster than np.array
    y = np.random.uniform(0, 100, size=10**8)  # No need to wrap in np.array
```

This optimization speeds up the code by almost 2x and halves memory consumption (from 1.5 GB to 773 MB).
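For reference, the timing half of that result can be reproduced with a plain stopwatch comparison; the sketch below is not from the original issue (the `_original`/`_optimized` function names are ours), and the memory figures come from running the same script under Scalene (`scalene script.py`).

```Python
# Minimal timing sketch (not from the issue): time the two variants back to back.
# Memory numbers in the thread come from Scalene itself, e.g. `scalene script.py`.
import time

import numpy as np

def main1_original():
    x = np.array(range(10**7))                            # builds a Python range, then converts it
    y = np.array(np.random.uniform(0, 100, size=10**8))   # wraps an existing ndarray in an extra copy

def main1_optimized():
    x = np.arange(10**7)                                  # allocates the array directly
    y = np.random.uniform(0, 100, size=10**8)             # already an ndarray; no extra copy

for fn in (main1_original, main1_optimized):
    start = time.perf_counter()
    fn()
    print(f"{fn.__name__}: {time.perf_counter() - start:.2f} s")
```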

emeryberger commented 1 year ago

90x speedup

https://github.com/plasma-umass/scalene/issues/58#issuecomment-720252786 presents the following code:

```Python
for i in range(n_features):
    for n in range(n_samples):
        subgrad[i] += (- y[n] * X[n][i]) if y[n] * (np.dot(X[n], w) + b) < 1 else 0
    subgrad[i] += self.lambda1 * (-1 if w[i] < 0 else 1) + 2 * self.lambda2 * w[i]
```

Scalene proposes the following optimization:

```Python
# Vectorized operations to replace for loops
subgrad[:-1] = np.sum(-y[:, None] * X * (y * (X.dot(w) + b) < 1)[:, None], axis=0)
subgrad[:-1] += self.lambda1 * np.sign(w) + 2 * self.lambda2 * w
subgrad[-1] = np.sum(-y * (y * (X.dot(w) + b) < 1))
```

Scalene's proposed optimization accelerates the original code by roughly 90x (from 89 seconds to 1 second when running 500 iterations) and takes full advantage of multiple cores.

Original code

```Python
# original code
def subgradient(self, wb, X, y):
    """Compute the subgradient of the objective function.

    Arguments:
        wb (ndarray, shape = (n_features+1,)):
            concatenation of the weight vector with the bias wb=[w,b]
        X (ndarray, shape = (n_samples, n_features)):
            Training input matrix where each row is a feature vector.
            The data in X are passed in without a bias column!
        y (ndarray, shape = (n_samples,)):
            Training target. Each entry is either -1 or 1.

    Returns:
        subgrad (ndarray, shape = (n_features+1,)):
            subgradient of the objective function with respect to the
            coefficients wb=[w,b] of the linear model
    """
    n_samples = X.shape[0]
    n_features = X.shape[1]
    w = wb[:-1]
    b = wb[-1]
    subgrad = np.zeros(n_features + 1)
    for i in range(n_features):
        for n in range(n_samples):
            subgrad[i] += (- y[n] * X[n][i]) if y[n] * (np.dot(X[n], w) + b) < 1 else 0
        subgrad[i] += self.lambda1 * (-1 if w[i] < 0 else 1) + 2 * self.lambda2 * w[i]
    for n in range(n_samples):
        subgrad[-1] += - y[n] if y[n] * (np.dot(X[n], w) + b) < 1 else 0
    return subgrad
```
Optimized code

```Python
# Proposed optimization:
# This code has been optimized by replacing the for loops with vectorized operations.
# This reduces the time complexity from O(n^2) to O(n), resulting in a substantial speedup.
def subgradient(self, wb, X, y):
    """Compute the subgradient of the objective function.

    Arguments:
        wb (ndarray, shape = (n_features+1,)):
            concatenation of the weight vector with the bias wb=[w,b]
        X (ndarray, shape = (n_samples, n_features)):
            Training input matrix where each row is a feature vector.
            The data in X are passed in without a bias column!
        y (ndarray, shape = (n_samples,)):
            Training target. Each entry is either -1 or 1.

    Returns:
        subgrad (ndarray, shape = (n_features+1,)):
            subgradient of the objective function with respect to the
            coefficients wb=[w,b] of the linear model
    """
    n_samples = X.shape[0]
    n_features = X.shape[1]
    w = wb[:-1]
    b = wb[-1]

    # Vectorized operations to replace for loops
    subgrad = np.zeros(n_features + 1)
    subgrad[:-1] = np.sum(-y[:, None] * X * (y * (X.dot(w) + b) < 1)[:, None], axis=0)
    subgrad[:-1] += self.lambda1 * np.sign(w) + 2 * self.lambda2 * w
    subgrad[-1] = np.sum(-y * (y * (X.dot(w) + b) < 1))
    return subgrad
```
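As a quick sanity check (not part of the thread), the two versions can be compared directly on small random inputs. The names `subgradient_loop` and `subgradient_vectorized` below are assumed aliases for the original and optimized functions above, and a `SimpleNamespace` stands in for `self`, which only needs `lambda1` and `lambda2`.

```Python
# Equivalence sketch (not from the thread). Assumes the two functions above have
# been defined as subgradient_loop and subgradient_vectorized; SimpleNamespace
# stands in for `self`, which only supplies lambda1 and lambda2.
import numpy as np
from types import SimpleNamespace

rng = np.random.default_rng(0)
n_samples, n_features = 200, 30
X = rng.normal(size=(n_samples, n_features))
y = rng.choice([-1.0, 1.0], size=n_samples)   # labels in {-1, +1}
wb = rng.normal(size=n_features + 1)          # concatenated [w, b]
model = SimpleNamespace(lambda1=0.1, lambda2=0.01)

# The two results should agree up to floating-point noise.
assert np.allclose(subgradient_loop(model, wb, X, y),
                   subgradient_vectorized(model, wb, X, y))
```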
emeryberger commented 1 year ago

18x speedup over original; 1.8x speedup over manual optimization

https://github.com/plasma-umass/scalene/issues/58#issuecomment-812953671 presents the following code (with some edits to put the bottleneck code into a function, do_it()):

```Python
import numpy as np
import pandas as pd

column_names_example = [i for i in range(10000)]
index = pd.MultiIndex.from_tuples([("left", c) for c in column_names_example] + [("right", c) for c in column_names_example])
df = pd.DataFrame(np.random.rand(1000, 20000), columns=index)

def keep_column(left_col, right_col):
  return left_col[left_col.first_valid_index()] > right_col[right_col.last_valid_index()]

def do_it():
  v = [c for c in column_names_example if keep_column(df["left"][c], df["right"][c])]
```

That code takes about 6.54 seconds on an M1 MacBook.

After about a dozen attempts (most of which did not run), Scalene produced a solid optimization:

```Python
# Proposed optimization: Replaced list comprehension with vectorized operation.
def do_it():
    v = df["left"].loc[df["left"].first_valid_index()] > df["right"].loc[df["right"].last_valid_index()]
    return v.index[v].tolist()
```

That code runs in 0.37 seconds, corresponding to an almost 18x increase in performance. The author of the comment had previously produced the following optimization:

```Python
def do_it():
    df_l = df["left"]
    df_r = df["right"]
    return [c for c in column_names_example if keep_column(df_l[c], df_r[c])]
```

That code takes 0.67 seconds, making Scalene's optimization roughly 1.8x faster.
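The three timings are straightforward to reproduce by defining the versions side by side (e.g. as `do_it_original`, `do_it_manual`, and `do_it_scalene`; those names are ours, not from the comment) and timing each with `timeit`:

```Python
# Timing sketch, not from the comment. Assumes df, column_names_example,
# keep_column, and the three do_it variants above are defined in this module
# as do_it_original, do_it_manual, and do_it_scalene (hypothetical names).
import timeit

for name in ("do_it_original", "do_it_manual", "do_it_scalene"):
    seconds = timeit.timeit(f"{name}()", globals=globals(), number=3) / 3
    print(f"{name}: {seconds:.2f} s per call")
```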

ianozsvald commented 1 month ago

@emeryberger you may be happy to hear that I've added Scalene to the Profiling chapter of the 3rd edition of our High Performance Python book; it is due for publication in 2025. Thanks!