munch2024 / munch


Complex Machine Learning Code Issue #84

Closed tadakane closed 8 months ago

tadakane commented 8 months ago

I found a code snippet in one of my machine learning Python projects. It calculates a linear regression coefficient vector from a matrix X of independent variables and a vector Y of dependent variables. It is hard to tell what the snippet is calculating, because it is one long statement that chains the '@' operator over and over.

The current implementation applies '@', the matrix multiplication operator, several times in a single expression, which makes the code unnecessarily convoluted and hard to read. Given the functions available in numpy and pandas, there should be a simpler and potentially more efficient way to write this calculation.
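For reference, reading the expression term by term, the one-liner appears to be the ordinary least squares normal equation. The symbols X, Y, and b follow the description above; the snippet's `independent` and `dependent` play the roles of X and Y:

```latex
b = (X^{\top} X)^{-1} X^{\top} Y
```

This assumes X^T X is invertible. If it is not, `np.linalg.inv` raises an error or returns numerically unreliable values.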

Goals for Refactoring:

  1. Improve readability of the code by simplifying the calculation functions
  2. Improve efficiency by using already available functions if possible

Below is the code snippet in question.

Code Snippet:

import numpy as np
import pandas as pd

independent = pd.DataFrame(np.random.random((3, 3)))
dependent = pd.DataFrame(np.random.random((3, 1)))

b = pd.DataFrame(np.linalg.inv((independent.T) @ independent), independent.columns, independent.columns) @ independent.T @ dependent

print(b)
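One possible refactoring, offered as a minimal sketch rather than a definitive fix: let `np.linalg.lstsq` solve the least-squares problem directly instead of forming the inverse by hand. The helper name `ols_coefficients` is hypothetical, and the fixed seed is added here only to make the output reproducible.

```python
import numpy as np
import pandas as pd


def ols_coefficients(X: pd.DataFrame, y: pd.DataFrame) -> pd.DataFrame:
    """Return the least-squares coefficients b that minimize ||X b - y||.

    Hypothetical helper name, introduced for illustration only.
    """
    # lstsq solves the least-squares problem without an explicit matrix inverse.
    coef, _residuals, _rank, _singular_values = np.linalg.lstsq(
        X.to_numpy(), y.to_numpy(), rcond=None
    )
    # Label the rows by X's columns, matching the layout of the original result.
    return pd.DataFrame(coef, index=X.columns, columns=y.columns)


# Same shapes as the original snippet; the seed is an addition for repeatability.
rng = np.random.default_rng(0)
independent = pd.DataFrame(rng.random((3, 3)))
dependent = pd.DataFrame(rng.random((3, 1)))

b = ols_coefficients(independent, dependent)
print(b)
```

The named function states what is being computed, which addresses the readability goal. Avoiding the explicit inverse is also generally considered better behaved numerically, although for a 3x3 example any speed difference is unlikely to be noticeable.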