uni-courses / snncompare

Runs networkx graphs representing spiking neural networks of LIF-neurons on lava-nc or networkx.
GNU Affero General Public License v3.0

Compute the statistical significance of the adaptation mechanism results. #206

Closed a-t-0 closed 1 year ago

a-t-0 commented 1 year ago

Since the results are binary (pass/fail), a logistic regression is used to compute the p-value of the adaptation mechanisms' effect.
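Concretely, logistic regression models the pass probability as P(pass) = 1 / (1 + exp(-(b0 + b1 * x))), where x is the adaptation level; the p-value reported for b1 tests the null hypothesis b1 = 0, i.e. that adaptation has no effect on the pass probability.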

# Source: https://stats.stackexchange.com/questions/5935/anova-on-binomial-data
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Data
no_fertilizer = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0]  # Scores for no fertilizer
fertilizer_2kg = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0]  # Scores for 2kg fertilizer
fertilizer_4kg = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0]  # Scores for 4kg fertilizer
fertilizer_6kg = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0]  # Scores for 6kg fertilizer

# Combine the scores and create the target variable
scores = np.concatenate((no_fertilizer, fertilizer_2kg, fertilizer_4kg, fertilizer_6kg))
fertilizer_amounts = [0] * len(no_fertilizer) + [2] * len(fertilizer_2kg) + [4] * len(fertilizer_4kg) + [6] * len(fertilizer_6kg)

# Create a pandas DataFrame
data = pd.DataFrame({'Scores': scores, 'Fertilizer': fertilizer_amounts})

# Add a constant column for the intercept
data = sm.add_constant(data)

# Perform logistic regression
logit = sm.Logit(data['Scores'], data[['const', 'Fertilizer']])
result = logit.fit()

# Print the results
print(result.summary())
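To pull out just the significance number programmatically, a small follow-up sketch (reusing the result object fitted above; 'Fertilizer' matches the column name used there):

# The Wald p-value on the Fertilizer coefficient tests whether the
# adaptation level significantly affects the pass probability.
p_value = result.pvalues['Fertilizer']
print(f"p-value for the Fertilizer coefficient: {p_value:.4f}")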

For the sparse redundancy, where more redundancy is not necessarily better, one can pool the results of all redundancy levels into a single group and then run the logistic regression on the pooled data:

# Source: https://stats.stackexchange.com/questions/5935/anova-on-binomial-data
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Data
no_fertilizer = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0]  # Scores for no fertilizer
fertilizer_2kg = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0]  # Pooled scores across all non-zero fertilizer amounts
# Combine the scores and create the target variable
scores = np.concatenate((no_fertilizer, fertilizer_2kg))
fertilizer_amounts = [0] * len(no_fertilizer) + [2] * len(fertilizer_2kg) 

# Create a pandas DataFrame
data = pd.DataFrame({'Scores': scores, 'Fertilizer': fertilizer_amounts})

# Add a constant column for the intercept
data = sm.add_constant(data)

# Perform logistic regression
logit = sm.Logit(data['Scores'], data[['const', 'Fertilizer']])
result = logit.fit()

# Print the results
print(result.summary())
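With only two pooled groups, an overall significance check can also be read from the likelihood-ratio test that statsmodels computes against the intercept-only model (a minimal sketch, reusing the fitted result from the script above):

# Likelihood-ratio test of the fitted model vs. the intercept-only model.
print(f"LLR p-value: {result.llr_pvalue:.4f}")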
a-t-0 commented 1 year ago

Assumptions of logistic regression:

The script below repeats the regression and includes checks of its assumptions:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

# Data
no_fertilizer = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0]  # Scores for no fertilizer
fertilizer_2kg = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0]  # Scores for 2kg fertilizer
fertilizer_4kg = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0]  # Scores for 4kg fertilizer
fertilizer_6kg = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0]  # Scores for 6kg fertilizer

# Combine the scores and create the target variable
scores = np.concatenate((no_fertilizer, fertilizer_2kg, fertilizer_4kg, fertilizer_6kg))
fertilizer_amounts = [0] * len(no_fertilizer) + [2] * len(fertilizer_2kg) + [4] * len(fertilizer_4kg) + [6] * len(fertilizer_6kg)

# Create a pandas DataFrame
data = pd.DataFrame({'Scores': scores, 'Fertilizer': fertilizer_amounts})

# Add a constant column for the intercept
data = sm.add_constant(data)

# Perform logistic regression
logit = sm.Logit(data['Scores'], data[['const', 'Fertilizer']])
result = logit.fit()

# Print the results
print(result.summary())

# Residual Analysis: A scatter plot of the residuals against the predicted
# values is created to check for any patterns or trends. The red dashed line
# represents the ideal situation where residuals are randomly distributed
# around zero.
predicted_values = result.predict()
residuals = data['Scores'] - predicted_values

# Plotting Residuals
sns.scatterplot(x=predicted_values, y=residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
input(
    "Were the residuals randomly scattered around zero? "
    + "If yes, the assumption holds; if not, it is violated: "
    + "the residuals should show no pattern.")

# Checking for multicollinearity
# Multicollinearity: Multicollinearity concerns correlations among the
# independent variables; with a single predictor this check is trivially
# satisfied, but the correlation matrix is printed for reference. A
# correlation close to 1 or -1 indicates a high correlation between the
# variables.
correlation_matrix = data[['Scores', 'Fertilizer']].corr()
print('\nCorrelation Matrix:')
print(correlation_matrix)
input("Are the off-diagonal entries of the correlation matrix close to 1 or -1? "
    + "If yes, the multicollinearity assumption is violated.")

# Checking for outliers
# Outlier Analysis: A box plot of the residuals against the fertilizer amounts
# is created.
sns.boxplot(x=data['Fertilizer'], y=residuals)
plt.xlabel('Fertilizer')
plt.ylabel('Residuals')
plt.title('Residuals vs. Fertilizer')
plt.show()
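As an additional, optional outlier check, a minimal sketch (assuming the predicted_values and residuals variables from the script above): Pearson residuals scale the raw residuals by the binomial standard deviation, so observations with |r| > 2 are worth inspecting.

# Pearson residuals for a binary outcome: raw residual divided by the
# binomial standard deviation sqrt(p * (1 - p)).
pearson_resid = residuals / np.sqrt(predicted_values * (1 - predicted_values))
print('Potential outliers (|Pearson residual| > 2):')
print(data[np.abs(pearson_resid) > 2])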
a-t-0 commented 1 year ago

TODO: load the actual duration.

Some graphs are loaded from file, and the radiation stage requires the actual_duration key to be present in the graph attribute dict (visible via print(unradiated_graph.network.graph.graph)) when you run:

python -m src.snncompare -e qt0 -j1 -j2 -j4 -j5

So load the actual simulation duration from file when loading those graphs.
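A minimal sketch of the kind of guard that could go into the graph-loading path (the ensure_actual_duration helper name is hypothetical, not part of snncompare; it only assumes the networkx convention that graph-level attributes live in graph.graph):

import networkx as nx

def ensure_actual_duration(graph: nx.DiGraph, filepath: str) -> None:
    # Hypothetical guard: the radiation stage needs actual_duration,
    # so fail loudly if a graph loaded from file does not carry it.
    if 'actual_duration' not in graph.graph:
        raise KeyError(
            f"Graph loaded from {filepath} lacks 'actual_duration'; "
            "store the simulated duration when saving the graph."
        )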
