prescient-design / cortex

A Modular Architecture for Deep Learning Systems
Apache License 2.0

Clarification on Numerical Values in GRAPH_OBJ_TRANSFORM #5

Closed dezhi0730 closed 2 months ago

dezhi0730 commented 2 months ago

Hello,

The implementation and thought process of this project are impressive and have been very insightful.

I have a question regarding the GRAPH_OBJ_TRANSFORM dictionary in the code. The specific section is as follows:

GRAPH_OBJ_TRANSFORM = {
    "stability": {"scale": 1 / 2.0, "shift": 2.0},
    "log_fluorescence": {"scale": 1 / 7.0, "shift": -4.0},
}

Could you please clarify how the values for "scale" and "shift" were determined for the "stability" and "log_fluorescence" transformations?

Are these values derived from the mean and variance of the training data, or is there another method or rationale behind their selection?

Understanding the origin and reasoning behind these values would help in better comprehending the data preprocessing steps.

Thank you!

Best regards,

samuelstanton commented 2 months ago

hi, thanks for the kind words :)

the intent of the transform is to shift and rescale the objectives to [0, 1]. The "scale" and "shift" values are supposed to be computed from the data, e.g.

shift = -1 * data.min()
scale = 1 / (data.max() - data.min())

upon inspection I think I forgot to update these values when I transitioned the code from an internal use case to the public version, thanks for flagging! Will double check that these values are consistent with the data and push an update if not
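For readers following along, here is a minimal sketch of the min-max transform described above. It assumes the values are applied as `(x + shift) * scale`, which is what the two formulas imply but is not confirmed against the cortex source. The helper names `fit_min_max` and `apply_transform` and the sample values are hypothetical, for illustration only.

```python
import numpy as np


def fit_min_max(data):
    """Return (shift, scale) such that (data + shift) * scale lies in [0, 1]."""
    data = np.asarray(data, dtype=float)
    data_min = data.min()
    data_range = data.max() - data_min
    if data_range == 0:
        # A constant objective has no range, so it cannot be rescaled.
        raise ValueError("data is constant; cannot rescale to [0, 1]")
    return -data_min, 1.0 / data_range


def apply_transform(data, shift, scale):
    # Assumed order of operations: shift first, then scale.
    return (np.asarray(data, dtype=float) + shift) * scale


# Made-up values, not taken from any real dataset.
values = np.array([1.5, 2.0, 3.5, 4.0])
shift, scale = fit_min_max(values)
scaled = apply_transform(values, shift, scale)
print(scaled.min(), scaled.max())
```

With values fitted this way, the smallest observation maps to 0 and the largest to 1; data outside the fitted range (for example, new samples) would fall outside [0, 1].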

samuelstanton commented 2 months ago

Looks like they are in fact incorrect. If you're curious, this is how I compute them

import pandas as pd
from cortex.data.dataset import TAPEFluorescenceDataset

train = TAPEFluorescenceDataset(
    root="./.cache",
    download=True,
    train=True,
)
test = TAPEFluorescenceDataset(
    root="./.cache",
    download=True,
    train=False,
)

df = pd.concat([train._data, test._data], ignore_index=True)

# Use distinct names so the built-ins `min` and `range` are not shadowed.
data_min = df.log_fluorescence.min()
data_range = df.log_fluorescence.max() - data_min

print(f"Min: {data_min}")
print(f"Range: {data_range}")
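As a rough sketch of how the printed minimum and range could be turned into a dictionary entry, and how an existing entry can be sanity-checked, the snippet below inverts the formulas from earlier in the thread. The helpers `to_transform_entry` and `implied_bounds` are hypothetical and not part of cortex; the `(x + shift) * scale` convention is an assumption.

```python
def to_transform_entry(data_min, data_range):
    """Build a {"scale", "shift"} entry from a minimum and a range."""
    return {"scale": 1.0 / data_range, "shift": -data_min}


def implied_bounds(entry):
    """Recover the (min, max) that a given entry would map onto [0, 1]."""
    data_min = -entry["shift"]
    data_max = data_min + 1.0 / entry["scale"]
    return data_min, data_max


# The entry quoted in the original question.
old_entry = {"scale": 1 / 7.0, "shift": -4.0}
low, high = implied_bounds(old_entry)
print(f"Old entry implies data in roughly [{low:.2f}, {high:.2f}]")

# Round trip: an entry built from a min and range recovers the same bounds.
new_entry = to_transform_entry(data_min=low, data_range=high - low)
```

Comparing the implied bounds against the minimum and maximum actually observed in the dataset is a quick way to check whether a hard-coded entry matches the data.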