mobilise-d / mobgap

The Mobilise-D algorithm toolbox - Implemented in Python
https://mobgap.readthedocs.io
Apache License 2.0

Properly handle pre-trained ML models for LR #89

Open AKuederle opened 9 months ago

AKuederle commented 9 months ago

Currently we provide pre-trained scikit-learn models (#52) for LR detection. They are stored as pickle files. This is an issue, as pickled estimators are not guaranteed to load across scikit-learn versions, so future versions of sklearn might not be able to load the files.

Ideally we should create a script that can retrain these models from scratch. However, the data the models are originally based on is not public.

The original model was an SVC combined with a MinMaxScaler. Its parameters were optimized with GridSearchCV (5 folds, optimized for accuracy) over the following grid:

param_grid_svm_lin = [
    {'C': [0.01, 0.1, 1, 10, 100, 1000, 10000], 'kernel': ['linear']}]
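For reference, a minimal sketch of how the setup described above could be rebuilt in a retraining script. It assumes the MinMaxScaler was part of the same pipeline as the SVC during the grid search (the notes above do not say), so the grid keys get an `svc__` prefix. The step names, the `build_search` helper, and the synthetic data are placeholders, not the original training code or data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Same grid as above, with keys prefixed by the (assumed) pipeline step name.
param_grid_svm_lin = [
    {"svc__C": [0.01, 0.1, 1, 10, 100, 1000, 10000], "svc__kernel": ["linear"]}
]


def build_search() -> GridSearchCV:
    """Return a 5-fold, accuracy-scored grid search over scaler + SVC."""
    pipeline = Pipeline([("scaler", MinMaxScaler()), ("svc", SVC())])
    return GridSearchCV(pipeline, param_grid_svm_lin, cv=5, scoring="accuracy")


# Synthetic placeholder data; the real features and labels are not public.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)

search = build_search().fit(X, y)
print(search.best_params_)
```

Putting the scaler inside the pipeline means it is refit on each training fold, which avoids leaking scaling information from the validation folds.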

The participants used for training were selected from the dataset as follows:

if "msproject" in dataset["name"]:
    hc_list1 = ["C" + str(i) for i in range(1, 9)]
    hc_list2 = ["MS019", "MS027", "MS036", "MS049", "MS050", "MS051", "MS052", "MS053", "MS056", "MS057", "MS058", "MS059", "MS060", "MS063", "MS064", "MS065"]
    if "all" in dataset["name"]:
        patient_list = patient_list
    elif "hc" in dataset["name"]:
        # not only folders with C belong to controls (referring to clinical info from e-science)
        patient_list = np.concatenate([hc_list1, hc_list2])
    else:
        hc_list = np.concatenate([hc_list1, hc_list2])
        patient_list = np.setdiff1d(patient_list, hc_list)
# exclude certain subjects due to data errors

if "MS021" in train_ids.values:
    train_ids = train_ids.loc[train_ids != "MS021"]
if "MS025" in train_ids.values:
    train_ids = train_ids.loc[train_ids != "MS025"]
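The selection and exclusion logic above could be collected into two small helpers for a retraining script. This is only a sketch: the function names and the example dataset names (`msproject_hc`, `msproject_ms`) are made up for illustration, and it assumes `train_ids` is a pandas Series of participant IDs, as the `.loc` / `.values` usage suggests.

```python
import numpy as np
import pandas as pd

# Healthy controls: folders C1..C8 plus MS-prefixed IDs that are controls
# according to the clinical info.
HC_IDS = np.concatenate([
    [f"C{i}" for i in range(1, 9)],
    ["MS019", "MS027", "MS036", "MS049", "MS050", "MS051", "MS052", "MS053",
     "MS056", "MS057", "MS058", "MS059", "MS060", "MS063", "MS064", "MS065"],
])
# Subjects excluded due to data errors.
EXCLUDED_IDS = ["MS021", "MS025"]


def select_participants(dataset_name, patient_list):
    """Mirror the if/elif/else selection from the snippet above."""
    patient_list = np.asarray(patient_list)
    if "msproject" not in dataset_name or "all" in dataset_name:
        return patient_list
    if "hc" in dataset_name:
        return HC_IDS.copy()
    # Otherwise keep only participants that are not healthy controls.
    return np.setdiff1d(patient_list, HC_IDS)


def drop_excluded(train_ids: pd.Series) -> pd.Series:
    """Remove the excluded subjects from a Series of training IDs."""
    return train_ids.loc[~train_ids.isin(EXCLUDED_IDS)]


print(select_participants("msproject_ms", ["C1", "MS001", "MS019"]))
print(drop_excluded(pd.Series(["MS020", "MS021", "MS025"])).tolist())
```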

Further notes:

Even though this problem could be solved completely with sklearn, it might be helpful to use the respective tpcp tooling. That way we can reuse the pipelines we already have and also create the basis for L/R validation of any algorithm.

AKuederle commented 6 months ago

Notes: