Jupyter implementation for new datasets

tydymy commented 1 year ago

Hello, is there a way to run this classifier/baseline comparison using a CSV file or a DataFrame as input, with a column holding the class label?

rolan2kn commented 1 year ago

I have a new version that uses pandas. What you ask can be done by adding a new data type, PANDAS or CUSTOM, in dataset_tools/dataset_types.py and then implementing a new method in dataset_tools/dataset_builder.py. That is all it takes. I do not have much time right now, but I will submit it this week.

tydymy commented 1 year ago

Thank you. Is there a specific file in the repository I can use to implement steps 1-4 in the manuscript? I don't need the simulated dataset and metrics, but I would like to try the classification algorithm on my own simplicial complexes.

rolan2kn commented 1 year ago

Basically, the algorithm in the paper looks like this:

from dataset_tools.dataset_handler import DatasetHandler

dim = 3                # desired maximum dimension of the complex
X, Y = load_my_data()  # your own loading function: returns the points X (a subset of R^n) and their labels Y

dataset_handler = DatasetHandler()  
dataset_handler.load_dataset(X, Y)  # this will call the load_from_scratch method in DatasetBuilder 
                                    # now your data handler contains your data.

rips_tdabc = RepeatedCVPredictProba(
    data_handler=dataset_handler,
    complex_type=SimplicialComplexType(type=SimplicialComplexType.RIPS, max_dim=dim),
    fold_sequence=NORMAL,
    selector_type=SelectorTypeHandler(SelectorTypeHandler.AVERAGE | SelectorTypeHandler.MAXIMAL
                                      | SelectorTypeHandler.RANDOMIZED | SelectorTypeHandler.MEDIAN),
    classifier_ev=ClassifierTypeHandler.PTDABC | ClassifierTypeHandler.KNN | ClassifierTypeHandler.WKNN
                  | ClassifierTypeHandler.RF | ClassifierTypeHandler.LSVM,
    algorithm_mode=FilteredSimplicialComplexBuilder.DIRECT,
    pi_stage=PersistenceIntervalStage.DEATH,
    # alternative solvers: PersistentHomologySolver.GUDHI, PersistentHomologySolver.RIPSER
    ph_solver=PersistentHomologySolver.GIOTTO,
    **kwargs)

results = rips_tdabc.execute()
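
The `load_my_data()` call above is left to the user. For the CSV/DataFrame case asked about in this thread, a minimal sketch could look like the following. It is not part of the repository: the column name `label` and the assumption that all other columns are numeric features are hypothetical, and whether `load_dataset` expects NumPy arrays in exactly this shape should be checked against the code.

```python
import io

import pandas as pd


def load_my_data(source, label_column="label"):
    """Read a CSV (path or file-like object) and split it into features and labels.

    `label_column` is a hypothetical column name; change it to match your file.
    """
    df = pd.read_csv(source)
    # Labels: the single column holding the class of each sample.
    Y = df[label_column].to_numpy()
    # Features: every remaining column, as a float array (n_samples, n_features).
    X = df.drop(columns=[label_column]).to_numpy(dtype=float)
    return X, Y


# Tiny in-memory CSV standing in for a real file on disk.
csv_text = "x1,x2,x3,label\n0.1,0.2,0.3,a\n1.0,1.1,1.2,b\n0.0,0.5,0.9,a\n"
X, Y = load_my_data(io.StringIO(csv_text))
print(X.shape, Y.tolist())
```

If the data is already in a DataFrame, the two lines inside the function that build `Y` and `X` are all that is needed.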

But as I said, it was written this way to give enough flexibility at the entry point to run my benchmarks. The Dataset4Test class encapsulates the data selection, DataTransformation allows preprocessing of that data, there is a loop for testing different dimensions, and so on. If you do not want to use the baseline methods, pass only ClassifierTypeHandler.PTDABC, which is the TDA-based classifier explained in the paper. It is not mandatory to use RepeatedCVPredictProba, but in that case you should focus on the execute, tda_execute, and fit methods of that class.
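
To illustrate what combining classifiers with `|` and passing only PTDABC amounts to, here is a small stand-alone sketch using Python's `enum.IntFlag`. The class `ClassifierFlags` and its values are hypothetical stand-ins; the real `ClassifierTypeHandler` in the repository may be implemented differently (for example with plain integer constants).

```python
from enum import IntFlag, auto


class ClassifierFlags(IntFlag):
    """Hypothetical stand-in for ClassifierTypeHandler; values are assumptions."""
    PTDABC = auto()
    KNN = auto()
    WKNN = auto()
    RF = auto()
    LSVM = auto()


def selected(flags):
    """List the names of the classifiers whose bit is set in `flags`."""
    return [member.name for member in ClassifierFlags if flags & member]


# TDA-based classifier plus all baselines, as in the snippet above.
with_baselines = (ClassifierFlags.PTDABC | ClassifierFlags.KNN | ClassifierFlags.WKNN
                  | ClassifierFlags.RF | ClassifierFlags.LSVM)
# Only the TDA-based classifier, with no baselines.
tda_only = ClassifierFlags.PTDABC

print(selected(with_baselines))
print(selected(tda_only))
```

Under this reading, dropping the baselines only means leaving their flags out of the `classifier_ev` argument.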

I hope this helps.

rolan2kn / TDABC-4-ADAC
