oracle / tribuo

Tribuo - A Java machine learning library
https://tribuo.org
Apache License 2.0

HDBSCAN implementation in 4.3+ #355

Closed. abdolence closed this issue 6 months ago

abdolence commented 8 months ago

Hey,

Thanks for your work on this library and for making this open-source.

I was testing different implementations of HDBSCAN, and made a trivial test like this for Tribuo:

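import java.util.ArrayList;
import java.util.HashSet;

import org.tribuo.Example;
import org.tribuo.MutableDataset;
import org.tribuo.clustering.ClusterID;
import org.tribuo.clustering.ClusteringFactory;
import org.tribuo.clustering.hdbscan.HdbscanModel;
import org.tribuo.clustering.hdbscan.HdbscanTrainer;
import org.tribuo.datasource.ListDataSource;
import org.tribuo.impl.ArrayExample;
import org.tribuo.provenance.SimpleDataSourceProvenance;

public void test() {
    final HdbscanTrainer trainer = new HdbscanTrainer(
            100,                               // minClusterSize
            HdbscanTrainer.Distance.EUCLIDEAN, // distance function
            10,                                // k (number of neighbours)
            1                                  // numThreads
    );

    var outputFactory = new ClusteringFactory();

    var dataProvenance = new SimpleDataSourceProvenance(
            "test",
            outputFactory
    );

    var dataList = new ArrayList<Example<ClusterID>>();

    // Ten copies of two well-separated 1-D groups: one in [1, 2], one in [10, 20]
    for (var n = 0; n < 10; n++) {
        for (double i = 1.0; i <= 2.0; i += 0.1) {
            dataList.add(createExample(i));
        }
        for (double i = 10.0; i <= 20.0; i += 0.1) {
            dataList.add(createExample(i));
        }
    }

    var dataSource = new ListDataSource<>(
            dataList,
            outputFactory,
            dataProvenance
    );

    var trainData = new MutableDataset<ClusterID>(
            dataSource
    );

    HdbscanModel model = trainer.train(trainData);
    // Count the distinct cluster labels assigned during training
    System.out.printf("Num clusters: %d%n", new HashSet<>(model.getClusterLabels()).size());

    // Predicting on the same training data
    var predictions = model.predict(trainData);

    for (var prediction : predictions) {
        System.out.printf("%s: %d %f%n", prediction.getExample(), prediction.getOutput().getID(), prediction.getOutput().getScore());
    }
}

// Builds a single-feature example ("A" = value) with no cluster assigned
private Example<ClusterID> createExample(double value) {
    var example = new ArrayExample<ClusterID>(
            ClusteringFactory.UNASSIGNED_CLUSTER_ID,
            new String[]{"A"},
            new double[]{value}
    );
    return example;
}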
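A side note on the data generation above: the inner loops step a double by 0.1, so the exact number of generated points depends on floating-point rounding, and the upper bound may or may not be included. If a fixed point count is wanted when comparing versions, an integer counter can drive the loop instead. The following is only a sketch; `GridValues` and `range` are hypothetical names, not part of Tribuo or of the reported test.

```java
import java.util.ArrayList;
import java.util.List;

final class GridValues {
    private GridValues() {}

    // Hypothetical helper: returns startTenths/10, (startTenths+1)/10, ..., endTenths/10.
    // The loop counter is an int, so the number of values is always
    // endTenths - startTenths + 1, independent of floating-point rounding.
    static List<Double> range(int startTenths, int endTenths) {
        List<Double> values = new ArrayList<>();
        for (int t = startTenths; t <= endTenths; t++) {
            values.add(t / 10.0);
        }
        return values;
    }

    public static void main(String[] args) {
        // Same intended grids as the test: [1.0, 2.0] and [10.0, 20.0] in steps of 0.1
        System.out.println(range(10, 20).size());
        System.out.println(range(100, 200).size());
    }
}
```

Each returned value could then be passed to `createExample` in place of the double loop variable.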

Surprisingly, version 4.2 worked as expected(?): it identified 2 clusters, and prediction worked as well. With version 4.3, exactly the same code also identifies 2 clusters, but prediction returns every point as an outlier (ClusterID: 0 with score: 0.999900).
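To make the difference between versions easier to see than by reading one printed line per prediction, the predicted IDs could be tallied. This is a minimal sketch that assumes the IDs have already been collected into a list (for example from `prediction.getOutput().getID()` in the loop above); `ClusterTally` and `countByClusterId` are hypothetical names, and the sample IDs in `main` are made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ClusterTally {
    private ClusterTally() {}

    // Counts how many predictions fall into each cluster ID.
    // A TreeMap keeps the IDs sorted, so ID 0 (the outlier ID in the report above) comes first.
    static Map<Integer, Integer> countByClusterId(List<Integer> ids) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int id : ids) {
            counts.merge(id, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical predicted IDs, not output from Tribuo
        List<Integer> ids = List.of(0, 0, 3, 4, 4, 4);
        System.out.println(countByClusterId(ids));
    }
}
```

With the behaviour described for 4.3, such a tally would contain a single entry for ID 0 covering all points.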

Am I missing something, or does the HDBSCAN implementation in 4.3 give unexpected results?

Is your question about a specific ML algorithm or approach? HDBSCAN implementation in Tribuo.


abdolence commented 8 months ago

The latest v5.0.0-SNAPSHOT seems to behave the same as v4.3.

geoffreydstewart commented 8 months ago

Hi @abdolence, thank you for providing the details of the behavior you're observing. I'll check whether this is the expected functionality and let you know.

geoffreydstewart commented 8 months ago

Hello again @abdolence, the prediction behavior you've reported in 4.3+ is not working as expected with the sample you've provided. We'll look into that, and get back to you. Thanks for reporting the issue!

abdolence commented 8 months ago

Thanks for the updates, looking forward to the resolution and happy to help with the testing of the SNAPSHOT version if needed.