oracle / tribuo

Tribuo - A Java machine learning library
https://tribuo.org
Apache License 2.0

SQLDataSource Example #366

Closed lazydog2 closed 2 months ago

lazydog2 commented 2 months ago

Ask the question I'm trying to use an SQLDataSource for HDBSCAN clustering and would appreciate an example of using SQLDataSource, as there doesn't seem to be one in the documentation.

Is your question about a specific ML algorithm or approach? HDBSCAN

Is your question about a specific Tribuo class? SQLDataSource

System details

Additional context N/A

Craigacp commented 2 months ago

Aside from the SQL connection information SQLDataSource behaves exactly like CSVDataSource in terms of how Tribuo processes the data, which you can see in the columnar data tutorial. Is there something about how the SQL connection works that we should document better? I think internally we may only have used it against Oracle DBs, but it should talk to anything that works via JDBC.

lazydog2 commented 2 months ago

Thanks @Craigacp, I've been working my way through the documentation, figuring out how to construct the SQLDataSource. The SQLDataSource constructor includes an OutputFactory argument that the CSVDataSource's doesn't; I believe this should be a ClusteringFactory when clustering with HDBSCAN. An SQLDataSource example, whether for HDBSCAN or another technique, would help someone unfamiliar with Tribuo determine the correct usage.

Craigacp commented 2 months ago

Yes, you should use a ClusteringFactory for HDBSCAN. The SQLDataSource should probably be refactored to pull its output factory from the RowProcessor, which is how the CSVDataSource works. We must have missed that in the refactor which unified the columnar processing infrastructure.

If you have a ground truth clustering that you want to load in for later comparison then you'll need to write a ResponseProcessor which creates the appropriate ClusterID object from your source. Otherwise use EmptyResponseProcessor and pass it a ClusteringFactory.
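A minimal sketch of that construction (assuming Tribuo 4.x on the classpath; the column names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

import org.tribuo.clustering.ClusterID;
import org.tribuo.clustering.ClusteringFactory;
import org.tribuo.data.columnar.FieldProcessor;
import org.tribuo.data.columnar.RowProcessor;
import org.tribuo.data.columnar.processors.field.DoubleFieldProcessor;
import org.tribuo.data.columnar.processors.response.EmptyResponseProcessor;

public class ClusteringRowProcessorSketch {
    static RowProcessor<ClusterID> buildRowProcessor() {
        // No ground-truth clusters, so every row gets the unassigned
        // ClusterID produced by ClusteringFactory via EmptyResponseProcessor.
        var responseProcessor = new EmptyResponseProcessor<>(new ClusteringFactory());

        // Map each source column name to a FieldProcessor (hypothetical columns).
        Map<String, FieldProcessor> fieldProcessors = new HashMap<>();
        fieldProcessors.put("feature1", new DoubleFieldProcessor("feature1"));
        fieldProcessors.put("feature2", new DoubleFieldProcessor("feature2"));

        return new RowProcessor<>(responseProcessor, fieldProcessors);
    }
}
```

The same RowProcessor works for both CSVDataSource and SQLDataSource, since they share the columnar processing infrastructure.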

lazydog2 commented 2 months ago

Thanks @Craigacp. Is the RowProcessor tutorial you mentioned in response to leccelecce's comment (https://github.com/oracle/tribuo/issues/50#issuecomment-709612052) the same columnar data tutorial you mentioned in your comment above?

While not related to this SQLDataSource example issue, I am also looking for an example of using a non-double Feature value, e.g. a BigInteger.

Craigacp commented 2 months ago

Yes, that's the one.

All Tribuo feature values are doubles. You can use RowProcessor to transform other things into features, but the values must be doubles, as that's what all the computational backends support.
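For the BigInteger case, the conversion itself is plain Java: a custom FieldProcessor would simply emit the widened double value, accepting that integers above 2^53 lose precision. A standalone sketch of just the conversion step (the method name is hypothetical):

```java
import java.math.BigInteger;

public class BigIntegerToDouble {
    // Converts a column value parsed as a BigInteger into the double
    // that a custom FieldProcessor would emit as the feature value.
    static double toFeatureValue(String columnText) {
        return new BigInteger(columnText).doubleValue();
    }

    public static void main(String[] args) {
        System.out.println(toFeatureValue("12345"));            // exact: 12345.0
        // Values above 2^53 cannot be represented exactly as doubles.
        System.out.println(toFeatureValue("9007199254740993")); // rounds to 9007199254740992
    }
}
```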

lazydog2 commented 2 months ago

Hi @Craigacp, I'm having trouble with either the SQLDataSource or a custom Processor. When I run the code, it produces the following output:

INFO: Iterated over 1000000 rows
[WARNING] 
java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
    at org.tribuo.math.neighbour.bruteforce.NeighboursBruteForce.<init> (NeighboursBruteForce.java:48)
    at org.tribuo.math.neighbour.bruteforce.NeighboursBruteForceFactory.createNeighboursQuery (NeighboursBruteForceFactory.java:99)
    at org.tribuo.math.neighbour.bruteforce.NeighboursBruteForceFactory.createNeighboursQuery (NeighboursBruteForceFactory.java:38)
    at org.tribuo.clustering.hdbscan.HdbscanTrainer.calculateCoreDistances (HdbscanTrainer.java:335)
    at org.tribuo.clustering.hdbscan.HdbscanTrainer.train (HdbscanTrainer.java:265)
    at org.tribuo.clustering.hdbscan.HdbscanTrainer.train (HdbscanTrainer.java:302)

I've put a print statement in the process method of my CustomFieldProcessor and it doesn't seem to ever be called. Can you see anything in the code below that is missing or hasn't been done properly?

Map<String, FieldProcessor> fieldProcessors = new HashMap<>();
fieldProcessors.put("feature1", new CustomFieldProcessor("feature1"));
fieldProcessors.put("feature2", new CustomFieldProcessor("feature2"));
fieldProcessors.put("feature3", new CustomFieldProcessor("feature3"));
RowProcessor<ClusterID> rowProcessor = new RowProcessor<>(new EmptyResponseProcessor<>(new ClusteringFactory()), fieldProcessors);
SQLDataSource<ClusterID> datasource = new SQLDataSource<>(
    query,
    new SQLDBConfig(connectionString, new HashMap<String, String>()),
    new ClusteringFactory(),
    rowProcessor,
    true // outputRequired: this turns out to be the culprit (see the fix below)
);
Dataset<ClusterID> dataset = new MutableDataset<>(datasource);
HdbscanTrainer trainer = new HdbscanTrainer(5);
HdbscanModel model = trainer.train(dataset);

lazydog2 commented 2 months ago

I'm not sure exactly what I changed; it would only have been something minor, but it is now working.

lazydog2 commented 2 months ago

I think the fix was changing the outputRequired argument of the SQLDataSource constructor from true to false:

SQLDataSource<ClusterID> datasource = new SQLDataSource<>(
    query,
    new SQLDBConfig(connectionString, new HashMap<String, String>()),
    new ClusteringFactory(),
    rowProcessor,
    false
);

lazydog2 commented 2 months ago

I'm attempting to use HDBSCAN for unsupervised anomaly detection, but I'm curious about the following statement from the Anomaly Detection tutorial. Does it mean Tribuo's LibSVM or LibLinear support could be used for unsupervised anomaly detection?

The LibSVM anomaly detection algorithm requires there are no anomalies in the training data, but this is not required in general for Tribuo's anomaly detection infrastructure.

Craigacp commented 2 months ago

Yes, those both work for unsupervised anomaly detection assuming by "unsupervised" you mean you only have a sample of non-anomalous data.
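For reference, setting up the one-class LibSVM trainer looks roughly like this; a sketch loosely based on the Anomaly Detection tutorial (assuming tribuo-anomaly-libsvm on the classpath; the gamma and nu values are illustrative, not recommendations):

```java
import org.tribuo.anomaly.libsvm.LibSVMAnomalyTrainer;
import org.tribuo.anomaly.libsvm.SVMAnomalyType;
import org.tribuo.common.libsvm.KernelType;
import org.tribuo.common.libsvm.SVMParameters;

public class OneClassSvmSketch {
    static LibSVMAnomalyTrainer buildTrainer() {
        // ONE_CLASS fits a boundary around the (assumed non-anomalous) training data;
        // points falling outside it at prediction time are flagged as anomalous.
        var params = new SVMParameters<>(
                new SVMAnomalyType(SVMAnomalyType.SVMMode.ONE_CLASS),
                KernelType.RBF);
        params.setGamma(0.5); // illustrative RBF kernel width
        params.setNu(0.1);    // illustrative upper bound on the training error fraction
        return new LibSVMAnomalyTrainer(params);
    }
}
```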

lazydog2 commented 2 months ago

The samples are unlabelled and will contain non-anomalous data and potentially some anomalous data, so I'm using the HDBSCAN noise/outlier points to identify anomalies, but I am wondering if I can use Tribuo with LibSVM or LibLinear to achieve a similar result.

Craigacp commented 2 months ago

Unfortunately both LibSVM and LibLinear do anomaly detection by computing the probability that an incoming point was generated by their model of the training data, so if there are any anomalies in the training data then similar incoming points become more likely to be considered normal.

lazydog2 commented 2 months ago

With that in mind, can you clarify the following statement from the Anomaly Detection tutorial? I'm interpreting it to mean that the absence of anomalies in the training data "is not required in general for Tribuo's anomaly detection infrastructure":

The LibSVM anomaly detection algorithm requires there are no anomalies in the training data, but this is not required in general for Tribuo's anomaly detection infrastructure.

Craigacp commented 2 months ago

It means that it's not required by the core infrastructure of the anomaly detection packages (e.g. everything in tribuo-anomaly-core), but it is required by the two algorithms we have implemented for anomaly detection (LibSVM and LibLinear). Tribuo's development is mostly driven by the use cases we've had, and for anomaly detection the requirement that the training data not be anomalous has been acceptable, so we haven't had to implement other algorithms which can handle training data that is a mixture of anomalous and normal points.

lazydog2 commented 2 months ago

Thanks for the clarification @Craigacp and for all your other responses as part of this issue.