Closed: lazydog2 closed this issue 2 months ago
Aside from the SQL connection information, `SQLDataSource` behaves exactly like `CSVDataSource` in terms of how Tribuo processes the data, which you can see in the columnar data tutorial. Is there something about how the SQL connection works that we should document better? I think internally we may only have used it against Oracle DBs, but it should talk to anything that works via JDBC.
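For example, a minimal sketch of the connection setup against a non-Oracle database (the PostgreSQL URL and credentials here are hypothetical; any JDBC driver on the classpath should work the same way):

```java
import java.util.HashMap;
import java.util.Map;

import org.tribuo.data.sql.SQLDBConfig;

// Standard JDBC connection properties, passed through to the driver.
Map<String, String> props = new HashMap<>();
props.put("user", "tribuo");
props.put("password", "example-password");

// Any JDBC URL should work, not just Oracle's.
SQLDBConfig dbConfig = new SQLDBConfig("jdbc:postgresql://localhost:5432/mydb", props);
```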
Thanks @Craigacp, I've been working my way through the documentation and figuring out how to construct the `SQLDataSource`. The `SQLDataSource` constructor includes an `OutputFactory` argument that the `CSVDataSource` doesn't; I believe this could be a `ClusteringFactory` when used to perform clustering with HDBSCAN. A `SQLDataSource` example, whether for HDBSCAN or another technique, would help someone unfamiliar with Tribuo determine the correct usage.
Yes, you should use a `ClusteringFactory` for HDBSCAN. The `SQLDataSource` should probably be refactored to pull its output factory from the `RowProcessor`, which is how the `CSVDataSource` works. We must have missed that in the refactor which unified the columnar processing infrastructure.

If you have a ground truth clustering that you want to load in for later comparison then you'll need to write a `ResponseProcessor` which creates the appropriate `ClusterID` object from your source. Otherwise use `EmptyResponseProcessor` and pass it a `ClusteringFactory`.
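A sketch of the two options; the column name `"cluster"` and default value `"-1"` are assumptions, so check the `FieldResponseProcessor` signature in your Tribuo version:

```java
import org.tribuo.clustering.ClusterID;
import org.tribuo.clustering.ClusteringFactory;
import org.tribuo.data.columnar.ResponseProcessor;
import org.tribuo.data.columnar.processors.response.EmptyResponseProcessor;
import org.tribuo.data.columnar.processors.response.FieldResponseProcessor;

// Option 1: no ground truth; every example gets the unassigned ClusterID.
ResponseProcessor<ClusterID> noLabels =
    new EmptyResponseProcessor<>(new ClusteringFactory());

// Option 2: read a ground truth cluster id from a column for later comparison.
ResponseProcessor<ClusterID> groundTruth =
    new FieldResponseProcessor<>("cluster", "-1", new ClusteringFactory());
```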
Thanks @Craigacp. Is the `RowProcessor` tutorial you mention in response to leccelecce's comment below the same columnar data tutorial you mentioned in your comment above? https://github.com/oracle/tribuo/issues/50#issuecomment-709612052
While not related to this SQLDataSource example issue, I am also looking for an example of using a non-double `Feature` value, e.g. a `BigInteger`.
Yes, that's the one.
All Tribuo feature values are doubles. You can use `RowProcessor` to transform other things into features, but the values must be doubles as that's what all the computational backends support.
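As an illustration (this helper class is hypothetical, not part of Tribuo), a custom `FieldProcessor` for a `BigInteger` column would parse the field's text and emit the value as a double, accepting the precision loss above 2^53:

```java
import java.math.BigInteger;

public class BigIntegerConversion {
    // Convert a BigInteger field's text into the double feature value that
    // a FieldProcessor.process implementation would emit.
    public static double toFeatureValue(String fieldText) {
        return new BigInteger(fieldText.trim()).doubleValue();
    }

    public static void main(String[] args) {
        System.out.println(toFeatureValue("12345"));            // 12345.0
        // 2^53 + 1 is not representable as a double and rounds to 2^53.
        System.out.println(toFeatureValue("9007199254740993")); // 9.007199254740992E15
    }
}
```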
Hi @Craigacp, I'm having trouble with either the `SQLDataSource` or a custom `FieldProcessor`. When I run the code, it produces the following output:
```
INFO: Iterated over 1000000 rows
[WARNING]
java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
    at org.tribuo.math.neighbour.bruteforce.NeighboursBruteForce.<init> (NeighboursBruteForce.java:48)
    at org.tribuo.math.neighbour.bruteforce.NeighboursBruteForceFactory.createNeighboursQuery (NeighboursBruteForceFactory.java:99)
    at org.tribuo.math.neighbour.bruteforce.NeighboursBruteForceFactory.createNeighboursQuery (NeighboursBruteForceFactory.java:38)
    at org.tribuo.clustering.hdbscan.HdbscanTrainer.calculateCoreDistances (HdbscanTrainer.java:335)
    at org.tribuo.clustering.hdbscan.HdbscanTrainer.train (HdbscanTrainer.java:265)
    at org.tribuo.clustering.hdbscan.HdbscanTrainer.train (HdbscanTrainer.java:302)
```
I've put a print statement in the process method of my CustomFieldProcessor and it doesn't seem to ever be called. Can you see anything in the code below that is missing or hasn't been done properly?
```java
Map<String, FieldProcessor> fieldProcessors = new HashMap<>();
fieldProcessors.put("feature1", new CustomFieldProcessor("feature1"));
fieldProcessors.put("feature2", new CustomFieldProcessor("feature2"));
fieldProcessors.put("feature3", new CustomFieldProcessor("feature3"));
RowProcessor<ClusterID> rowProcessor = new RowProcessor<>(new EmptyResponseProcessor<>(new ClusteringFactory()), fieldProcessors);
SQLDataSource<ClusterID> datasource = new SQLDataSource<>(
        query,
        new SQLDBConfig(connectionString, new HashMap<String, String>()),
        new ClusteringFactory(),
        rowProcessor,
        true
);
Dataset<ClusterID> dataset = new MutableDataset<>(datasource);
HdbscanTrainer trainer = new HdbscanTrainer(5);
HdbscanModel model = trainer.train(dataset);
```
I'm not sure exactly what I changed (it would only have been something minor), but it is now working. I think the fix was changing the outputRequired argument of the `SQLDataSource` constructor from true to false. That would explain the empty dataset: `EmptyResponseProcessor` never produces an output, so with outputRequired set to true every row was presumably skipped, leaving nothing to train on.
```java
SQLDataSource<ClusterID> datasource = new SQLDataSource<>(
        query,
        new SQLDBConfig(connectionString, new HashMap<String, String>()),
        new ClusteringFactory(),
        rowProcessor,
        false
);
```
I'm attempting to use HDBSCAN for unsupervised anomaly detection, but I'm curious about the following statement from the Anomaly Detection tutorial: does it mean Tribuo's LibSVM or LibLinear support could be used for unsupervised anomaly detection?
> The LibSVM anomaly detection algorithm requires there are no anomalies in the training data, but this is not required in general for Tribuo's anomaly detection infrastructure.
Yes, those both work for unsupervised anomaly detection assuming by "unsupervised" you mean you only have a sample of non-anomalous data.
The samples are unlabelled and will contain non-anomalous data and potentially some anomalous data, so I'm using the HDBSCAN noise/outliers to identify anomalous points. I'm wondering if I can use Tribuo with LibSVM or LibLinear to achieve a similar result.
Unfortunately both LibSVM and LibLinear do anomaly detection by computing the probability of the incoming point being generated by their model of the training data, so if there are any anomalies in the training data then they make the incoming data more likely to be considered normal.
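For reference, constructing Tribuo's one-class SVM anomaly trainer looks roughly like this; the parameter values are illustrative (this follows the anomaly detection tutorial, but check the API in your Tribuo version):

```java
import org.tribuo.anomaly.Event;
import org.tribuo.anomaly.libsvm.LibSVMAnomalyTrainer;
import org.tribuo.anomaly.libsvm.SVMAnomalyType;
import org.tribuo.common.libsvm.KernelType;
import org.tribuo.common.libsvm.SVMParameters;

// One-class SVM: trained on data that is assumed to be non-anomalous.
SVMParameters<Event> params =
    new SVMParameters<>(new SVMAnomalyType(SVMAnomalyType.SVMMode.ONE_CLASS), KernelType.RBF);
params.setGamma(1.0); // RBF kernel width (illustrative value)
params.setNu(0.1);    // upper bound on the fraction of training errors (illustrative value)
LibSVMAnomalyTrainer trainer = new LibSVMAnomalyTrainer(params);
```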
With that in mind, are you able to provide further clarification on this statement from the Anomaly Detection tutorial? I'm interpreting it to mean that the absence of anomalies in the training data "is not required in general for Tribuo's anomaly detection infrastructure":
> The LibSVM anomaly detection algorithm requires there are no anomalies in the training data, but this is not required in general for Tribuo's anomaly detection infrastructure.
It means that it's not required for the core infrastructure of the anomaly detection packages (e.g. all the stuff in `tribuo-anomaly-core`), but it is required in the two algorithms that we have implemented for anomaly detection (LibSVM and LibLinear). Tribuo's development is mostly driven by the use cases we've had, and for anomaly detection the requirement that the training data not be anomalous has been acceptable, so we haven't had to implement other algorithms which can deal with the training data being a mixture of anomalous and normal data.
Thanks for the clarification @Craigacp and for all your other responses as part of this issue.
**Ask the question**
I'm trying to use an `SQLDataSource` for HDBSCAN clustering and would appreciate an example of using `SQLDataSource`, as there doesn't seem to be one in the documentation.
**Is your question about a specific ML algorithm or approach?**
HDBSCAN

**Is your question about a specific Tribuo class?**
`SQLDataSource`

**System details**

**Additional context**
N/A