About csvLoader.loadDataSource

oracle / tribuo

Tribuo - A Java machine learning library

https://tribuo.org

Apache License 2.0


About csvLoader.loadDataSource #342

Open pablo3p opened 1 year ago

pablo3p commented 1 year ago

Hi there,

From this tutorial on regression: https://github.com/oracle/tribuo/blob/main/tutorials/regression-tribuo-v4.ipynb

var wineSource = csvLoader.loadDataSource(Paths.get("winequality-red.csv"),"quality");

This wineSource is a data structure, but I don't see much documentation for it. I am assuming that wineSource is a tabular data structure, and hoping that it is similar to a Python pandas DataFrame.

If that is the case, is there a print method, so one can print the data to the terminal and look at it?

There is not much out there on this.

Kind Regards,

Pablo

Craigacp commented 1 year ago

CSVLoader returns a CSVDataSource. The DataSource interface doesn't have much in the way of accessor methods; you should construct a MutableDataset from that data source, which will populate the feature & output information objects that you can query. If you want to print out the examples, you can iterate the data source and print each Example object.
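A minimal sketch of both suggestions, assuming the setup from the regression tutorial (a semicolon-separated winequality-red.csv in the working directory and a RegressionFactory for the outputs). The class name InspectWineSource and the choice to print only the first five rows are illustrative, not part of Tribuo, and the Tribuo regression and CSV modules need to be on the classpath.

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.tribuo.DataSource;
import org.tribuo.Example;
import org.tribuo.MutableDataset;
import org.tribuo.data.csv.CSVLoader;
import org.tribuo.regression.RegressionFactory;
import org.tribuo.regression.Regressor;

// Hypothetical class name, used only for this illustration.
public class InspectWineSource {
    public static void main(String[] args) throws IOException {
        // Assumption: the file is semicolon separated, as in the tutorial.
        var csvLoader = new CSVLoader<>(';', new RegressionFactory());
        DataSource<Regressor> wineSource =
            csvLoader.loadDataSource(Paths.get("winequality-red.csv"), "quality");

        // A DataSource is an Iterable of Example objects, so rows can be
        // printed one at a time. Only the first five are shown here.
        int shown = 0;
        for (Example<Regressor> example : wineSource) {
            System.out.println(example);
            if (++shown == 5) {
                break;
            }
        }

        // Constructing a MutableDataset populates the feature and output
        // information, which can then be queried.
        var dataset = new MutableDataset<>(wineSource);
        System.out.println("examples: " + dataset.size());
        System.out.println("features: " + dataset.getFeatureMap().size());
        System.out.println(dataset.getOutputInfo().toReadableString());
    }
}
```

Each printed Example is a single row (its output plus its named feature values), so this is a row-wise listing rather than a table view.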

Tribuo has a row-wise view of data, and doesn't provide a data frame style interface. If you want something more like a dataframe in Java then I think JTablesaw is supposed to be good for that, but I've not used it much.

pablo3p commented 1 year ago

Hi there, thanks for your quick reply. When passing in data, I want to make sure that it is correct, but it looks like there is no way to check that once it has been loaded into a CSVDataSource. I would prefer to load the data from CSV into something like JTablesaw first, and then pass it from JTablesaw into a Tribuo DataSource. Is this possible? Hope you can let me know.

Craigacp commented 1 year ago

You can inspect the examples after they have been loaded to make sure the pipeline is valid. I recommend looking at CSVDataSource rather than using CSVLoader as it's more flexible. There's a columnar data tutorial which explains the mechanisms - https://tribuo.org/learn/4.3/tutorials/columnar-tribuo-v4.html.

We don't currently support loading from JTablesaw into Tribuo because we can't capture the necessary provenance & reproducibility information out of a tablesaw dataset. It would be pretty useful to have, but due to the provenance issues we've not got around to it.
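Without a data frame bridge, one way to eyeball a file before handing it to Tribuo is a small check in plain Java. The sketch below is not part of Tribuo: CsvPreview and findRaggedRows are hypothetical names, and the naive split assumes no quoted fields contain the separator. It prints the first few lines and reports rows whose field count differs from the header's.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, independent of Tribuo.
public class CsvPreview {

    /** Returns one message per row whose field count differs from the header's. */
    public static List<String> findRaggedRows(List<String> lines, String separator) {
        List<String> problems = new ArrayList<>();
        if (lines.isEmpty()) {
            problems.add("file is empty");
            return problems;
        }
        // Quote the separator so it is treated literally; -1 keeps trailing empty fields.
        String regex = Pattern.quote(separator);
        int expected = lines.get(0).split(regex, -1).length;
        for (int i = 1; i < lines.size(); i++) {
            int actual = lines.get(i).split(regex, -1).length;
            if (actual != expected) {
                problems.add("line " + (i + 1) + ": expected " + expected
                        + " fields, found " + actual);
            }
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java CsvPreview <file.csv> [separator]");
            return;
        }
        String separator = args.length > 1 ? args[1] : ";";
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        // Show the header and the first few data rows.
        lines.stream().limit(5).forEach(System.out::println);
        // Report structurally suspicious rows.
        findRaggedRows(lines, separator).forEach(System.err::println);
    }
}
```

Because the check only reads the file, the same path can still be passed to CSVLoader or CSVDataSource afterwards, so Tribuo loads the data from the original CSV.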

pablo3p commented 1 year ago

Hi, thanks again. The link you provided seems to have a lot of useful concepts etc.

Yes, having something like JTablesaw load the CSV first and then pass it on to the CSVDataSource would be really good, I think. It hands responsibility for the integrity of the data to the data scientists, who are the subject matter experts: they can look into the data frame (JTablesaw in this case) and decide whether the data is in proper shape to pass into the CSVDataSource. Allowing human intervention at the data source stage of the pipeline is very valuable, because it gives the data scientist more control over data quality. This should be available as an option in Tribuo. I just wanted to elaborate on my thinking on this. Thanks again for all your great help, I really appreciate it. Best Regards,