Open neomatrix369 opened 3 years ago
The `CSVLoader` is designed to be a very simple and quick way of getting a numerical CSV with a response column off the disk and into Tribuo. The file format is moderately flexible in `CSVLoader`: you can change the separator and quote characters, but I don't expect to expand `CSVLoader` beyond that. For anything more complex in terms of format, processing or other information you should use `CSVDataSource`, `IDXDataSource`, `JsonDataSource`, `SQLDataSource`, or `TextDataSource`. If the format isn't covered by one of those then we could look at adding a new data source, but it wouldn't be in `CSVLoader`.

I agree it would be nice to have more provenance information in the data sources, and we could look at adding an instance field to the `DataProvenance`, but for the time being you can store extra information in the `runProvenance` argument to `Trainer.train`, as that is stored in the resulting model.
With respect to transparently reading compressed formats, we could add that as an option to the previously mentioned data sources. For `zip` and `tgz` formats we'd need to cope with potential directory structure and figure out which file to load from within the archive, which adds complexity.
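Transparent decompression for the single-file case can be done by sniffing the magic bytes at the head of the stream before deciding how to wrap it. The following is a minimal sketch of that idea using only the JDK (the class and method names here are illustrative, not Tribuo or OLCUT API):

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class TransparentGzip {
    // Wraps the stream in a GZIPInputStream if the first two bytes match
    // the gzip magic number (0x1f, 0x8b), otherwise returns the stream
    // unchanged. mark/reset lets us peek without consuming the bytes.
    public static InputStream maybeDecompress(InputStream in) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(in);
        buffered.mark(2);
        int b1 = buffered.read();
        int b2 = buffered.read();
        buffered.reset();
        if (b1 == 0x1f && b2 == 0x8b) {
            return new GZIPInputStream(buffered);
        }
        return buffered;
    }

    public static void main(String[] args) throws IOException {
        String csv = "sepalLength,species\n5.1,Iris-setosa\n";
        // Compress the sample CSV in memory.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (OutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(csv.getBytes("UTF-8"));
        }
        // Both the compressed and the plain bytes read back identically,
        // so this prints the header line twice.
        for (byte[] data : new byte[][]{bos.toByteArray(), csv.getBytes("UTF-8")}) {
            BufferedReader r = new BufferedReader(new InputStreamReader(
                maybeDecompress(new ByteArrayInputStream(data)), "UTF-8"));
            System.out.println(r.readLine());
        }
    }
}
```

A caller (or a data source constructor) can wrap any `InputStream` this way without needing to know in advance whether the file on disk was gzipped.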
It's intentional that the current loaders do not read from remote endpoints (apart from `SQLDataSource`, which connects to an external database). We could relax this, but it would have to be controlled by a flag, as it's a bit of an issue that a configuration file can make a web request to load something.
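The flag-gated approach could look something like the sketch below (hypothetical class, method, and flag names, not Tribuo API): remote schemes are refused unless the caller explicitly opts in, so a configuration file containing an `http://` path cannot silently trigger a web request.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public class GuardedLoader {
    // Opens a URL, but refuses remote schemes unless allowRemote is set.
    // With allowRemote defaulting to false at call sites, existing
    // configuration-driven loads keep their current local-only behaviour.
    public static InputStream open(URL url, boolean allowRemote) throws IOException {
        String scheme = url.getProtocol();
        boolean local = scheme.equals("file") || scheme.equals("jar");
        if (!local && !allowRemote) {
            throw new IllegalArgumentException(
                "Remote scheme '" + scheme + "' requires allowRemote=true");
        }
        return url.openStream();
    }
}
```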
I'm finishing off a tutorial on `RowProcessor` which uses `CSVDataSource` and `JsonDataSource` to load more complex columnar data from CSV and JSON files respectively.
@Craigacp I agree with the points above; maybe I wasn't clear enough. I meant inheriting from `CSVDataSource` and creating implementations that do that separately, like the ones you mentioned: `CSVDataSource`, `IDXDataSource`, `JsonDataSource`, `SQLDataSource`, or `TextDataSource`. The parent class of the implementation is an implementation detail; providing more ways to load data and capture the source is what I was alluding to.
About compressed files: directory/folder support won't be necessary, just allowing it to detect that a file is a compressed CSV/JSON file is more than enough. There are likely use cases for this; I have already seen that when data files get used, lots of different lightweight formats are sought after to solve read/write issues (storage and latency).
We already have something elsewhere in OLCUT that can transparently figure out if a file is GZipped, and which returns the appropriate input stream implementation. We could probably extend that to support zip, but I don't think we'd want to introduce a dependency inside OLCUT to get bzip support.
So concretely there would be:
For the last point I'm not clear what's required. Tribuo can already connect to things via JDBC, and read delimited and json format inputs. Are there other major formats we should support?
We use `ColumnarDataSource` as the base class for CSV, JSON and SQL format data, so there could be other subclasses of that for other columnar inputs.
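Conceptually, a new columnar input just has to turn its underlying format into rows of column-name-to-value pairs that the shared row-processing machinery can featurise. A stripped-down illustration of that contract, using a toy tab-separated parser (hypothetical class, not Tribuo's actual `ColumnarDataSource` hierarchy):

```java
import java.util.*;

public class TsvRowSource {
    // A toy columnar source: parses tab-separated text into rows keyed
    // by the header names, mirroring the row shape a ColumnarDataSource
    // subclass would hand on to the row-processing layer.
    public static List<Map<String, String>> rows(String tsv) {
        String[] lines = tsv.split("\n");
        String[] headers = lines[0].split("\t");
        List<Map<String, String>> out = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            String[] fields = lines[i].split("\t");
            Map<String, String> row = new LinkedHashMap<>();
            for (int j = 0; j < headers.length; j++) {
                // Pad short rows with empty strings rather than failing.
                row.put(headers[j], j < fields.length ? fields[j] : "");
            }
            out.add(row);
        }
        return out;
    }
}
```

Any format that can be reduced to this row shape (Parquet, fixed-width text, and so on) would slot in the same way, regardless of which parent class the implementation actually extends.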
**Is your feature request related to a problem? Please describe.**
At the moment it appears the `CSVLoader` can only load `.csv` files from the disk or a file system, which could be a limitation both from the functionality point of view and from the provenance (metadata) recording point of view. In the provenance data we see the path of the file given during the training process; this path could be invalid if the process was run in Docker containers or by other ephemeral means.

Other libraries (non-Java based) allow loading `.tgz`, `.zip`, etc. formats, and although this may just be a single step, when trying to manage multiple datasets it can be a boon.

**Describe the solution you'd like**
`CSVLoader`, through sub-class implementations, should allow loading `.tgz` and `.zip` (compressed formats mainly).

**Additional context**
Maybe show these functionalities, or other functionalities or features of `CSVLoader`, via notebook tutorials.

This request is actually two-fold:
Once any or all of these are established, the provenance information can carry a somewhat more self-contained description of how to replicate the data loading process.
For example, from the above I could not easily recreate the model building process, or even just the data loading process, because

`path = file:/Users/apocock/Development/Tribuo/tutorials/bezdekIris.data`

is local to an individual computer system. Whereas with paths like

`path = https://path/to/bezdekIris.data`

the whole process would be a lot more independent, and it would also add value to the provenance metadata, as we would know the original source of the data.