Extend the CSVLoader class to read from different datasources/targets and different kinds of formats

neomatrix369 commented 3 years ago

Is your feature request related to a problem? Please describe. At the moment it appears the CSVLoader can only load .csv files from the disk or a file system. Which could be a limitation both from the functionality point of view and also provenance (metadata) recording point of view.

In the Provenance data, we see the path of the file given during the training process, this path could be invalid if the process was run in docker containers or another ephemeral means.

Other libraries (non-Java based) allow loading .tgz, .zip, etc formats and although this may just be a single step when trying to manage multiple datasets this can be a boon.

Describe the solution you'd like CSVLoader through sub-class implementations allow loading:

files downloaded from not just local file system but also via the web (Secure and public sources i.e. S3 bucket or github)
files stored in different formats i.e. .tgz, .zip (compressed formats mainly)
data stored in datastores / databases (via login/password or other connection strings)
additional metadata information about the dataset itself, i.e field definition and background of the dataset or links or resources to them

Additional context Maybe show these functionalities or other functionalities or features of CSVLoader via notebook tutorials.

This request is actually two folds:

file format
data source (or target) location

Once any or all of these are established, the provenance information can now have a bit more independent set of information on how to replicate the data loading process.

For e.g.


TrainTestSplitter(
    class-name = org.tribuo.evaluation.TrainTestSplitter
    source = CSVLoader(
            class-name = org.tribuo.data.csv.CSVLoader
            outputFactory = LabelFactory(
                    class-name = org.tribuo.classification.LabelFactory
                )
            response-name = species
            separator = ,
            quote = "
            path = file:/Users/apocock/Development/Tribuo/tutorials/bezdekIris.data
            file-modified-time = 2020-07-06T10:52:01.938-04:00
            resource-hash = 36F668D1CBC29A8C2C1128C5D2F0D400FA04ED4DC62D12246F44CE9360360CC0
        )
    train-proportion = 0.7
    seed = 1
    size = 150
    is-train = true
)

From the above I could not recreate the model building process or just the data loading process easily because path = file:/Users/apocock/Development/Tribuo/tutorials/bezdekIris.data is local an individual computer system. While we could have paths like path = https://path/to/bezdekIris.data which would make the whole process a lot more independent. And also add value to the provenance metadata, as we would know the original source of the data.

Craigacp commented 3 years ago

The CSVLoader is designed to be a very simple and quick way of getting a numerical csv with a response column up off the disk and into Tribuo. The file format is moderately flexible in CSVLoader you can change the separator and quote characters, but I don't expect to expand CSVLoader beyond that. For anything more complex than that in terms of format, processing or other information you should use CSVDataSource, IDXDataSource, JsonDataSource, SQLDataSource, or TextDataSource. If the format isn't covered by one of those then we could look at adding a new data source, but it wouldn't be in CSVLoader. I agree it would be nice to have more provenance information in the data sources, and we could look at adding an instance field to the DataProvenance, but for the time being you can store extra information in the runProvenance argument to Trainer.train as that is stored in the resulting model.

With respect to transparently reading compressed formats, we could add that as an option to the previously mentioned data sources. For zip and tgz formats then we'd need to cope with potential directory structure and figure out which file to load from within the archive, which adds complexity.

It's intentional that the current loaders do not read from remote endpoints (apart from SQLDataSource which connects to an external database). We could relax this but it would have to be controlled by a flag, and it's a bit of an issue that a configuration file can make a web request to load something.

Craigacp commented 3 years ago

I'm finishing off a tutorial on RowProcessor which uses CSVDataSource and JsonDataSource to load more complex columnar data from csv and json files respectively.

neomatrix369 commented 3 years ago

@Craigacp I agree with the points above, maybe I wasn't clear enough. I meant inheriting from CSVDatasource and creating implementations that do that separately, like you mentioned CSVDataSource, IDXDataSource, JsonDataSource, SQLDataSource, or TextDataSource

The parent class of implementation is an implementation detail, it's just to provide more ways to load data and capture the source is what I was eluding to.

neomatrix369 commented 3 years ago

About compressed files, directory/folder support won't be necessary, just allowing it to detect that it's compressed csv/json file is more than enough. There may be use-cases for such usage as I have already seen that when data files get used, lots of different lightweight formats are sought after to solve read/write issues (storage and latency).

Craigacp commented 3 years ago

We already have something that can transparently figure out if it's a GZipped file elsewhere in OLCUT, which will return the appropriate input stream implementation. We could probably extend that to support zip, but I don't think we'd want to induce a dependency inside OLCUT to get bzip support.

Craigacp commented 3 years ago

So concretely there would be:

optional loading of gzip or zip compressed files through the data sources
loading files over the web (most libraries that do this provide a caching mechanism which would require designing, especially as the provenance hash currently reads the file a second time, it would be bad to download it twice)
mechanism for adding additional metadata to a datasource (e.g. additional provenance information on construction? or something else)
support for other data formats

For the last point I'm not clear what's required. Tribuo can already connect to things via JDBC, and read delimited and json format inputs. Are there other major formats we should support?

We use ColumnarDataSource as the base class for CSV, Json and SQL format data, so there could be other subclasses of that for other columnar inputs.

oracle / tribuo

Extend the CSVLoader class to read from different datasources/targets and different kinds of formats #70