Closed mike0sv closed 2 years ago
As discussed in the sync with @mike0sv earlier today, we'll add batch-reading support for data formats that support the chunksize
parameter when reading via the Pandas API.
Unsupported data formats should raise a clear MlemError
so that users get an informative error via the CLI. Numpy batch-reading support is out of scope for this issue.
@mike0sv, I created a mock iris CSV dataset from my cloned local repository of example-mlem (no-dvc branch) and attempted to run the following:
mlem init
python src/generate_data.py
python src/train.py data/train models/rf models/logreg
mlem link models/rf latest
# This step fails
mlem apply latest mock/data.csv -i --import-type 'pandas[csv]' -m predict_proba -o data/pred
Essentially, I'm trying to trigger the workflow where I import a Dataset object on-the-fly to perform predictions - https://github.com/iterative/mlem/blob/main/mlem/cli/apply.py#L83. However, I'm getting the following exception.
The printed object in the above screenshot was from a print statement which I added for debugging here - https://github.com/iterative/mlem/blob/main/mlem/api/commands.py#L376
print(ImportAnalyzer.types)
I noticed the PandasImport
hook here (https://github.com/iterative/mlem/blob/main/mlem/contrib/pandas.py#L518), but it was somehow not retrieved for use. May I know if there's any documentation on this, or a config option I might be missing?
Thank you!
Do you have pandas
installed in this env?
Yeah I do, pip freeze yields the following:
can you run python -c "from mlem.ext import find_implementations; print(find_implementations())"?
oh, you should probably also install numpy
also, created #200
Ah, numpy has already been installed!
Got it, fix is on the way. While the PR waits for review, merge, and a new release, you can either set the MLEM_ADDITIONAL_EXTENSIONS=mlem.contrib.pandas env var or just work on the other branch in the meantime :)
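For clarity, the workaround is just an environment variable that force-loads the pandas extension before the CLI runs:

```shell
# Force-load the pandas extension until the fix is released
export MLEM_ADDITIONAL_EXTENSIONS=mlem.contrib.pandas
```

With the variable set, the failing mlem apply command from earlier should be able to find the PandasImport hook.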
> add MLEM_ADDITIONAL_EXTENSIONS=mlem.contrib.pandas env
Perfect, this resolves the error, thank you for the quick fix!
For big datasets, and potentially for multi-file datasets (e.g. a collection of images), we need an alternative way of reading data in batches. Potential use case:
mlem apply model some-very-big.csv --batch 10000
DatasetReader, and also Dataset and DatasetMeta, could be affected. For some dataset types batching is impossible, so we should either ignore the batch option and fall back to a regular read, or raise an error (this should be configurable). Also, support for batch reading should be specified for each dataset type. Update the mlem apply CLI and change the underlying API call to support this feature.