Closed mike0sv closed 2 years ago
As discussed in the sync with @mike0sv earlier today, we'll add batch-reading support for data formats that support the chunksize
parameter when reading via the Pandas API.
Unsupported data formats should raise a clear MlemError
so that users get an informative error via the CLI. Numpy batch-reading support is out of scope for this issue.
@mike0sv, I created a mock iris CSV dataset from my cloned local repository of example-mlem (no-dvc branch) and attempted to run the following:
mlem init
python src/generate_data.py
python src/train.py data/train models/rf models/logreg
mlem link models/rf latest
# This step fails
mlem apply latest mock/data.csv -i --import-type 'pandas[csv]' -m predict_proba -o data/pred
Essentially, I'm trying to trigger the workflow where I import a Dataset object on-the-fly to perform predictions - https://github.com/iterative/mlem/blob/main/mlem/cli/apply.py#L83. However, I'm getting the following exception.
The printed object in the above screenshot was from a print statement which I added for debugging here - https://github.com/iterative/mlem/blob/main/mlem/api/commands.py#L376
print(ImportAnalyzer.types)
I noticed the PandasImport
hook here (https://github.com/iterative/mlem/blob/main/mlem/contrib/pandas.py#L518), but it was somehow not retrieved for use. May I know if there's any documentation on this, or a config option I might be missing?
Thank you!
Do you have pandas
installed in this env?
Yeah I do, pip freeze yields the following:
can you run python -c "from mlem.ext import find_implementations; print(find_implementations())"?
oh, you should probably also install numpy
also, created #200
Ah, numpy has already been installed!
Got it, fix is on the way. While the PR waits for review, merge, and a new release, you can either set the MLEM_ADDITIONAL_EXTENSIONS=mlem.contrib.pandas env var or just work on the other branch in the meantime :)
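For clarity, the workaround is just an environment variable that force-loads the pandas extension before the CLI runs:

```shell
# Force-load the pandas extension until the fix is released
export MLEM_ADDITIONAL_EXTENSIONS=mlem.contrib.pandas
```

With the variable set, the failing mlem apply command from earlier should be able to find the PandasImport hook.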
> add MLEM_ADDITIONAL_EXTENSIONS=mlem.contrib.pandas env
Perfect, this resolves the error, thank you for the quick fix!
For big datasets, and potentially for multi-file datasets (e.g. a collection of images), we need an alternative way of reading data in batches. Potential use case:
mlem apply model some-very-big.csv --batch 10000
DatasetReader, and also Dataset and DatasetMeta, could be affected. For some dataset types batching is impossible, so we should either ignore the batch option and fall back to a regular read, or raise an error (this should be configurable). Also, support for batch reading should be specified for each dataset type. Update the mlem apply CLI and change the underlying API call to support this feature.