Sorry, I have a hard time understanding the language used to describe the issue. Can you try rephrasing it?
I am looking to do image classification with images that can be larger than 400x400.
The input feature has a parameter in_memory. Documentation: in_memory (default: true): defines whether the image dataset will reside in memory during the training process or will be dynamically fetched from disk (useful for large datasets). In the latter case a training batch of input images will be fetched from disk each training iteration.
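For concreteness, here is a sketch of the kind of feature definition I mean (parameter names taken from the documentation above and the template below; this is only an illustration):

```python
# Sketch of an image input feature that is streamed from disk instead of
# being held in memory. Parameter names come from the docs/template in this issue.
image_feature = {
    'type': 'image',
    'name': 'random_image',
    'width': 299,          # images are resized to 299x299 during preprocessing
    'height': 299,
    'num_channels': 1,
    'in_memory': False,    # a batch of images is fetched from disk each iteration
}
```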
In ludwig/features/image_feature.py, line 105, what I see is that when in_memory=false the data is still written to an HDF5 file on disk -- that will take forever even with resizing, and will have limitations based on the input size.
The process of dynamically fetching is not clear. What is the recommended approach for training on large images? Is data_train_csv supposed to contain the data for all iterations, is that correct?
I am using the experiment function as in integration_tests:
input_features_template = Template(
    "[{type: image, name: random_image, width: 299, in_memory: false,"
    " height: 299, num_channels: 1, encoder: ${encoder},"
    " resnet_size: 8, destination_folder: ${folder}}]")

experiment(
    model_definition,
    model_definition_file=None,
    data_csv=None,
    data_train_csv=None,
    data_validation_csv=None,
    data_test_csv=None,
    data_hdf5=None,
    data_train_hdf5=None,
    data_validation_hdf5=None,
    data_test_hdf5=None,
    train_set_metadata_json=None,
    experiment_name='experiment',
    model_name='run',
    model_load_path=None,
    model_resume_path=None,
    skip_save_progress_weights=False,
    skip_save_processed_input=False,
    skip_save_unprocessed_output=False,
    output_directory='results',
    gpus=None,
    gpu_fraction=1.0,
    use_horovod=False,
    random_seed=default_random_seed,
    debug=False,
    **kwargs
)
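Roughly, I am calling it like this (a sketch; the import path is assumed from the integration tests, and the CSV paths, template substitutions, and category output feature are placeholders):

```python
from string import Template

import yaml

from ludwig.experiment import experiment  # import path assumed from the integration tests

input_features_template = Template(
    "[{type: image, name: random_image, width: 299, in_memory: false,"
    " height: 299, num_channels: 1, encoder: ${encoder},"
    " resnet_size: 8, destination_folder: ${folder}}]")

# Substitute the template and parse the resulting YAML list of features.
input_features = yaml.safe_load(
    input_features_template.substitute(encoder='resnet', folder='/tmp/images'))

model_definition = {
    'input_features': input_features,
    'output_features': [{'type': 'category', 'name': 'label'}],  # placeholder output
}

experiment(
    model_definition,
    data_train_csv='train.csv',            # placeholder paths
    data_validation_csv='validation.csv',
    data_test_csv='test.csv',
)
```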
Ludwig puts the output of the preprocessing in the HDF5 file. You can skip saving it, there is a specific parameter for that, --skip_save_processed_input, but then you won't be able to read from disk, because the in_memory parameter assumes there is an HDF5 file to read from. Preprocessing images is an expensive task and doing it one time before training is a common practice; otherwise you would have to do it every single time, for every batch of every epoch, and that would make your training extremely slow.
Finally, depending on the task, 400x400 images may be really big and not that useful; consider using the resize functionality.
Also, at the moment the process of preprocessing images and text (the two most expensive parts of preprocessing) is performed by a single thread. We have on our roadmap a parallel implementation that will make creating the HDF5 file much faster. But again, consider that if you don't pay that cost upfront you pay it during training, and you pay it many more times.
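Conceptually, the planned parallel preprocessing would fan the per-image work out over several workers before writing to HDF5. A rough sketch of the idea (not Ludwig code, just an illustration using PIL and numpy):

```python
from multiprocessing import Pool

import numpy as np
from PIL import Image


def preprocess_image(path, size=(299, 299)):
    # Read one image, convert to single channel, resize, return a fixed-size array.
    img = Image.open(path).convert('L').resize(size)
    return np.asarray(img, dtype=np.uint8)


def preprocess_all(paths, workers=8):
    # Today this loop is effectively single threaded; the planned change is to
    # spread it over a pool of workers before the results are written to HDF5.
    with Pool(workers) as pool:
        return np.stack(pool.map(preprocess_image, paths))
```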
@ydudin3 can you investigate the feasibility of having an in_memory=False that works directly from the CSV without an intermediate HDF5 file? I consider this a feature request.
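Something along these lines, conceptually; a hypothetical sketch of reading batches straight from the CSV, not an existing Ludwig API:

```python
import csv

import numpy as np
from PIL import Image


def csv_batches(csv_path, image_column, batch_size=128, size=(299, 299)):
    # Hypothetical just-in-time loader: read image paths from the CSV and
    # decode/resize each batch only when it is needed, with no HDF5 in between.
    with open(csv_path, newline='') as f:
        rows = list(csv.DictReader(f))
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        images = [
            np.asarray(Image.open(r[image_column]).convert('L').resize(size))
            for r in batch
        ]
        yield np.stack(images)
```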
When the input feature is type: image with in_memory: false, I expected the images to be loaded just in time. Unexpectedly, the images are saved to an HDF5 file, which will never finish when a large number of large images are to be trained. Is there a roadmap of features that are planned for immediate fixes or releases?
Thanks