waldo-seg / waldo

image-segmentation and text-localization
Apache License 2.0

memory issue when saving processed data #41

Closed YiwenShaoStephen closed 6 years ago

YiwenShaoStephen commented 6 years ago

@danpovey just reminded me of a memory issue we haven't handled yet. Currently we save the processed dataset to a single file on disk (e.g. train.pth.tar), which requires first holding the whole dataset in memory (e.g. train = process_data(); torch.save(train, ...)). For the dsb2018 dataset this isn't a problem since it is quite small, but for madcat and other big datasets we plan to try in the future it will be. Right now, when @aarora8 tries to save the full processed dataset (42k images), he hits a memory error. How should we solve this? Any help here would be appreciated.
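
For context, the current pattern is roughly the following; process_data and the paths are placeholders standing in for the actual scripts:

```python
import torch

# Everything is built up in memory first ...
train = process_data('data/dsb2018/train')          # dict holding all processed images
torch.save(train, 'data/processed/train.pth.tar')   # ... then pickled to one big file.

# Training later loads the whole dict back into memory in one go.
train = torch.load('data/processed/train.pth.tar')
```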

danpovey commented 6 years ago

Maybe you can look online for example scripts where people use PyTorch on large datasets like ImageNet, and see how they solve it there. There may be a standard method.

It looks to me like the training scripts are just loading the dicts using torch.load (i.e. loading them all into memory). That likely wouldn't scale to madcat either.

It may be that the whole approach of using .pth.tar files to store dicts just isn't scalable and we should do something different.


YiwenShaoStephen commented 6 years ago

OK, I will check it.

YiwenShaoStephen commented 6 years ago

I checked how people save and load large datasets like ImageNet and COCO. For ImageNet, since it's only an image recognition task and doesn't need any preprocessing, they just load each sample from a single image file (.png/.jpg). For COCO, where the 'annotation' is more complicated (e.g. 'captions' or 'mask'), the annotations are stored in text format (e.g. JSON), like what we have for the madcat dataset; they are kept in a single, well-indexed file. When they load a sample (e.g. foo.png), they read the image input from the raw image file and, at the same time, process annotations['foo.png'] to get the "mask". The reason I initially wrote a preprocess function is that I wanted to save the time of getting masks: for the dsb2018 dataset the masks are stored as images, so reading them from disk is time-consuming. For example, if there are 100 nuclei in foo.png, there will be 100 mask images, and opening them all with Image.open() is slow. However, what we have for madcat and other standard big datasets is similar to the "annotations" in COCO (@aarora8, correct me if I'm wrong). So my proposal is:

  1. Save the annotations of all the data to a single file, typically in text format (though not necessarily), and in most cases such a file already exists.
  2. When we load the data, generate the combined_image on the fly (a sketch of this is shown below). I'm not sure how long it will take to generate a combined_image from a raw image and a text-format mask; if it's fast, this is the most standard way to handle an image segmentation task.
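
A minimal sketch of what step 2 could look like, assuming a COCO-style JSON annotation file with flat-polygon object annotations; the file layout, field names, and the annotation_to_mask() helper below are illustrative assumptions, not the actual madcat format:

```python
import json
import os

import numpy as np
from PIL import Image, ImageDraw
from torch.utils.data import Dataset


def annotation_to_mask(objects, size):
    """Rasterize polygon annotations into a uint8 object-id mask.
    Assumed layout: each object is a flat [x0, y0, x1, y1, ...] polygon;
    object k gets label k + 1, background stays 0."""
    mask = Image.new('L', size, 0)
    draw = ImageDraw.Draw(mask)
    for k, polygon in enumerate(objects):
        draw.polygon(polygon, fill=k + 1)
    return np.array(mask, dtype=np.uint8)


class OnTheFlySegDataset(Dataset):
    """Reads raw images from disk and builds masks from a single JSON
    annotation file at load time (no preprocessed .pth.tar needed)."""

    def __init__(self, image_dir, annotation_file):
        self.image_dir = image_dir
        with open(annotation_file) as f:
            self.annotations = json.load(f)  # e.g. {"foo.png": [[...], ...]}
        self.names = sorted(self.annotations.keys())

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        img = Image.open(os.path.join(self.image_dir, name)).convert('RGB')
        mask = annotation_to_mask(self.annotations[name], img.size)
        return np.array(img), mask
```

Whether this is fast enough depends mostly on how expensive the annotation-to-mask rasterization is relative to decoding the raw image.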

danpovey commented 6 years ago

OK, so they basically just read from files. I'd rather make use of all the work we did in getting the annotations into a common format. That will keep the training code standard and efficient. (And it will keep the files compact, since we already invested time in making the saved image and mask files 8-bit in the normal case.)

How about we do it as follows. Each file will have an 'id' like an utterance-id... it might be the original filename; it would normally be the same as the key in your dict, assuming it doesn't contain characters that can't be put in filenames. Instead of using pickle tools to write, we write in a common format where the destination is a directory, not a file. The writer would put into the file 'index.txt' in that directory a list of the ids (like utterance-ids), which would be the base-names of the .img and .mask files (both numpy arrays); see the layout sketch below. We can write our own shared function to write that, and maybe our own data loader class that reads from that format. So it's a kind of standardized directory format with its own loaders and savers (where of course the savers and loaders would work incrementally, not all in one batch).

Dan
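
To make the proposed format concrete, the directory could look something like this (the directory, file, and id names are purely illustrative):

```
train_dir/
  index.txt        # one id per line, e.g. "page_0001"
  page_0001.img    # numpy array: the (possibly downsampled) image
  page_0001.mask   # numpy array: the object mask
  page_0002.img
  page_0002.mask
  ...
```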


YiwenShaoStephen commented 6 years ago

This method is also workable. @aarora8 is now estimating how long it takes to get a combined image (the one we use for training) from a raw image and its annotation, without any saving or loading. If it's fast, I don't think we need to write additional data to disk, since disk I/O would become the runtime bottleneck; according to @aarora8, writing these files to disk for madcat (42k images) takes a very long time. But if generating the masks is slow, then your proposal is the most appealing one, since we can do the generation in parallel and save time when loading data.

BTW, regarding data loading: after increasing the number of workers (i.e. loader threads/processes), the time spent loading data is negligible within one training iteration (minibatch).
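
For reference, this is just the standard PyTorch DataLoader knob; train_dataset, the batch size, and the worker count below are placeholders:

```python
from torch.utils.data import DataLoader

# Extra workers overlap data loading with GPU compute, so the
# per-iteration loading time becomes negligible.
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True,
                          num_workers=4, pin_memory=True)
```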

YiwenShaoStephen commented 6 years ago

In short, previously what we did was:

text-format annotation --> image_with_mask --> save image_with_mask --> **load image_with_mask --> convert it to combined_image**

and what I'm proposing is:

**text-format annotation --> combined_image**

The bold part is the time we spend during training.

danpovey commented 6 years ago

Don't forget that part of the data preparation stage will, in many cases, be subsampling the input images. That means that by preparing the images and masks in advance, we would save a large amount of disk I/O and also time. In addition, the training code would be more standardized (and there would be fewer changes versus the code we have now, which means less work before we can have it working).

Guys, let's not discuss this any more. Please solve this by writing a standard saving/loading functionality similar to what I said above.

aarora8 commented 6 years ago

I tried processing the data (around 1400 train/test/valid page images in total) without saving it to disk. It takes around 13 minutes (0.55 sec per page image). But that is for images whose size is reduced from 6100 x 5500 pixels to 512 x 512 pixels. I can do one more experiment with processing and saving.

danpovey commented 6 years ago

You can put it in scripts/waldo/data_io.py and have a DataSaver class whose constructor will take a directory name and which will have functions

write_image(self, name, image_with_mask)

and

write_index(self) [to be called after doing 'write_image' for all images; the class should check that you do these in the right order and that you don't forget to call write_index before destroying the object]

You can also put a standard Dataset class there, e.g. WaldoDataset, that can read these objects from the directory in standard format. OK to import torch for now to inherit from their Dataset class.
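
A minimal sketch of what such a data_io.py could contain, assuming image_with_mask is a dict of numpy arrays with 'img' and 'mask' keys (that layout, and the exact file naming, are assumptions):

```python
import os

import numpy as np
import torch
from torch.utils.data import Dataset


class DataSaver:
    """Writes each image/mask pair to its own pair of files in a target
    directory, plus an index.txt listing the ids, so that nothing has to
    be held in memory all at once."""

    def __init__(self, dirname):
        self.dirname = dirname
        self.ids = []
        self.index_written = False
        os.makedirs(dirname, exist_ok=True)

    def write_image(self, name, image_with_mask):
        assert not self.index_written, 'write_index() must be called last'
        # The .img/.mask files hold numpy arrays in .npy format; passing an
        # open file object keeps np.save from appending a .npy extension.
        with open(os.path.join(self.dirname, name + '.img'), 'wb') as f:
            np.save(f, image_with_mask['img'])
        with open(os.path.join(self.dirname, name + '.mask'), 'wb') as f:
            np.save(f, image_with_mask['mask'])
        self.ids.append(name)

    def write_index(self):
        with open(os.path.join(self.dirname, 'index.txt'), 'w') as f:
            for name in self.ids:
                f.write(name + '\n')
        self.index_written = True

    def __del__(self):
        # Catch the "forgot to call write_index()" case.
        if self.ids and not self.index_written:
            print('Warning: DataSaver destroyed without write_index()')


class WaldoDataset(Dataset):
    """Reads one .img/.mask pair per item; nothing is loaded up front."""

    def __init__(self, dirname):
        self.dirname = dirname
        with open(os.path.join(dirname, 'index.txt')) as f:
            self.ids = [line.strip() for line in f if line.strip()]

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        name = self.ids[idx]
        img = np.load(os.path.join(self.dirname, name + '.img'))
        mask = np.load(os.path.join(self.dirname, name + '.mask'))
        return torch.from_numpy(img), torch.from_numpy(mask)
```

Usage would be roughly: create a DataSaver during data preparation, call write_image() once per image, then write_index(); training then constructs WaldoDataset(dirname) and wraps it in a DataLoader.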

YiwenShaoStephen commented 6 years ago

@danpovey Oh yes, downsampling is crucial here, so preprocessing is necessary.

YiwenShaoStephen commented 6 years ago

@danpovey Just to make sure, by "not using pickle", can we use the io methods provided by PIL or scipy to write .mask and .img?

danpovey commented 6 years ago

I was thinking numpy might have its own mechanism? Actually it's OK to use pickle, I just mean, do it to individual files, not to one big dict.


YiwenShaoStephen commented 6 years ago

Got it! I will try to use numpy.save. It seems faster.
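
For reference, a minimal numpy round-trip; passing an open file object to np.save keeps the .img/.mask extensions instead of having .npy appended, and np.load does not care about the extension (file names here are illustrative):

```python
import numpy as np

img = np.zeros((512, 512, 3), dtype=np.uint8)
with open('page_0001.img', 'wb') as f:  # stored in .npy format despite the extension
    np.save(f, img)

loaded = np.load('page_0001.img')
assert (loaded == img).all()
```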

danpovey commented 6 years ago

BTW, I think it's a nice idea to give the dataset loader an option whereby if you initialize with 'cache = True' it will cache things in memory.
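
One way such an option could look, as a generic wrapper around any map-style dataset (purely illustrative, not an existing waldo class):

```python
from torch.utils.data import Dataset


class CachedDataset(Dataset):
    """Wraps another dataset; with cache=True, each item is kept in
    memory after its first read and served from memory afterwards."""

    def __init__(self, base, cache=True):
        self.base = base
        self.cache = {} if cache else None

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        if self.cache is not None and idx in self.cache:
            return self.cache[idx]
        item = self.base[idx]
        if self.cache is not None:
            self.cache[idx] = item
        return item
```

Note that with num_workers > 0 each DataLoader worker process keeps its own copy of the cache, so the caching pays off most with num_workers=0 or when the cache is filled in the main process before the workers fork.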
