RLstat opened this issue 6 years ago
I think multiprocessing=True has been causing people some unexpected issues; I just get a warning. You can comment out that line of code in Model.py. I'm working on the same Kaggle competition, and after updating to the latest branch I am getting an error on the same generator_output = next(output_generator) line of code in training.py. Do you think it has anything to do with the data specific to the competition? I keep tinkering with the config files to see if I'm doing something wrong.
~/anaconda3/lib/python3.6/site-packages/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   2190             batch_index = 0
   2191             while steps_done < steps_per_epoch:
-> 2192                 generator_output = next(output_generator)
   2193
   2194                 if not hasattr(generator_output, '__len__'):

~/anaconda3/lib/python3.6/site-packages/keras/utils/data_utils.py in get(self)
    791                 success, value = self.queue.get()
    792                 if not success:
--> 793                     six.reraise(value.__class__, value, value.__traceback__)

~/anaconda3/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

ValueError: Unable to create correctly shaped tuple from [(51, 51), (0, 0), (0, 0)]
@RLstat My best guess is that the issue is a bug in your Dataset class. I'm basing that on the error message you sent, where it seems that it's raising the StopIteration error when the queue is empty. I could be wrong, but that's something to investigate. I have not seen this error before.
@jameschartouni Your error is something different. It's complaining about not being able to create a correctly shaped tuple. Check your Dataset class as well, especially if you have code that applies padding to images. I recall seeing such an error (in a different project) related to padding images.
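For context, that particular ValueError comes from numpy's np.pad (in the numpy versions of that era) when the padding spec has more entries than the array has dimensions, for example a grayscale (2D) image reaching padding code that assumes an (H, W, C) array. A minimal reproduction with made-up sizes:

import numpy as np

image = np.zeros((154, 256), dtype=np.uint8)  # 2D: the channel axis is missing
padding = [(51, 51), (0, 0), (0, 0)]          # three entries: assumes a 3D input
# Older numpy raises:
#   ValueError: Unable to create correctly shaped tuple from [(51, 51), (0, 0), (0, 0)]
np.pad(image, padding, mode='constant', constant_values=0)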
@RLstat I have the same error as you. How did you solve it?
@waleedka, thank you for your suggestion. My Dataset class is attached below. I am not sure it is the cause, since the error only occurs when I use multiprocessing = True and workers >= 1. It works fine when I use workers = 0 (which is much slower); an illustrative fit_generator call showing these settings follows the class below. Please let me know if you see potential problems here. I modified it from your train_shapes example.
import os
import numpy as np
from skimage.io import imread  # assumed source of the bare imread() used below

import utils  # Mask R-CNN's utils module


class BowlDataset(utils.Dataset):
    """Loads the Kaggle Data Science Bowl nuclei dataset. Each image
    directory contains an images/ folder with the source image and a
    masks/ folder with one binary mask file per nucleus instance.
    """

    def load_bowls(self, datapath):
        # Add classes
        self.add_class("bowl", 1, "nuclei")
        dataids = next(os.walk(datapath))[1]
        # Add images
        for i in range(len(dataids)):
            img_path = datapath + dataids[i] + '/images/' + dataids[i] + '.png'
            mask_dir = datapath + dataids[i] + '/masks/'
            image_shape = imread(img_path).shape
            height = image_shape[0]
            width = image_shape[1]
            self.add_image("bowl", image_id=i, path=img_path,
                           mask_dir=mask_dir, height=height,
                           width=width)

    def image_reference(self, image_id):
        info = self.image_info[image_id]
        if info["source"] == "bowl":
            return info["id"]
        else:
            return super().image_reference(image_id)

    def load_image(self, image_id):
        info = self.image_info[image_id]
        # Drop the alpha channel if present
        img = imread(info["path"])[:, :, :3]
        return img.astype(np.uint8)

    def load_mask(self, image_id):
        info = self.image_info[image_id]
        mask_dir = info['mask_dir']
        mask_file_list = next(os.walk(mask_dir))[2]
        mask = np.zeros([info['height'], info['width'], len(mask_file_list)],
                        dtype=np.uint8)
        for i, mask_file in enumerate(mask_file_list):
            mask_img = imread(mask_dir + mask_file)
            mask[:, :, i] = mask_img
        # Binarize: mask pixels are 0 or 255 in the source files
        mask = np.where(mask > 128, 1, 0)
        return mask, np.ones(len(mask_file_list)).astype(np.int32)
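To make the workers / use_multiprocessing settings above concrete: they are arguments to Keras's fit_generator (Mask R-CNN forwards them from its training code). A hypothetical call with placeholder values:

# train_generator and the counts below are placeholders for illustration.
model.fit_generator(
    train_generator,
    steps_per_epoch=100,
    epochs=10,
    max_queue_size=100,
    workers=0,                  # 0 = load batches on the main thread: slow but safe
    use_multiprocessing=False,  # True with workers >= 1 is the failing configuration
)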
The Dataset class looks good. I don't see anything obvious that would cause a problem. Could it be caused by some kind of limitation in the Colab environment? Although others are using Colab as well, and so far I haven't seen reports of this issue.
While I can't help with that error, here is an unrelated note that might be useful: one thing you can try is to cache the mask files in .npy files. It's 5 times faster than loading individual .png files, so it should help, especially if you end up setting workers=0. Here is my implementation; I'm planning to add it to this repo tonight or tomorrow.
# Requires module-level imports: os, numpy as np, skimage.io
def load_mask(self, image_id):
    """Generate instance masks for an image.
    Returns:
        masks: A bool array of shape [height, width, instance count] with
            one mask per instance.
        class_ids: a 1D array of class IDs of the instance masks.
    """
    info = self.image_info[image_id]
    # Get mask directory from image path
    mask_dir = os.path.join(info["path"].split("/images/")[0], "masks")
    # Create a cache directory.
    # Masks are in multiple png files, which is slow to load. So cache
    # them in a .npy file after the first load.
    cache_dir = os.path.join(mask_dir, "../../cache")
    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir)
    # Is there a cached .npy file?
    cache_path = os.path.join(cache_dir, "{}.npy".format(info["id"]))
    if os.path.exists(cache_path):
        mask = np.load(cache_path, allow_pickle=False)
    else:
        # Read mask files from .png images
        mask = []
        for f in next(os.walk(mask_dir))[2]:
            if f.endswith(".png"):
                m = skimage.io.imread(os.path.join(mask_dir, f)).astype(np.bool)
                mask.append(m)
        mask = np.stack(mask, axis=-1)
        # Cache the mask in a Numpy file
        np.save(cache_path, mask)
    # Return mask, and array of class IDs of each instance. Since we have
    # one class ID, we return an array of ones
    return mask, np.ones([mask.shape[-1]], dtype=np.int32)
Correction: remove the allow_pickle=False in the code above.
Edit: another fix: push the cache directory one level up so it doesn't get treated like another image directory:
cache_dir = os.path.join(mask_dir, "../../../cache")
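To see why the extra ../ matters, here is how the two paths resolve (the layout below is a hypothetical example based on the Kaggle bowl structure, data/stage1_train/&lt;image_id&gt;/masks):

import os

mask_dir = "data/stage1_train/abc123/masks"  # hypothetical image id

os.path.normpath(os.path.join(mask_dir, "../../cache"))
# -> 'data/stage1_train/cache': still inside the dataset directory, so
#    the os.walk() in load_bowls picks it up as if it were another image id.

os.path.normpath(os.path.join(mask_dir, "../../../cache"))
# -> 'data/cache': one level up, outside the directories being scanned.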
@waleedka, thank you for your suggestion! I am currently trying to run it in another cloud computing environment to check whether or not it is a Colab-specific issue. I will keep you posted.
By the way, have you considered using keras.utils.Sequence instead of the generator, since it is recommended for multiprocessing in Keras's documentation?
I tried to do that but wasn't very successful, especially when trying to achieve the functionality of skipping images that fail to meet certain conditions (as in your generator, where you skip images whose mask is empty). If you try keras.utils.Sequence in the future or have some insights, please let me know; a rough sketch of one possible approach follows below.
Thanks!!
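For what it's worth, here is a minimal sketch of how the skip-bad-images behavior could be expressed with keras.utils.Sequence. Everything here (BowlSequence, keep_fn, has_nonempty_mask) is a hypothetical illustration, not code from this repo, and it only sketches the skeleton rather than Mask R-CNN's full multi-input batches:

import numpy as np
from keras.utils import Sequence

class BowlSequence(Sequence):
    """Hypothetical sketch of a Sequence-based loader. Instead of skipping
    bad samples inside a generator loop, filter the image ids once at
    construction time; __getitem__ is then a pure function of the batch
    index, which is what makes Sequence safe under multiprocessing."""

    def __init__(self, dataset, image_ids, batch_size, keep_fn):
        self.dataset = dataset
        self.batch_size = batch_size
        # Keep only ids that pass the condition, e.g. non-empty masks.
        self.image_ids = [i for i in image_ids if keep_fn(dataset, i)]

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.image_ids) / float(self.batch_size)))

    def __getitem__(self, idx):
        ids = self.image_ids[idx * self.batch_size:(idx + 1) * self.batch_size]
        # Assumes images were resized to a common shape beforehand; the
        # real Mask R-CNN pipeline builds several more inputs per batch.
        images = np.stack([self.dataset.load_image(i) for i in ids])
        masks = [self.dataset.load_mask(i)[0] for i in ids]
        return images, masks

def has_nonempty_mask(dataset, image_id):
    # Example predicate: drop images whose mask stack has no positive pixels.
    mask, _ = dataset.load_mask(image_id)
    return bool(mask.any())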
@RLstat Okay, ignore my mask caching idea. After testing it in a real situation, it turns out that .npy files are really large (easily exceeding 150 MB for some masks), which slows things down and can cause memory errors. The caching idea is good in concept, but implementing it with numpy.save() turned out to be a bad idea.
And, yes, I do want to change the code to use the Keras Sequence class. Just haven't had the chance yet. I would happily merge a PR if someone else manages to do it.
@waleedka, I think the StopIteration error may be related to the computing environment and resources. I tried it on Google Cloud with 2 CPUs and 13 GB of memory; the error still occurs, but it is almost always delayed until the stage when I am training "all", where I see an increase in memory usage. It never happens during the early stage of training "heads", as opposed to what I saw in Colab.
I guess multiprocessing on Linux duplicates the generator and can cause a large amount of resource consumption... I'm not sure what the magic in keras.utils.Sequence is, but I hope it will not consume as much as the current approach.
Hi, Waleed and other experts,