matterport / Mask_RCNN

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

StopIteration Error when using workers = multiprocessing.cpu_count() #407

Open RLstat opened 6 years ago

RLstat commented 6 years ago

Hi, Waleed and other experts,

I am using this implementation for the 2018 Data Science Bowl on Kaggle, and most of the time I run it on Colab, which has 2 CPUs and 1 K80 GPU.

However, I often encounter the StopIteration error shown in the attached image. It doesn't appear at the beginning of the training process; in fact, the point at which it shows up is fairly random, though generally within the first 10 epochs.

The only way I can currently fix it is by setting workers = 0, which disables multiprocessing for the data generator, but that makes training significantly slower (rough sketch below).

Do you have any idea how to fix this and why this happens? Thanks!
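
For context, the workaround amounts to changing the arguments that model.py passes to Keras's fit_generator, roughly like this (a sketch from memory, not the exact repo lines; the surrounding arguments vary by branch):

import multiprocessing

# workers = multiprocessing.cpu_count()  # the default: one worker per CPU
workers = 0  # my workaround: run the data generator on the main thread

self.keras_model.fit_generator(
    train_generator,
    epochs=epochs,
    steps_per_epoch=self.config.STEPS_PER_EPOCH,
    max_queue_size=100,
    workers=workers,
    # With workers=0 no worker processes are spawned,
    # so this flag should be False as well
    use_multiprocessing=workers > 0,
)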
jameschartouni commented 6 years ago

I think multiprocessing=True has been causing people some unexpected issues; I just get a warning. You can comment out that line of code in Model.py. I'm working on the same Kaggle competition, and after updating to the latest branch I am getting an error on the same generator_output = next(output_generator) line in training.py. Do you think it has anything to do with the data specific to the competition? I keep tinkering with the config files to see if I'm doing something wrong.

~/anaconda3/lib/python3.6/site-packages/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   2190                 batch_index = 0
   2191                 while steps_done < steps_per_epoch:
-> 2192                     generator_output = next(output_generator)
   2193 
   2194                     if not hasattr(generator_output, '__len__'):

~/anaconda3/lib/python3.6/site-packages/keras/utils/data_utils.py in get(self)
    791             success, value = self.queue.get()
    792             if not success:
--> 793                 six.reraise(value.__class__, value, value.__traceback__)

~/anaconda3/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

ValueError: Unable to create correctly shaped tuple from [(51, 51), (0, 0), (0, 0)]
waleedka commented 6 years ago

@RLstat My best guess is that the issue is a bug in your Dataset class. I'm basing that on the error message you sent, where it seems that it's raising the StopIteration error when the queue is empty. I could be wrong, but that's something to investigate. I have not seen this error before.

@jameschartouni Your error is something different. It's complaining about not being able to create a correctly shaped tuple. Check your Dataset class as well, especially if you have code that applies padding to images. I recall seeing such an error (in a different project) related to padding images.
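
For illustration, that specific ValueError looks like what numpy.pad raised (on the NumPy releases of that era) when the padding spec has more entries than the array has dimensions, e.g. a 2-D grayscale image hitting padding code written for (H, W, C) images. A minimal repro sketch, not code from this repo:

import numpy as np

# A grayscale (2-D) image where a 3-D (H, W, C) array was expected.
image = np.zeros((410, 512))

# An (H, W, C)-style padding spec: pad 51 rows top and bottom,
# nothing on width or channels.
padding = [(51, 51), (0, 0), (0, 0)]

# Raises roughly:
#   ValueError: Unable to create correctly shaped tuple from
#   [(51, 51), (0, 0), (0, 0)]
np.pad(image, padding, mode='constant', constant_values=0)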

Du-Shaojie commented 6 years ago

@RLstat I have the same error as you. How did you solve it?

RLstat commented 6 years ago

@waleedka, Thank you for your suggestion. My Dataset class is attached below. I am not sure it is the cause, since the error only occurs when I use multiprocessing = True and workers >= 1; it works fine with workers = 0 (which is much slower). Please let me know if you see any potential problems. I modified it from your train_shapes example.

import os
import numpy as np
from skimage.io import imread

import utils  # Mask R-CNN's utils module

class BowlDataset(utils.Dataset):
    """Dataset for the 2018 Data Science Bowl nuclei data. Each image
    directory holds one .png image plus one binary .png mask file per
    nucleus instance.
    """

    def load_bowls(self, datapath):
        # Add classes: a single foreground class, "nuclei"
        self.add_class("bowl", 1, "nuclei")
        dataids = next(os.walk(datapath))[1]

        # Add images, recording each one's path, mask directory, and size
        for i in range(len(dataids)):
            img_path = datapath + dataids[i] + '/images/' + dataids[i] + '.png'
            mask_dir = datapath + dataids[i] + '/masks/'
            height, width = imread(img_path).shape[:2]
            self.add_image("bowl", image_id=i, path=img_path,
                           mask_dir=mask_dir, height=height, width=width)

    def image_reference(self, image_id):
        info = self.image_info[image_id]
        if info["source"] == "bowl":
            return info["id"]
        else:
            return super(self.__class__, self).image_reference(image_id)

    def load_image(self, image_id):
        # Drop the alpha channel; some Bowl images are RGBA
        info = self.image_info[image_id]
        img = imread(info["path"])[:, :, :3]
        return img.astype(np.uint8)

    def load_mask(self, image_id):
        info = self.image_info[image_id]
        mask_dir = info['mask_dir']
        mask_file_list = next(os.walk(mask_dir))[2]
        # One binary channel per instance
        mask = np.zeros([info['height'], info['width'], len(mask_file_list)],
                        dtype=np.uint8)
        for i, mask_file in enumerate(mask_file_list):
            mask[:, :, i] = imread(mask_dir + mask_file)
        # Binarize: mask pixels are stored as 0 or 255 on disk
        mask = np.where(mask > 128, 1, 0)
        # Every instance is class 1 ("nuclei")
        return mask, np.ones(len(mask_file_list), dtype=np.int32)
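
For reference, this is roughly how I exercise the class before training (the data path below is hypothetical; point it at the extracted stage1_train folder):

dataset_train = BowlDataset()
dataset_train.load_bowls('../input/stage1_train/')  # hypothetical path
dataset_train.prepare()

# Spot-check one sample: image is (H, W, 3) uint8; mask is
# (H, W, instance_count) with one binary channel per nucleus.
image = dataset_train.load_image(0)
mask, class_ids = dataset_train.load_mask(0)
print(image.shape, mask.shape, class_ids.shape)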
waleedka commented 6 years ago

The Dataset class looks good; I don't see anything obvious that would cause a problem. Could it be caused by some kind of limitation in the Colab environment? Although others are using Colab as well, and so far I haven't seen other reports of this issue.

While I can't help with that error, here is an unrelated note that might be useful: you can cache the mask files in .npy files. It's 5 times faster than loading individual .png files, so it should help, especially if you end up setting workers=0. Here is my implementation; I'm planning to add it to this repo tonight or tomorrow.

    def load_mask(self, image_id):
        """Generate instance masks for an image.
       Returns:
        masks: A bool array of shape [height, width, instance count] with
            one mask per instance.
        class_ids: a 1D array of class IDs of the instance masks.
        """
        info = self.image_info[image_id]
        # Get mask directory from image path
        mask_dir = os.path.join(info["path"].split("/images/")[0], "masks")
        # Create a cache directory
        # Masks are in multiple png files, which is slow to load. So cache
        # them in a .npy file after the first load
        cache_dir = os.path.join(mask_dir, "../../cache")
        if not os.path.exists(cache_dir):
            os.makedirs(cache_dir)
        # Is there a cached .npy file?
        cache_path = os.path.join(cache_dir, "{}.npy".format(info["id"]))
        if os.path.exists(cache_path):
            mask = np.load(cache_path, allow_pickle=False)
        else:
            # Read mask files from .png image
            mask = []
            for f in next(os.walk(mask_dir))[2]:
                if f.endswith(".png"):
                    m = skimage.io.imread(os.path.join(mask_dir, f)).astype(bool)
                    mask.append(m)
            mask = np.stack(mask, axis=-1)
            # Cache the mask in a Numpy file
            np.save(cache_path, mask)
        # Return mask, and array of class IDs of each instance. Since we have
        # one class ID, we return an array of ones
        return mask, np.ones([mask.shape[-1]], dtype=np.int32)
waleedka commented 6 years ago

Correction, remove the allow_pickle=False in the code above.

Edit: Another fix: Push the cache directory one level up so it doesn't get treated like another image directory.

cache_dir = os.path.join(mask_dir, "../../../cache")
RLstat commented 6 years ago

@waleedka, Thank you for your suggestion! I am currently running it in another cloud computing environment to check whether this is a Colab-specific issue. I will keep you posted.

By the way, have you considered using keras.utils.Sequence instead of the generator, since it is recommended for multiprocessing in the Keras documentation?

I was trying to do that myself, but without much success, especially when trying to replicate the ability to skip images that fail to meet certain conditions (your generator skips images whose masks are empty). If you try keras.utils.Sequence in the future or have any insights, please let me know.
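
Here is roughly what I attempted (a hypothetical sketch, not working Mask R-CNN input code; the real data_generator also builds RPN targets, image metas, and so on). Since a Sequence's __getitem__ must always return a batch and can't simply continue like a generator, I resample a random image when the condition fails:

import numpy as np
from keras.utils import Sequence

class NucleiSequence(Sequence):
    def __init__(self, dataset):
        self.dataset = dataset
        self.image_ids = np.copy(dataset.image_ids)

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, idx):
        image_id = self.image_ids[idx]
        mask, class_ids = self.dataset.load_mask(image_id)
        # Resample a random image instead of skipping when the mask
        # has no instances (the generator's skip condition).
        while mask.shape[-1] == 0:
            image_id = np.random.choice(self.image_ids)
            mask, class_ids = self.dataset.load_mask(image_id)
        image = self.dataset.load_image(image_id)
        # Batch of one, just for illustration.
        return image[np.newaxis], mask[np.newaxis]

    def on_epoch_end(self):
        # Reshuffle between epochs, like the generator shuffles image_ids.
        np.random.shuffle(self.image_ids)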

Thanks!!

waleedka commented 6 years ago

@RLstat Okay, ignore my mask caching idea. After testing it in a real situation, it turns out that the .npy files are really large (easily exceeding 150 MB for some masks), which slows things down and can cause memory errors. The caching idea is good in concept, but implementing it with numpy.save() turned out to be a bad one.

And, yes, I do want to change the code to use the Keras Sequence class; I just haven't had the chance yet. I would happily merge a PR if someone else manages to do it.

RLstat commented 6 years ago

@waleedka, I think the StopIteration error may be related to the computing environment and resources. I tried it on Google Cloud with 2 CPUs and 13 GB of memory; the error still occurs, but it is almost always delayed until the stage where I train "all" layers, where I see an increase in memory usage. It never happens during the early training of the "heads" layers, unlike what I saw on Colab.

I guess multiprocessing on Linux duplicates the generator, so each forked worker carries its own copy, which can consume a large amount of resources... I'm not sure what the magic in keras.utils.Sequence is, but I hope it won't consume as much as the current approach.
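
A toy illustration of the duplication point (Linux fork-based multiprocessing only; nothing Keras-specific):

import multiprocessing

def counter():
    i = 0
    while True:
        i += 1
        yield i

gen = counter()
next(gen)  # parent advances its copy to 1

def worker(_):
    # Each forked worker inherits its own copy of `gen`; advancing it
    # here does not affect the parent's copy, and every copy holds its
    # own state and memory.
    return next(gen)

if __name__ == '__main__':
    with multiprocessing.Pool(2) as pool:
        print(pool.map(worker, range(4)))  # e.g. [2, 3, 2, 3]
    print(next(gen))  # parent's copy is still at 2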