rpautrat / SuperPoint

Efficient neural feature detector and descriptor
MIT License

Killed #111

Open thomasstats opened 4 years ago

thomasstats commented 4 years ago

I am attempting to train MagicPoint and SuperPoint on my own dataset, which is very large (1.2 million 1080p images for training).

I keep hitting a point before training even starts where I run out of RAM and the process is Killed:

2019-10-15 15:31:39.699165: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2019-10-15 15:31:39.699173: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2019-10-15 15:31:39.699180: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2019-10-15 15:31:39.699186: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2019-10-15 15:31:39.699193: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2019-10-15 15:31:39.699199: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2019-10-15 15:31:39.699206: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-10-15 15:31:39.699843: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-10-15 15:31:39.699868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-15 15:31:39.699874: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2019-10-15 15:31:39.699877: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2019-10-15 15:31:39.700534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10037 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
W1015 15:31:39.700742 140694165006144 deprecation_wrapper.py:119] From /home/statslinux/Workspace/SuperPoint/superpoint/models/base_model.py:280: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.
Killed

I have

cache_in_memory: false

I even tried disabling photometric augmentation, and I currently have resize set to 240,320 to help reduce RAM usage, but I still encounter the problem. Is the program loading and preprocessing the images but never releasing them from memory?
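
For scale, even at the reduced resolution there is no way the whole dataset could be held in RAM, so I would expect the pipeline to stream images rather than accumulate them (rough back-of-the-envelope, assuming 240x320 grayscale float32 tensors):

# Rough estimate for the setup above: 1.2 million images at 240x320, grayscale, float32.
bytes_per_image = 240 * 320 * 4                # ~0.3 MB per decoded image
total_gb = 1.2e6 * bytes_per_image / 1e9       # ~369 GB, far beyond 32 GB of RAM
print(round(total_gb))                         # 369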

thomasstats commented 4 years ago

Update:

To add to this, I've narrowed it down to this line in base_model.py, in _build_graph(self):

self.dataset_handles[n] = self.sess.run(i.string_handle())

My memory usage (32 GB total) rests at 18%; once get_dataset is called it climbs to 45%. Then, at this line, it shoots to 100% and the process is killed.

Also worth noting: my batch size is set to 1.
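
A minimal way to reproduce these measurements (a rough sketch only; psutil is an extra dependency, not part of this repo) is to log the process RSS around that line:

import os
import psutil

def log_rss(tag):
    # Print the resident set size of the current process, in GB.
    rss = psutil.Process(os.getpid()).memory_info().rss / 1e9
    print('[{}] RSS = {:.2f} GB'.format(tag, rss))

# Placed around the suspected call in base_model.py (illustration only):
# log_rss('before string_handle')
# self.dataset_handles[n] = self.sess.run(i.string_handle())
# log_rss('after string_handle')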

rpautrat commented 4 years ago

Hi,

Can you share the file that you used to import your own dataset? I mean the equivalent of coco.py or patches_dataset.py, but for your own dataset.

thomasstats commented 4 years ago
    def _init_dataset(self, **config):
        DATA_PATH = '/path/to/files'  # it is on a separate drive, so I have this hardcoded
        image_paths = []
        names = []
        for r,d,f in os.walk(DATA_PATH):
            for file in f:
                names.append(file.split('.')[0])
                image_paths.append(os.path.join(r,file))

        if config['truncate']:
            image_paths = image_paths[:config['truncate']]
        files = {'image_paths': image_paths, 'names': names}

        if config['labels']:
            label_paths = []
            for n in names:
                p = Path('/path/to/labels/', config['labels'], '{}.npz'.format(n))
                assert p.exists(), 'Image {} has no corresponding label {}'.format(n, p)
                label_paths.append(str(p))
            files['label_paths'] = label_paths
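        # Attach a map_parallel helper to tf.data.Dataset so the maps below run with
        # num_parallel_calls (same monkey-patch as in the bundled dataset files).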
        tf.data.Dataset.map_parallel = lambda self, fn: self.map(fn, num_parallel_calls=config['num_parallel_calls'])
        return files

    def _get_data(self, files, split_name, **config):
        has_keypoints = 'label_paths' in files
        is_training = split_name == 'training'

        def _read_image(path):
            image = tf.read_file(path)
            image = tf.image.decode_jpeg(image, channels=3)
            return image

        def _preprocess(image):
            image = tf.image.rgb_to_grayscale(image)
            if config['preprocessing']['resize']:
                image = pipeline.ratio_preserving_resize(image, **config['preprocessing'])
            return image

        def _read_points(filename):
            return np.load(filename.decode('utf-8'))['points'].astype(np.float32)

        names = tf.data.Dataset.from_tensor_slices(files['names'])
        images = tf.data.Dataset.from_tensor_slices(files['image_paths'])
        images = images.map(_read_image)
        images = images.map(_preprocess)
        data = tf.data.Dataset.zip({'image': images, 'name': names})

        if has_keypoints:
            kp = tf.data.Dataset.from_tensor_slices(files['label_paths'])
            kp = kp.map(lambda path: tf.py_func(_read_points, [path], tf.float32))
            kp = kp.map(lambda points: tf.reshape(points, [-1,2]))
            data = tf.data.Dataset.zip((data,kp)).map(lambda d, k: {**d, 'keypoints': k})
            data = data.map(pipeline.add_dummy_valid_mask)

        if split_name == 'validation':
            data = data.take(config['validation_size'])

        if config['cache_in_memory']:
            tf.logging.info('Caching data, first access will take some time.')
            data = data.cache()

        if config['warped_pair']['enable']:
            assert has_keypoints
            warped = data.map_parallel(lambda d: pipeline.homographic_augmentation(d, add_homography=True, **config['warped_pair']))
            if is_training and config['augmentation']['photometric']['enable']:
                warped = warped.map_parallel(lambda d: pipeline.photometric_augmentation(d, **config['augmentation']['photometric']))
            warped = warped.map_parallel(pipeline.add_keypoint_map)
            data = tf.data.Dataset.zip((data, warped))
            data = data.map(lambda d,w: { **d, 'warped': w})

        if has_keypoints and is_training:
            if config['augmentation']['photometric']['enable']:
                data = data.map_parallel(lambda d: pipeline.photometric_augmentation(d, **config['augmentation']['photometric']))
            if config['augmentation']['homographic']['enable']:
                assert not config['warped_pair']['enable']
                data = data.map_parallel(lambda d: pipeline.homographic_augmentation(d, **config['augmentation']['homographic']))

        if has_keypoints:
            data = data.map_parallel(pipeline.add_keypoint_map)

        return data

Most of it is the same. The major difference is how the filenames are retrieved. Do you think storing them this way, rather than using a generator, could be what's eating all the memory? I wasn't entirely sure how to replicate your code for my situation, so I wrote it this way. The hard drive that SuperPoint lives on is far too small for this dataset, so I keep the data on a separate drive; when I initially tested with COCO, it easily fit on that drive. Should I maybe reset the code to how it was and just modify the directory settings file as necessary?
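
If it matters, a lazier version of the listing (rough sketch, untested; DATA_PATH is the same placeholder as above) would replace the Python lists with a generator:

import os
import tensorflow as tf

DATA_PATH = '/path/to/files'  # same placeholder directory as above

def _walk_images():
    # Yield file paths lazily instead of materializing one big Python list.
    for root, _, filenames in os.walk(DATA_PATH):
        for filename in filenames:
            yield os.path.join(root, filename)

image_paths = tf.data.Dataset.from_generator(_walk_images, output_types=tf.string)
# image_paths could then be mapped with the same _read_image / _preprocess
# helpers used in _get_data above.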

rpautrat commented 4 years ago

This looks fine to me; I don't think the change of path or the fact that the dataset is located on a separate hard drive is an issue.

You can still try to truncate the dataset to a small amount (the 'truncate' parameter) and run with the exact same configuration, to see whether the problem really is the size of the dataset.

But otherwise I don't understand why it would fill the memory entirely like this. This is a tf.data Dataset, so it should free the memory of previous elements and retrieve only one batch at a time.
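
To illustrate what I mean, here is a minimal toy sketch (not from this repo) of how a tf.data pipeline in TF 1.x is expected to behave: each sess.run pulls only the next batch.

import numpy as np
import tensorflow as tf

# Toy pipeline: the Dataset is evaluated lazily, so each sess.run(next_element)
# materializes a single batch rather than the whole dataset.
data = tf.data.Dataset.from_tensor_slices(np.zeros((8, 240, 320, 1), np.float32))
data = data.batch(1).prefetch(1)
next_element = data.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    first = sess.run(next_element)   # one 1x240x320x1 batch
    second = sess.run(next_element)  # the next batch; earlier results are not kept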

thomasstats commented 4 years ago

Yeah, the memory never appears to be released either. I moved some RAM over to that machine and then decreased the size of the training set to get usage a hair below 32 GB. Once it hits that line, the memory is used up and never freed, which, as you said, is odd, because it seems like it should just iterate and then discard.

I'll look into truncate once my current experiment with the subset of training data is done.