thibo73800 / capsnet-traffic-sign-classifier

A TensorFlow implementation of CapsNet (Capsule Networks) applied to the German traffic sign dataset
Apache License 2.0

Other dataset #6

Closed redhat12345 closed 6 years ago

redhat12345 commented 6 years ago

How can we use your code with another RGB dataset? Suppose the structure of the dataset is like this: it contains some sub-folders, and each sub-folder represents one class.

```
Class A: 0001.jpg 1, 0002.jpg 1
Class B: 0001.jpg 2, 0002.jpg 2
```

thibo73800 commented 6 years ago

Hi!

You have to change the pipeline of the data_handler.py script: https://github.com/thibo73800/capsnet-traffic-sign-classifier/blob/master/data_handler.py

mrinal18 commented 6 years ago

Hey, can you tell me again how to use another dataset, as @redhat12345 asked? I didn't get how to change the pipeline. It would be a great help, thanks.

thibo73800 commented 6 years ago

You have a method called get_data which first sets the path to each dataset (train, test, validation).

```python
TRAIN_FILE = "train.p"
VALID_FILE = "valid.p"
TEST_FILE = "test.p"

training_file = os.path.join(folder, TRAIN_FILE)
validation_file = os.path.join(folder, VALID_FILE)
testing_file = os.path.join(folder, TEST_FILE)
```

`.p` is the extension for pickle files. Thus, once a file is opened with pickle, I can use the resulting variable as a Python dictionary.

```python
with open(training_file, mode='rb') as f:
    train = pickle.load(f)
with open(validation_file, mode='rb') as f:
    valid = pickle.load(f)
with open(testing_file, mode='rb') as f:
    test = pickle.load(f)

# Retrieve all data
X_train, y_train = train['features'], train['labels']
X_valid, y_valid = valid['features'], valid['labels']
X_test, y_test = test['features'], test['labels']
```

`train['features']` stores the list of all features and `train['labels']` stores the list of all labels. The same goes for the validation and test sets.

To summarize, all you need to do is write, in this method, your own code to open your dataset and return the following variables: X_train, y_train, X_valid, y_valid, X_test, y_test.
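For illustration, a minimal sketch of such a get_data replacement for a folder-per-class dataset could look like the following (this is only a sketch: the Pillow and scikit-learn imports, the 32x32 resize and the 70/15/15 split are my own assumptions, not something this repository requires):

```python
import os
import numpy as np
from PIL import Image                                   # assumption: Pillow is installed
from sklearn.model_selection import train_test_split    # assumption: scikit-learn is installed

def get_data(folder, img_size=32):
    """Load a folder-per-class dataset into the six arrays expected by the rest of the code."""
    features, labels = [], []
    class_names = sorted(d for d in os.listdir(folder)
                         if os.path.isdir(os.path.join(folder, d)))
    for label, class_name in enumerate(class_names):
        class_dir = os.path.join(folder, class_name)
        for file_name in os.listdir(class_dir):
            if not file_name.lower().endswith(('.jpg', '.jpeg', '.png')):
                continue
            # Read the image, force RGB and resize it to the network input size
            img = Image.open(os.path.join(class_dir, file_name)).convert('RGB')
            img = img.resize((img_size, img_size))
            features.append(np.array(img))
            labels.append(label)

    X, y = np.array(features), np.array(labels)
    # Arbitrary 70% train / 15% validation / 15% test split
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=0)
    X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
    return X_train, y_train, X_valid, y_valid, X_test, y_test
```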

I hope this clarifies things.

mrinal18 commented 6 years ago

Hey, thank you for such a fast reply. I get what's happening in the code. However, when I try to integrate my dataset with it, it doesn't work. Since I am also new to this, I kindly ask for your help on how to integrate it with a dataset that has labeled folders with images inside, just like @redhat12345 described.

Thank you

mrinal18 commented 6 years ago

```python
#!/usr/bin/python3
# -*- coding: utf-8 -*-

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import hashlib
import os
import os.path
import pickle
import random
import re
import struct
import sys
import tarfile
from datetime import datetime

import numpy as np
from six.moves import urllib
import tensorflow as tf
from tensorflow.python.framework import graph_util
from tensorflow.python.framework import tensor_shape
from tensorflow.python.platform import gfile
from tensorflow.python.util import compat

FLAGS = None
BOTTLENECK_TENSOR_NAME = 'pool_3/_reshape:0'
BOTTLENECK_TENSOR_SIZE = 2048
MODEL_INPUT_WIDTH = 299
MODEL_INPUT_HEIGHT = 299
MODEL_INPUT_DEPTH = 3
JPEG_DATA_TENSOR_NAME = 'DecodeJpeg/contents:0'
RESIZED_INPUT_TENSOR_NAME = 'ResizeBilinear:0'
MAX_NUM_IMAGES_PER_CLASS = 2 ** 27 - 1  # ~134M

TRAIN_FILE = "train.p"
VALID_FILE = "valid.p"
TEST_FILE = "test.p"


def get_data(image_dir):
  """Builds a list of training images from the file system.

  Analyzes the sub folders in the image directory, splits them into stable
  training, testing, and validation sets, and returns a data structure
  describing the lists of images for each label and their paths.

  Args:
    image_dir: String path to a folder containing subfolders of images.
    testing_percentage: Integer percentage of the images to reserve for tests.
    validation_percentage: Integer percentage of images reserved for validation.

  Returns:
    A dictionary containing an entry for each label subfolder, with images
    split into training, testing, and validation sets within each label.
  """
  if not gfile.Exists(image_dir):
    print("Image directory '" + image_dir + "' not found.")
    return None
  result = {}
  sub_dirs = [x[0] for x in gfile.Walk(image_dir)]
  # The root directory comes first, so skip it.
  is_root_dir = True
  for sub_dir in sub_dirs:
    if is_root_dir:
      is_root_dir = False
      continue
    extensions = ['jpg', 'jpeg', 'JPG', 'JPEG']
    file_list = []
    dir_name = os.path.basename(sub_dir)
    if dir_name == image_dir:
      continue
    print("Looking for images in '" + dir_name + "'")
    for extension in extensions:
      file_glob = os.path.join(image_dir, dir_name, '*.' + extension)
      file_list.extend(gfile.Glob(file_glob))
    if not file_list:
      print('No files found')
      continue
    if len(file_list) < 20:
      print('WARNING: Folder has less than 20 images, which may cause issues.')
    elif len(file_list) > MAX_NUM_IMAGES_PER_CLASS:
      print('WARNING: Folder {} has more than {} images. Some images will '
            'never be selected.'.format(dir_name, MAX_NUM_IMAGES_PER_CLASS))
    label_name = re.sub(r'[^a-z0-9]+', ' ', dir_name.lower())
    training_images = []
    testing_images = []
    validation_images = []
    for file_name in file_list:
      base_name = os.path.basename(file_name)
      # We want to ignore anything after 'nohash' in the file name when
      # deciding which set to put an image in, the data set creator has a way of
      # grouping photos that are close variations of each other. For example
      # this is used in the plant disease data set to group multiple pictures of
      # the same leaf.
      hash_name = re.sub(r'_nohash_.*$', '', file_name)
      # This looks a bit magical, but we need to decide whether this file should
      # go into the training, testing, or validation sets, and we want to keep
      # existing files in the same set even if more files are subsequently
      # added.
      # To do that, we need a stable way of deciding based on just the file name
      # itself, so we do a hash of that and then use that to generate a
      # probability value that we use to assign it.
      hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
      percentage_hash = ((int(hash_name_hashed, 16) %
                          (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                         (100.0 / MAX_NUM_IMAGES_PER_CLASS))
      if percentage_hash < 100:
        validation_images.append(base_name)
      elif percentage_hash < (testing_percentage + 100):
        testing_images.append(base_name)
      else:
        training_images.append(base_name)
    result[label_name] = {
        'dir': dir_name,
        'training': training_images,
        'testing': testing_images,
        'validation': validation_images,
    }

  # Retrieve all data
  X_train, y_train = training_images['features'], training_images['labels']
  X_valid, y_valid = validation_images['features'], validation_images['labels']
  X_test, y_test = testing_images['features'], testing_images['labels']

  return X_train, y_train, X_valid, y_valid, X_test, y_test
```
mrinal18 commented 6 years ago

This is the code I wrote for the above case, but it is not working.

Ahmedest61 commented 6 years ago

@thibo73800 How did you make the pickle files for training and testing using the images? Can you share the code for that?

Ahmedest61 commented 6 years ago

What exactly do you mean by features? Did you preprocess the datasets by running some third-party feature extractor on them? Can you please share what exactly these "features" lists represent?

thibo73800 commented 6 years ago

@Ahmedest61 The features are simply the pixels of the images; there is no preprocessing involved. You can directly plot (with matplotlib) one image using the features field.
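For example (assuming X_train and y_train come from get_data as shown above), a quick sanity check could be:

```python
import matplotlib.pyplot as plt

# X_train[0] is a 32x32x3 array of raw pixels, y_train[0] its class id
plt.imshow(X_train[0])
plt.title("label: {}".format(y_train[0]))
plt.axis("off")
plt.show()
```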

The pickle files are not mine; they come from an online public dataset. The steps to download the dataset are described in the README.

Ahmedest61 commented 6 years ago

Thanks for the reply. I actually want to test my own dataset, which is in the form of jpg files, so that's why I wanted to know how you encoded the images into pickle files. So, what I understand is that "features" can be considered as numpy arrays of the images, in Python terms?

thibo73800 commented 6 years ago

Yes exactly.

You don't even have to use pickle. Just open all your images, add the pixels into one numpy array and the label of each image into another. Then split them into train/test/validation.

Finally, you should come up with six variables: X_train, y_train, X_valid, y_valid, X_test, y_test.
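Alternatively, if you would rather keep the original get_data untouched and instead produce pickle files in the same format (a dict with 'features' and 'labels' keys, as shown above), a minimal sketch could be the following, assuming you have already built the six arrays from your own images:

```python
import pickle
import numpy as np

# X_* are arrays of images, y_* are arrays of integer labels,
# built beforehand from your own image folders
for file_name, X, y in [("train.p", X_train, y_train),
                        ("valid.p", X_valid, y_valid),
                        ("test.p", X_test, y_test)]:
    with open(file_name, "wb") as f:
        pickle.dump({"features": np.asarray(X), "labels": np.asarray(y)}, f)
```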

Ahmedest61 commented 6 years ago

@thibo73800 Many thanks. In train.py, you extract the features and labels from test.p but never use them? And in the training pipeline, will the "while" loop keep running forever? Can you please share the logic behind it, since after 1000 iterations it saves a new model and starts training and validation again? Do I have to stop it manually to test the saved model? I'm actually new to TensorFlow, so I would really appreciate it if you could explain what is actually going on inside. :-)

thibo73800 commented 6 years ago

Yes, I do not use the test part of the dataset in train.py. The logic is to train the model on the training set while tweaking the hyperparameters to maximize the accuracy on both the training and validation sets; the testing set is never taken into account when changing the hyperparameters. I evaluate on the testing set in another file, test.py.

I use an infinite while loop to do manual "early stopping". I watch the progress of the training on TensorBoard to get an overview of the evolution, but inside the while loop the model is saved only if progress is made on the validation set:

```python
if best_validation_loss is None or loss < best_validation_loss:
    best_validation_loss = loss
    model.save()
```
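In other words, the training loop pattern is roughly the following (an illustrative sketch only: train_one_batch, validate and save_checkpoint are placeholders, not the actual functions in train.py):

```python
def train_one_batch():
    """Placeholder for one optimization step on a training batch."""

def validate():
    """Placeholder for a pass over the validation set; returns the loss."""
    return 0.5

def save_checkpoint():
    """Placeholder for model.save()."""
    print("checkpoint saved")

best_validation_loss = None
step = 0
while True:  # stopped manually, based on the curves shown in TensorBoard
    train_one_batch()
    step += 1
    if step % 1000 == 0:
        loss = validate()
        # Keep a checkpoint only when the validation loss improves
        if best_validation_loss is None or loss < best_validation_loss:
            best_validation_loss = loss
            save_checkpoint()
```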
harryeee commented 6 years ago

@Ahmedest61 Hi, have you managed to adapt your dataset to this interface?

thibo73800 commented 6 years ago

@Ahmedest61 Can you create a new issue for your question and give further information about it there?

Thanks :)

Ahmedest61 commented 6 years ago

@harryeee Nope, I have not reached testing yet; I'm still playing with the training. @thibo73800 Sure. Can you please tell me one more thing: if I would like to change the dimensions of the images from 32x32 to 128x128, is _build_input() the only function inside your model where I need to change the dimensions, or are there other functions too?

thibo73800 commented 6 years ago

@Ahmedest61 Yes, _build_input. If you want to include the reconstruction loss, you also have to change the size of the reconstruction. I think there is nothing else.
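For reference, and assuming TensorFlow 1.x as used here, the change essentially amounts to the input placeholder shapes (a hypothetical sketch, not the exact code of _build_input in this repository):

```python
import tensorflow as tf

# Hypothetical input placeholders sized for 128x128 RGB images instead of 32x32
images = tf.placeholder(tf.float32, shape=[None, 128, 128, 3], name="images")
labels = tf.placeholder(tf.int64, shape=[None], name="labels")
```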