Hi, did you figure out anything about this issue?
I also have the same impression. Actually, I am almost sure that this implementation is object-wise (I don't know the reason; maybe it gives better performance because you can backpropagate object by object). The point is that the Trainer abstraction makes it hard to understand how the training is done (e.g. object-wise or image-wise). I am now trying to remove the Trainer abstraction so that it is easier to see what is going on.
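Roughly, what I have in mind is an explicit loop like this (just a sketch; the model class, dataset, and hyperparameters are placeholders, not the actual code of this repo):

import chainer
from chainer import optimizers
from chainer.dataset import concat_examples

model = FasterRCNN()  # placeholder: this repo's Faster R-CNN model class
optimizer = optimizers.MomentumSGD(lr=0.001, momentum=0.9)
optimizer.setup(model)

# 'dataset' is whatever dataset object you train on
train_iter = chainer.iterators.SerialIterator(dataset, batch_size=1, repeat=True)

max_iterations = 70000  # placeholder schedule
for iteration in range(max_iterations):
    batch = train_iter.next()
    img, im_info, gt_boxes = concat_examples(batch)
    # assumes the forward pass returns the training loss in train mode
    loss = model(img, im_info, gt_boxes)
    model.cleargrads()
    loss.backward()
    optimizer.update()

With a loop like this it is at least obvious that each iteration corresponds to one dataset entry.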
It's not just an impression, it's a fact. The good news is that it comes from how the dataset is constructed rather than from a restriction in the model itself. You can simply modify the dataset (the VOC class) so that it is per-image instead of per-object, e.g.:
# Standard-library / third-party imports the snippet needs; VOC is the
# dataset class defined in this repository.
import glob
import os
import xml.etree.ElementTree as ET

import cv2 as cv
import numpy as np


class NonFlattenedVOC(VOC):

    def __init__(self, img_dir, anno_dir, list_dir, list_suffix, use_diff=False):
        super(NonFlattenedVOC, self).__init__(
            img_dir, anno_dir, list_dir, list_suffix, use_diff=use_diff)

    def parse_anno(self):
        # Group annotations per image instead of creating one entry per object
        self.images = []
        for fn in glob.glob('{}/*.xml'.format(self.anno_dir)):
            tree = ET.parse(fn)
            filename = tree.find('filename').text
            img_id = os.path.splitext(filename)[0]
            if img_id not in self.use_list:
                continue
            objects = []
            for obj in tree.findall('object'):
                if not self.use_diff and int(obj.find('difficult').text) == 1:
                    continue
                bb = obj.find('bndbox')
                bbox = [int(bb.find('xmin').text), int(bb.find('ymin').text),
                        int(bb.find('xmax').text), int(bb.find('ymax').text)]
                # Make pixel indexes 0-based
                bbox = [float(b - 1) for b in bbox]
                objdata = {
                    'name': obj.find('name').text.lower().strip(),
                    'pose': obj.find('pose').text.lower().strip(),
                    'truncated': int(obj.find('truncated').text),
                    'difficult': int(obj.find('difficult').text),
                    'bndbox': bbox,
                }
                objects.append(objdata)
            imgdata = {
                'filename': filename,
                'objects': objects,
            }
            self.images.append(imgdata)

    def __len__(self):
        return len(self.images)

    def get_example(self, i):
        # self.mean, self.IMG_TARGET_SIZE, self.IMG_MAX_SIZE and self.LABELS
        # are inherited from the base VOC class
        imgdata = self.images[i]

        # Load image and subtract the mean
        img_fn = '{}/{}'.format(self.img_dir, imgdata['filename'])
        img = cv.imread(img_fn).astype(np.float64)
        img -= self.mean

        # Scale the shorter side to IMG_TARGET_SIZE
        im_size_min = np.min(img.shape[:2])
        im_size_max = np.max(img.shape[:2])
        im_scale = float(self.IMG_TARGET_SIZE) / float(im_size_min)
        # Prevent the biggest axis from exceeding IMG_MAX_SIZE
        if np.round(im_scale * im_size_max) > self.IMG_MAX_SIZE:
            im_scale = float(self.IMG_MAX_SIZE) / float(im_size_max)
        img = cv.resize(img, None, None, fx=im_scale, fy=im_scale,
                        interpolation=cv.INTER_LINEAR)
        h, w = img.shape[:2]
        im_info = np.asarray([h, w, im_scale], dtype=np.float32)
        img = img.transpose(2, 0, 1).astype(np.float32)

        # Ground-truth boxes: (x_min, y_min, x_max, y_max, label) per object
        gt_boxes = [tuple(obj['bndbox']) + (self.LABELS.index(obj['name']),)
                    for obj in imgdata['objects']]
        gt_boxes = np.array(gt_boxes, dtype=np.float32)

        return img, im_info, gt_boxes
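For what it's worth, here is roughly how I use it (a sketch only; the directory paths and constructor arguments are placeholders following the standard VOC2007 layout, not the repo's actual configuration):

import chainer

dataset = NonFlattenedVOC(
    img_dir='data/VOCdevkit/VOC2007/JPEGImages',
    anno_dir='data/VOCdevkit/VOC2007/Annotations',
    list_dir='data/VOCdevkit/VOC2007/ImageSets/Main',
    list_suffix='trainval',
    use_diff=False)

# Each example is now one image together with all of its ground-truth boxes
img, im_info, gt_boxes = dataset.get_example(0)
print(img.shape, im_info, gt_boxes.shape)  # gt_boxes: (n_objects, 5)

# Faster R-CNN is trained with one image per mini-batch
train_iter = chainer.iterators.SerialIterator(dataset, batch_size=1)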
You also have to add these lines:
if gt_boxes is not None:
    gt_boxes = gt_boxes.reshape((-1, 5))
right under this statement on line 52 of models/faster_rcnn.py:
if self.train:
    im_info = im_info.data
    gt_boxes = gt_boxes.data
Otherwise gt_boxes will be 3-D (I suspect the optimizer's converter method adds the leading batch dimension, but I'm not sure).
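To see why the reshape is needed, here is a tiny check (a sketch; it assumes the batch is built with chainer.dataset.concat_examples, which stacks examples along a new leading axis):

import numpy as np
from chainer.dataset import concat_examples

# Fake ground-truth boxes for one image with 3 objects: (x1, y1, x2, y2, label)
gt_boxes = np.zeros((3, 5), dtype=np.float32)

# With batch_size=1 the converter stacks the single example, adding a batch axis
batched = concat_examples([gt_boxes])
print(batched.shape)                    # (1, 3, 5) -> 3-D

# The added reshape flattens it back to the (n_boxes, 5) layout the model expects
print(batched.reshape((-1, 5)).shape)   # (3, 5)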
The awkward part is that, so far, I have not been able to get the model to converge with either the "flattened" or the "non-flattened" dataset...
@manuelschmidt @apple2373 @paleckar Hi, thank you for trying this chainer-faster-rcnn code. We've released a new, clean, simple, and reliable (it reproduces the paper's results) implementation of Faster R-CNN inference and training as ChainerCV: https://github.com/pfnet/chainercv . I think the problem discussed here has been solved in ChainerCV. Please check the ChainerCV repo!
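For example, running a pretrained detector is roughly this simple (a minimal inference sketch; please check the ChainerCV documentation for the exact, current API):

from chainercv.links import FasterRCNNVGG16
from chainercv.utils import read_image
from chainercv.visualizations import vis_bbox

# Faster R-CNN (VGG-16 backbone) pre-trained on PASCAL VOC 2007
model = FasterRCNNVGG16(pretrained_model='voc07')

img = read_image('sample.jpg')                 # CHW, RGB, float32
bboxes, labels, scores = model.predict([img])
vis_bbox(img, bboxes[0], labels[0], scores[0])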
Hello,
I looked a bit into your code because I wanted to train on another dataset just to try it out and see how Chainer's Faster-RCNN version works compared to the original one.
To me it seems that your training cycle is different from the one used in the original implementation. More specifically: you go through all images and then create a dataset entry for every object in each image. You don't build up a per-image matrix as in the python-caffe implementation. Training image-wise would save a lot of computation, since you would not have to backpropagate the same image once for every object it contains. Instead you would propagate once, learn the bbox and cls regression, and move on to the next image. For PASCAL VOC the difference might not be big, since most images contain only one object. But for datasets with around 10 objects per image, I think you could be at least around 8 times faster by changing the training.
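In pseudocode, the difference I mean is roughly this (just a sketch to illustrate the point; the dataset and model names are placeholders, not the actual code of either implementation):

# Object-wise (what this repo seems to do): one dataset entry per object,
# so the same image is forwarded and backpropagated once per object
for img, im_info, single_gt_box in object_wise_dataset:
    loss = model(img, im_info, single_gt_box)
    model.cleargrads()
    loss.backward()
    optimizer.update()

# Image-wise (what py-faster-rcnn does): one entry per image, all
# ground-truth boxes passed at once, so each image is processed only once
for img, im_info, all_gt_boxes in image_wise_dataset:
    loss = model(img, im_info, all_gt_boxes)
    model.cleargrads()
    loss.backward()
    optimizer.update()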
Or am I misinterpreting something here? In that case I would like to understand the code a bit better, and it would be great if you could point me to the parts of the code where you merge objects and images back together for training.
Thanks for this great code - I quite enjoyed working with it since it's very readable thanks to Chainer's clean style.
Best regards and thanks
Manuel