Closed. smil4ever closed this issue 7 years ago.
The ImageNet dataset ILSVRC12 is up to 160 GB, so it is not feasible to run the `pre_process` function every time to preprocess the data. Therefore, you should support reading directly from a leveldb or lmdb database. Also, I did not find a leveldb conversion tool for Caffe or Caffe2.
Hi @smil4ever,
The GoogleNet model I use is pretrained and downloaded from https://github.com/caffe2/models/tree/master/bvlc_googlenet. I do not have the original dataset that was used for training this model. Indeed, it's probably in the hundreds of GBs.
The `train` and `retrain` examples generate an internal database that is used during training. By default this is a leveldb, but you can use lmdb if you like by adding `--db_type lmdb`. I've tested this preprocessing with databases up to 1 GB. The preprocessing magic all happens in `misc.h`.
I do not know of a conversion tool for databases. Why do you need one? I figure the format of these is the same for both Caffe and Caffe2.
Well, thank you. Can `train` read directly from common LMDB files via a command-line option like `--dataset=xxx/xxx/xx_train_lmdb`, with no preprocessing? If not, can you provide a method to implement that function? Thanks a lot!
I have not implemented such a feature, as it would be slightly more complicated than just an extra command-line option. Preprocessing generates three databases: train, validate, and test. It also randomizes the order of samples and scales/crops the images. Then there's the data layout (NCHW vs NHWC), float or int values, and the scaling of these values. On top of that, tensors are stored uncompressed, which easily makes them 10x larger than jpg files. In comparison, just some folders with jpegs is a lot easier.
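To see where the ~10x size blowup comes from, here is a back-of-the-envelope calculation (the helper function is my own illustration, not tutorial code):

```cpp
#include <cstddef>

// Rough size of one preprocessed image tensor as stored in the database:
// channels * height * width * sizeof(float), uncompressed.
// Uses the 224x224, 3-channel, float32 format discussed in this thread.
constexpr std::size_t tensor_bytes(std::size_t channels, std::size_t height,
                                   std::size_t width) {
  return channels * height * width * sizeof(float);
}

// 3 * 224 * 224 * 4 = 602,112 bytes, i.e. roughly 588 KiB per image,
// versus on the order of 50-150 KB for a typical ImageNet jpeg.
```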
Instead, I would like to encourage you to try modifying the `train` example yourself. You can skip the preprocessing step by removing

    pre_process(image_files, db_paths, FLAGS_db_type, FLAGS_size_to_fit);

and modifying the line

    db_paths[i] = path_prefix + name_for_run[i] + ".db";

to point to your database files for training, validation, and testing. Also, set `db_type` to `lmdb`. That should be all you need to do to get `train.cc` to use your database directly.
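Put together, the change could look roughly like this. `db_paths`, `name_for_run`, and the removed `pre_process` call come from the tutorial, but the surrounding code in `train.cc` may differ, and the database paths below are placeholders for your own files:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sketch of the relevant part of train.cc after the change: the
// pre_process(...) call is gone and db_paths points straight at
// pre-existing lmdb databases (placeholder paths, adjust to yours).
std::vector<std::string> make_db_paths() {
  // One database per run, in the order the tutorial uses.
  std::vector<std::string> name_for_run = {"train", "validate", "test"};
  std::vector<std::string> db_paths(name_for_run.size());
  for (std::size_t i = 0; i < name_for_run.size(); ++i) {
    // Instead of path_prefix + name_for_run[i] + ".db", point directly
    // at your own lmdb databases:
    db_paths[i] = "/data/imagenet_" + name_for_run[i] + "_lmdb";
  }
  // pre_process(image_files, db_paths, FLAGS_db_type, FLAGS_size_to_fit);
  // ^ removed: the databases already exist. Also run with --db_type lmdb.
  return db_paths;
}
```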
Note that I distribute the images over these: 70% for training, 20% for validation, and 10% for testing; see `percentage_for_run` in `misc.h`. Make sure to distribute your images only once, to avoid validation and test data leaking into the training process.
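A minimal sketch of such a one-time split follows. The 70/20/10 percentages mirror `percentage_for_run`, but the function itself is my own illustration, not code from `misc.h`:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assign each image to exactly one run so validation and test data never
// leak into training. Returns {train, validate, test} file lists.
// (Shuffle `files` beforehand if you also want a random ordering.)
std::vector<std::vector<std::string>> split_images(
    const std::vector<std::string> &files) {
  const std::size_t percentage_for_run[] = {70, 20, 10};
  std::vector<std::vector<std::string>> runs(3);
  std::size_t start = 0;
  for (std::size_t r = 0; r < 3; ++r) {
    std::size_t count = files.size() * percentage_for_run[r] / 100;
    for (std::size_t i = start; i < start + count && i < files.size(); ++i) {
      runs[r].push_back(files[i]);
    }
    start += count;
  }
  // Any leftover from integer division goes to training.
  for (std::size_t i = start; i < files.size(); ++i) {
    runs[0].push_back(files[i]);
  }
  return runs;
}
```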
And then there's the format of the image inside the tensor. Use 224x224 images (`size_to_fit`), NCHW order, 32-bit floats, with values between -128 and 128. You could do this image manipulation with Caffe2 operators, but I preferred to do it outside of Caffe2 and have the images in the database ready for training.
Cheers!
Your work is very good! The tools `retrain` and `train` both cannot read directly from an lmdb-format dataset, but many datasets are in lmdb. When I train GoogleNet on my custom dataset, I get some errors, so I think my dataset has some errors. So I want to know what your dataset is. Or where can I get the standard "ILSVRC 2014 > GoogleNet/Inception" dataset? Or can you share a small part of your training dataset? Or explain how to make a dataset from my own images?