First of all, how did you compile Caffe itself? It looks like CUDA compilation is enabled there; remove it if no NVIDIA GPU is present, as this may be one of the issues. Second, check which device IDs are available by running ./caffe_neural_tool --devices and then edit train.sh to use the correct GPU. At the moment it is set to GPU no. 3, which might be wrong. Also make sure you use either an NVIDIA GPU, an AMD GPU or a non-NUMA CPU, as NUMA processors have cache-invalidation issues with the multicore backend of Caffe. Device fission might fix this in the future, but currently single-CPU systems are much faster than any NUMA (dual-CPU) system. Compile with ViennaCL, but use clBLAS as the BLAS for AMD GPUs, cuBLAS for NVIDIA GPUs, and OpenBLAS or MKL for CPUs.
You might also want to start training with an SK or U network example, as USK is not very fine-tuned (yet).
Thanks. I was looking at the Makefile.config files again and realized that I was linking against the wrong CUDA version for caffe_neural_tool.
Now, to try it out, which is the easiest example for training and testing? When I try to run ./train.sh I get some errors, but I guess there is a simple example somewhere that should work.
One last question: in the ground truth images, is it possible to ignore a particular label? This is to deal with partially labeled data (for example, most pixels in a training image may not be labeled).
Yes, this is possible. The pixel label to ignore can be set in the network prototxt, in the softmax loss_param, as "ignore_label". In that case, though, don't use the masking parameter provided by my tool, as that sets pixel labels to -1 in order to ignore them.
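For illustration, here is a minimal sketch of such a loss layer (assuming a SoftmaxWithLoss layer with bottoms "ip3" and "label", as in the example networks that come up later in this thread; the value 2 stands in for whichever consolidated label should be ignored):
layer {
  include: { phase: TRAIN }
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip3"
  bottom: "label"
  loss_param {
    ignore_label: 2    # the consolidated label value to skip in the loss
    normalize: true
  }
}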
The easiest example is to run the existing train/process setup for the SK network. Always make sure the relative paths fit and all necessary folders exist. Just send me the error logs if you can't figure something out on your own.
Many more details, though, will only be available after the 24th of August, once I'm done writing my thesis.
Thanks. For example, I get the following error:
caffe_neural_models/dataset_01 (master ✘)✹ ᐅ bash train.sh
I0717 15:07:19.546991 10671 caffe_neural_tool.cpp:113] Training mode.
F0717 15:07:19.547129 10671 train.cpp:17] Train parameter index does not exist.
*** Check failure stack trace: ***
@ 0x7fa32102da1d google::LogMessage::Fail()
@ 0x7fa32102f8bd google::LogMessage::SendToLog()
@ 0x7fa32102d60c google::LogMessage::Flush()
@ 0x7fa3210301de google::LogMessageFatal::~LogMessageFatal()
@ 0x4a3ed3 caffe_neural::Train()
@ 0x46087d main
@ 0x7fa31f6df76d (unknown)
@ 0x412e49 (unknown)
train.sh: line 1: 10671 Aborted (core dumped) ./../../caffe_neural_tool/build/caffe_neural_tool --gpu 3 --train 4 --proto 'train_process_usk_2.prototxt'
I am not sure where to find the SK network train.sh, so I tried with the first example I found.
Use in dataset_01: --gpu 0 --train 0 --proto 'train_process_sk_9.prototxt'
Then use the snapshot after for example 10'000 steps: --gpu 0 --process 0 --proto 'train_process_sk_9.prototxt'
and make sure there is something to process in a folder called 'input' as I only provide training/ground truth in the repository and no test data.
Adapt train_process_sk_9.prototxt if necessary. It should be straightforward, and the functions can easily be traced back in the code until my full documentation is available.
that's great, thanks, I will give it a try.
I managed to make it work with the examples. I am trying with my own data now, which has labels 0, 1 and 255; I should ignore 255 but use 0 and 1 for a binary problem.
If I exclude the label from the label_consolidate configuration, will it still try to use it? I tried with ignore_label, which is -1, but in that case I was getting nan as the loss.
OK I actually never thought of trying it that way - let me figure out where the bug stems from. Maybe I can fix it quickly or give you the correct settings.
thanks, that's great. Also, if you have a paper I can cite for your work (or the thesis afterwards), let me know.
There may be a paper, but that's going to take a while. Until then, you can cite it by name and GitHub URL. After the 24th of August I'll also post a link to the thesis somewhere.
Are your labels integer values in the image? In grayscale images, the tool will typically assign 0 to the lowest integer label value, 1 to the next one, and so on. Something seems to be going wrong then; it should map 0 to 0, 1 to 1 and 255 to 2. It's best to exclude label_consolidate if you don't use it: just remove the whole block from the processing. And change ignore_label to the actual label you want to ignore in neuraltissue_net.prototxt, here:
layer {
  include: { phase: TRAIN }
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip3"
  bottom: "label"
  loss_param {
    ignore_label: 2
    normalize: true
  }
}
Also remember to set the correct number of output labels (the last layer's output number) in both prototxt files. Try with both 2 (what you want) and 3 (maybe it still needs to see the ignored label).
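To illustrate what "last layer output number" refers to, a sketch only (the layer name "ip3" matches the loss bottom above, but its actual type, bottom name and parameters depend on the concrete network definition):
layer {
  name: "ip3"
  type: "Convolution"          # assumption: could also be an InnerProduct layer, depending on the network
  bottom: "ip2"                # hypothetical name of the preceding blob
  top: "ip3"
  convolution_param {
    num_output: 2              # number of output labels: try 2 first, then 3 if the ignored label must still be predicted
    kernel_size: 1
  }
}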
oh ok I see, I think I am understanding now. Would 255 in the image be taken as label 255, or label 2?
@cbecker You're right, I fixed it now. Also, the debug output shows how often each label has been seen during training. The fact that you see zeros for most labels except one indicates that something is going wrong; the NaN stems from a division by zero.
How can I check how many labels it is detecting, and how many samples per label? I am training now and my dataset is very skewed, so the loss goes way up, then down:
I0717 18:10:23.551026 7473 solver.cpp:224] Iteration 2200, loss = 25.6082
I0717 18:10:23.551105 7473 solver.cpp:501] Iteration 2200, lr = 0.00086145
I0717 18:10:37.191097 7473 solver.cpp:224] Iteration 2250, loss = 27.6978
I0717 18:10:37.191189 7473 solver.cpp:501] Iteration 2250, lr = 0.000858812
I0717 18:10:50.838942 7473 solver.cpp:224] Iteration 2300, loss = 43.4337
I0717 18:10:50.839028 7473 solver.cpp:501] Iteration 2300, lr = 0.000856192
I0717 18:11:04.493456 7473 solver.cpp:224] Iteration 2350, loss = 32.9644
I0717 18:11:04.493532 7473 solver.cpp:501] Iteration 2350, lr = 0.000853591
I0717 18:11:18.124866 7473 solver.cpp:224] Iteration 2400, loss = 49.9584
I0717 18:11:18.124938 7473 solver.cpp:501] Iteration 2400, lr = 0.000851008
I0717 18:11:31.758568 7473 solver.cpp:224] Iteration 2450, loss = 33.9239
I0717 18:11:31.758651 7473 solver.cpp:501] Iteration 2450, lr = 0.000848444
I0717 18:11:45.394291 7473 solver.cpp:224] Iteration 2500, loss = 13.3905
I0717 18:11:45.394366 7473 solver.cpp:501] Iteration 2500, lr = 0.000845897
I0717 18:11:58.996084 7473 solver.cpp:224] Iteration 2550, loss = 52.0906
I0717 18:11:58.996172 7473 solver.cpp:501] Iteration 2550, lr = 0.000843368
I0717 18:12:12.502898 7473 solver.cpp:224] Iteration 2600, loss = 50.3209
I0717 18:12:12.502974 7473 solver.cpp:501] Iteration 2600, lr = 0.000840857
I0717 18:12:25.900024 7473 solver.cpp:224] Iteration 2650, loss = 21.3011
I0717 18:12:25.900107 7473 solver.cpp:501] Iteration 2650, lr = 0.000838363
I0717 18:12:39.306751 7473 solver.cpp:224] Iteration 2700, loss = 4.43506
I0717 18:12:39.306835 7473 solver.cpp:501] Iteration 2700, lr = 0.000835886
I0717 18:12:52.684272 7473 solver.cpp:224] Iteration 2750, loss = 85.4389
I0717 18:12:52.684355 7473 solver.cpp:501] Iteration 2750, lr = 0.000833427
I0717 18:13:06.066045 7473 solver.cpp:224] Iteration 2800, loss = 11.0876
I0717 18:13:06.066126 7473 solver.cpp:501] Iteration 2800, lr = 0.000830984
I0717 18:13:19.453717 7473 solver.cpp:224] Iteration 2850, loss = 28.2095
I0717 18:13:19.453801 7473 solver.cpp:501] Iteration 2850, lr = 0.000828558
I0717 18:13:32.837424 7473 solver.cpp:224] Iteration 2900, loss = 5.7144
I0717 18:13:32.837505 7473 solver.cpp:501] Iteration 2900, lr = 0.000826148
I0717 18:13:46.211272 7473 solver.cpp:224] Iteration 2950, loss = 4.69093
I0717 18:13:46.211349 7473 solver.cpp:501] Iteration 2950, lr = 0.000823754
Enable debugging for the caffe_neural_tool if you want to see the label counts: add --debug, and --graphic if you want to see a bit of what's going on. Maybe you should also decrease the learning rate (0.0001) and increase the momentum (0.99) in neuraltissue_solver.prototxt for this dataset.
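For example, the relevant excerpt of neuraltissue_solver.prototxt with the suggested values would read:
# excerpt of neuraltissue_solver.prototxt with the suggested values
base_lr: 0.0001
momentum: 0.99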
Feel free to fork and add new stuff to this tool if it helps you. I will review and accept pull requests.
Thanks ;)
I'm confused, because it looks as if labels 1 and 2 are there, but 0 is not:
Label: 0, 0
Label: 1, 6534
Label: 2, 574
and my labels TIF file has values 0, 1 and 255.
If you enable the patch prior and masking function, what initial statistics does the tool output when starting the training?
Ohh, I think I see what I did there. Label 0 is actually -1 if masking is enabled. So... label 1 is your label 0 and label 2 is your label 1 in those statistics, and all is fine. I need to fix that when I find time. I also assume you used a label count of 2 instead of 3; otherwise you'd see a "Label: 3, xxx" entry, where xxx is the number of 255-valued pixels you have in the image.
I suggest you set the number of labels to 3 in the tool and to 2 in the networks (excerpt below). This should, according to my code review just now, fix all issues.
The actual code snippet responsible, as proof:
// TODO: Only enable in debug or statistics mode
for (int y = 0; y < patch_size; ++y) {
  for (int x = 0; x < patch_size; ++x) {
    // The +1 shifts every label up by one in the statistics,
    // so label 0 shows up as "Label: 1", label 1 as "Label: 2", and so on.
    labelcounter[patch[1].at<float>(y, x) + 1] += 1;
  }
}
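Concretely, the label count lives in the tool's train input block (the full configuration appears later in this thread); a minimal excerpt:
train {
  input {
    labels: 3    # 0, 1 and the ignored label (255) are all counted here
    # ...
  }
}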
Hi, Thanks.
Actually, if I set the number of classes to 3, the training loss never goes down, so I am using 2. But if I disable CLAHE normalization, it doesn't converge either.
do you have any idea of what could be happening? The config files are here: http://pastebin.com/L0NphqbP
EDIT: it seems that masking: false solves it.
However, what is the effect of histeq in that case? I want to avoid histogram equalizing the patches or images.
thanks
Yes, you should disable masking on your dataset, and remove the histogram equalization block (shown below). Histeq turned out to be useful on the ISBI 2012 dataset and on FlyEM data with 9 labels, but if the data is only partially masked or labelled for some reason, the histogram equalization will interfere heavily.
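The block in question, as it appears inside preprocessor { } in the example configurations, looks like this; delete the whole block:
histeq {
  patch_prior: true    # prioritizes patches containing rare labels
  masking: false       # when true, this is what sets ignored pixels to -1
}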
I apologize for how difficult it currently is to set up. The parameters need to be vastly different for different kinds of datasets.
Thanks. When I disable histeq, I get nan after 50 iterations. Do you think this is related to patch normalization and the learning rate?
I0718 23:03:19.342078 10622 solver.cpp:224] Iteration 0, loss = 0.693077
I0718 23:03:19.342238 10622 solver.cpp:501] Iteration 0, lr = 0.001
I0718 23:03:29.735518 10622 solver.cpp:224] Iteration 50, loss = nan
I0718 23:03:29.735569 10622 solver.cpp:501] Iteration 50, lr = 0.000996266
I0718 23:03:40.074090 10622 solver.cpp:224] Iteration 100, loss = nan
I0718 23:03:40.074154 10622 solver.cpp:501] Iteration 100, lr = 0.000992565
I0718 23:03:50.432718 10622 solver.cpp:224] Iteration 150, loss = nan
I0718 23:03:50.432777 10622 solver.cpp:501] Iteration 150, lr = 0.000988896
I0718 23:04:01.384943 10622 solver.cpp:224] Iteration 200, loss = nan
I0718 23:04:01.385005 10622 solver.cpp:501] Iteration 200, lr = 0.000985258
Are there patches in your dataset that lack labels 0 and 1 completely and only expose label 2, or some other such odd combination? If that is the case, this might well be the problem: the Caffe library might divide by zero at some point, namely when normalizing the loss by the number of valid labels present, which would be 0 when only label 2 is seen. In that case, keeping patch_prior set to true and masking to false should not give NaN; is that correct?
exactly, some parts have only label 2 (ignore).
yes, if I set patch_prior to true and masking to false, it seems to work. Will it do histogram equalization if masking = false? I want to make sure I disable histogram equalization, as I think it is hurting in my case.
It will prioritize patches with rare labels if the patch prior is enabled. This equalizes the histogram of labels slightly, but not completely. It could even be that this sets the priority of patches containing only label 2 to zero. But beware, I did not test this behavior exactly.
If you want to be very sure, the best thing would be either to fix the loss function in Caffe to avoid the division by zero and then disable the patch prior again, or to fix the neural tool so it does not expose Caffe to patches with only invalid/ignored labels.
I could fix the loss function if you want.
I actually fixed it just now in my Caffe branch, if you want to try again. Just make sure to recompile both Caffe and the tool.
I see, I am starting to understand better now.
ok great. So I should just disable the histeq block?
Yes remove it and try again with the updated code.
Fix:
// Fix the division by zero bug
if (count > 0) {
  caffe_gpu_scal(prob_.count(), loss_weight / count, bottom_diff);
}
count was zero here with your labels, making the loss go NaN.
actually, it still throws nan now, and it does so right at the beginning already.
let me know if I can help with any other debugging info.
I can also send you an example image and files, if that helps.
I tried something else now. I don't know if that would help, as I pretty much know where the error comes from: it really is the division by zero, or generally zero valid labels being present, which Caffe currently does not handle.
Thanks, though it's still not working. In case it helps, here is a working example for training: http://cvlabwww.epfl.ch/~cjbecker/tmp/test.tar.gz
I used it like this now, with the newest version of Caffe and the tool, and I'm not getting NaN anymore:
# The training protocol buffer definition
train_net: "neuraltissue_net.prototxt"
########################################################################
# The testing protocol buffer definition
# test_net: "../net_sk_2out/neuraltissue_net.prototxt"
########################################################################
# Test_iter specifies how many forward passes the test should carry out.
# it is the number of batches shown, then
# examples shown = 'test_iter'*batch_size
# Carry out testing every 'test_interval' training iterations.
# test_iter: 1000
# test_interval: 500
########################################################################
# The base learning rate, momentum and the weight decay of the network.
# base_lr: 0.05
base_lr: 0.001
momentum: 0.9
weight_decay: 0.0005
########################################################################
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
#lr_policy: "step"
#gamma: 0.1
#stepsize: 20000
########################################################################
# The maximum number of iterations
max_iter: 100000
########################################################################
# Snapshot intermediate results
snapshot: 2000
snapshot_prefix: "neuraltissue_sk_2out"
########################################################################
# Display every 'display' iterations
display: 5
########################################################################
train {
  # solverstate: "neuraltissue_sk_2out_iter_16000.solverstate"
  solver: "neuraltissue_solver.prototxt"
  input {
    padding_size: 102
    patch_size: 64
    channels: 3
    labels: 3
    batch_size: 1
    raw_images: "train/raw2"
    label_images: "train/gt2"
    preprocessor {
      normalization: true
      rotation: true
      mirror: true
      clahe {
        clip: 4.0
      }
      crop {
        imagecrop: 1
        labelcrop: 0
      }
      histeq {
        patch_prior: true
        masking: false
      }
    }
  }
}
process {
  process_net: "neuraltissue_net.prototxt"
  # caffemodel: "neuraltissue_sk_2out_iter_100000.caffemodel"
  input {
    padding_size: 102
    patch_size: 128
    channels: 3
    labels: 2
    batch_size: 1
    raw_images: "input"
    preprocessor {
      normalization: true
      clahe {
        clip: 4.0
      }
      crop {
        imagecrop: 1
        labelcrop: 0
      }
    }
  }
  filter_output {
    output_filters: false
    output: "sk_filters"
  }
  output {
    format: "tif"
    fp32_out: false
    output: "output"
  }
}
and:
layer {
  include: { phase: TRAIN }
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip3"
  bottom: "label"
  loss_param {
    ignore_label: 2
    normalize: true
  }
}
Now you'll see a loss of 0 if it only sees label 2. That could also mess a bit with the momentum and how fast it converges, so I'd still keep the histogram equalization (without masking) on.
Now it's up to you to get it to train nicely, which might not be easy.
Thanks, that's great! Yes, I see, about the 0 loss. I will give it a try and let you know. Thanks again!
One more question: it seems strange to me that the histogram equalization block is only in the training part. Does it histogram-equalize the input patch? Because then I would expect to see it in the prediction part as well.
It only prioritizes the picking of training patches during training. Histogram equalization and masking only affect which kinds of errors the network sees, and how often, during SGD.
In prediction mode, it will just go through the patches linearly and give a prediction for each pixel.
I think it is working now, thanks.
What I notice is that the output probabilities are very 'thick' (in my case I have synapses). Typically I would expect to see a transition zone from the border of the synapse to the outside, with decaying probability. Have you experienced this issue as well? Also, is it possible to output the input to the last sigmoid instead of its probability output? In a two-class problem one typically plots the classifier score before the sigmoid, as the sigmoid may squeeze things too much (especially with unbalanced training data like this).
Yes, that might happen.
You can just remove the Softmax/Sigmoid at the end, which is labelled with a phase: TEST include clause at the end of the network. The tool should then pick up the pre-softmax output.
Remove this block in the network prototxt:
layer {
  include: { phase: TEST }
  name: "prob"
  type: "Softmax"
  bottom: "ip3"
  top: "prob"
}
Great, that works.
I talked to Jan and he showed me your results for the dataset at https://github.com/unidesigner/groundtruth-drosophila-vnc , which look very good.
Do you have the parameters you used to train that network? And approximately how many iterations?
The best results I have on ISBI 2012 now are with the SK network, trained with Softmax + Malis loss (10'000 iterations each). The previously best results were SK with Softmax loss only; approximately 30'000 iterations is what we used. I think it's about the same for the other dataset. But Malis can only be used properly if label 0 is background that separates objects.
Thanks. I did some tests, and performance is still a bit low. I was trying to look at the learned filters, to see if something weird is happening, but I am getting an error while loading the model and parameters in Python:
IndexError Traceback (most recent call last)
<ipython-input-4-5f4574a759ce> in <module>()
----> 1 net = caffe.Classifier("net_sk_2out/neuraltissue_net.prototxt", "hipp/neuraltissue_sk_2out_iter_200000.caffemodel" )
/home/cjbecker/filer/jan-caffe/caffe/python/caffe/classifier.pyc in __init__(self, model_file, pretrained_file, image_dims, mean, input_scale, raw_scale, channel_swap)
27
28 # configure pre-processing
---> 29 in_ = self.inputs[0]
30 self.transformer = caffe.io.Transformer(
31 {in_: self.blobs[in_].data.shape})
IndexError: list index out of range
I suppose this is an issue with patch size and input layers, but I am not sure how to solve it.
Do you want to see the individual filtered stages after each layer, or the filter kernels themselves? The first can be done in the tool itself (the output_filters parameter). The second is not implemented (yet).
In any case, you cannot do forward/backward processing with the Python interface; the memory data interface used in the networks currently only works with the C++ interface. You'd have to use Python data layers to do so, and change the networks.
I meant showing the filter kernels, as they would look like noise if there were an issue during learning. If there were a way to extract the coefficients from the model file, that would be enough.
I can also try to look at the output of the first layer; that could give me a hint. This is done by removing or commenting out the other layers, right?
No, you can output everything during processing by using:
filter_output {
  output_filters: true
  output: "sk_filters"
}
inside the process {} block of train_process_sk_2.prototxt
It slows everything down though, as this writes many thousands of images to the hard drive.
ah great, thanks ;)
Another observation from looking at the test output: I think normalization is not working very well, because, depending on the slice, there is a shift in the output of the network, which causes the output to vary too much between consecutive slices.
If normalization is disabled, the code here https://github.com/naibaf7/caffe_neural_tool/blob/31c5a2c635062e9e0887719f20a77f674ee9c709/src/image_processor.cpp#L70 doesn't map the pixels to between -1 and +1, but to between 0 and 1 instead, right? I am going to play with this, because the variations in the output are significant with the current min/max scheme.
Yes, you should definitely check what min/max normalization and CLAHE do to your data, and whether they help or hurt the classification.
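If you want to test without them, a rough sketch of the preprocessor with both switched off (assuming the normalization flag can simply be set to false; the clahe block is just removed):
preprocessor {
  normalization: false   # assumption: disables the min/max scaling
  rotation: true
  mirror: true
  # clahe block removed entirely
  crop {
    imagecrop: 1
    labelcrop: 0
  }
}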
For reference: Currently, the ISBI dataset with CLAHE and Min/Max normalization scores 12th on my implementation: http://brainiac2.mit.edu/isbi_challenge/leaders-board (the INI entry)
Great, thanks.
I tried it on my dataset (synapses) and I get OK results, though much worse than a random forest trained on some custom simple features. When testing on the training images, I see negatives that are misclassified as positives, probably because they have little chance of appearing during SGD. Have you ever tried hard negative mining? Or is there a way to favor sampling certain samples (or regions) more than others?
You'd probably have to change to minibatch training instead of patch training. As it is currently, to save computation and speed things up, it always trains SK with 64 by 64 pixels that are not sampled i.i.d. but correlate very strongly (they lie in the same local patch). Even histogram equalization (patch prior and masking) cannot correct this issue completely.
Depending on the dataset this gives worse results, as you noticed, and in that case you need to prepare an HDF5 dataset with the samples as you wish them to be picked (i.i.d. from a distribution that you think will result in better training). Then you need to swap out the MemoryDataLayer for an HDF5 data layer (see the sketch below), change the input size to 102x102, the label size to 1x1 and the batch size to 256, and train it without my tool (which is made for patch training rather than minibatch training).
Afterwards, processing can happen again with a patch label size of 128x128, and the speedup applies again. The results will be numerically identical to batch processing.
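For reference, a minimal sketch of what the swapped-in data layer could look like; the layer/top names and the HDF5 list file are assumptions, only the sizes and batch size follow the description above (the 102x102 patches and 1x1 labels would be prepared in the HDF5 files themselves):
layer {
  name: "data"
  type: "HDF5Data"                  # replaces the MemoryData layer used by the tool
  top: "data"                       # 102x102 input patches, sampled i.i.d. when building the HDF5 file
  top: "label"                      # 1x1 labels
  include: { phase: TRAIN }
  hdf5_data_param {
    source: "train_h5_list.txt"     # hypothetical text file listing the HDF5 files
    batch_size: 256
  }
}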
I see. But it would also be possible to modify the histeq module and the patch_prior to weight certain patches more than others. Then I could have an image with a weight per pixel that 'guides' learning; I think that could be a first approximation to achieve this.
Yes, there are actually many things I would have liked to include in the tool but currently do not have time to program.
One issue with additional weight maps is that they would have to be passed into the Caffe library through an additional MemoryDataLayer, and the SoftmaxLoss would have to be modified to accept such a map as an additional bottom blob (sketched below).
Minibatch support would be the easiest to implement without having to change too much.
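Purely as an illustration of that hypothetical weight-map modification (none of this exists in the tool or in stock Caffe; the "weights" blob and the modified loss behavior are assumptions), the loss layer would roughly gain a third bottom:
layer {
  include: { phase: TRAIN }
  name: "loss"
  type: "SoftmaxWithLoss"    # would need to be modified to consume the extra weight blob
  bottom: "ip3"
  bottom: "label"
  bottom: "weights"          # assumed extra MemoryData output carrying per-pixel weights
  loss_param {
    ignore_label: 2
    normalize: true
  }
}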
Hi there. I've finished a few experiments; thanks for all the support.
I managed to implement loading the external 'weight map' for sampling, which helped a bit in some cases.
In terms of performance, the CNNs do much worse than our approach, even when trained on the largest training set we have. I don't think this is an issue with your implementation, as we saw the same when using Caffe directly. To get better performance, I think we need to add some prior to the network.
Just to be sure: the sk_2 network we talked about, included in the examples, has an 'equivalent patch size' or 'context' of 100x100 pixels, right? I mean, it would be equivalent to running a per-patch trained CNN whose input layer is 100x100.
Okay, interesting... the context is 102 by 102.
Hi, first thanks for this great software :)
I am trying to run the examples, but I am getting runtime errors, namely:
I suppose there is an issue with cuBLAS, but I am not sure how to fix it. I am linking against ViennaCL, and I am not sure where cuBLAS comes from.
Do you have any suggestions about what to try next? Thanks!