kzwkt / wnd-charm

Automatically exported from code.google.com/p/wnd-charm

Unexpected error when testing with "continuous" dataset #29

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
Another one from Jimmy:

Hi Chris,

      Sorry to bother you again. Training seems to be working fine (I'm training "continuously," with a numeric value for every image). After I generate the training file though, I can't seem to test it. The error looks something like "class labels are purely numeric... max balanced training 1... no images left for testing." It looks like the program thinks each image is a different class (because the labels are numeric), and therefore has only one image in each class. I get this error even with the -C option (interpolation).
      Do you know what might be going on? It could be that I'm just using the commands wrong. Right now I have a single folder that I generate a single .fit file from. Then I try to do wndchrm test -iNjNnNC... on this file.
      Thanks again for your help!

Best,
Jimmy

Original issue reported on code.google.com by christop...@nih.gov on 21 Apr 2011 at 5:54

GoogleCodeExporter commented 9 years ago
Thank you for helping out.

I have attached a "testset" on which I created a .fit file. I am training with 
a txt file that contains file names associated with numeric values, so I'm 
hoping that in testing, wnd-charm can do some interpolation with the values.
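
For reference, the list file pairs each image with its value, one per line, 
roughly like this (the paths below are made up, and the exact separator your 
wndchrm version expects may differ, so treat the layout as illustrative):

    /path/to/images/img001.tif    0.25
    /path/to/images/img002.tif    0.50
    /path/to/images/img003.tif    1.00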

The error I get seems to be that wnd-charm thinks each image is in its own 
discrete class (even when I use the -C option).

The testset was placed directly into the wndcharm directory. Everything is 
there except the images themselves (since these are confidential). But if you 
need them to reproduce the error, I can email you those images personally.

Thanks again!

Original comment by jimmyst...@gmail.com on 22 Apr 2011 at 3:52

GoogleCodeExporter commented 9 years ago
The -C option doesn't really work (maybe we should remove it from the 
documentation?)

The .txt file looks good.  wndchrm should print out a summary of your dataset - 
does that look right?
It will put each image in a separate class by value, then, because each class 
name is numeric, it will automatically compute a Pearson correlation between 
each image's predicted value and the value in the file.

Because you have only one image per class, you can't really do a standard 
classification test.  The way to get around this is to provide a separate test 
file.

You can test this dataset against itself to see what kind of correlation you 
get.  It's almost equivalent to doing a leave-one-out classification.  The 
difference is that the image being tested will have contributed to the weights 
used in the classifier feature space, but the "collision" will be detected 
between the test image and its corresponding training image, and the training 
image's contribution to the marginal probability for the test image will be 
ignored (because it's infinite).

To do this test, just use the same file (the .txt or a .fit made from the .txt 
using train) in a wndchrm classify command.  This will use all of the images 
for training, and all of the (same) images for testing.  You still have to 
specify two files, so just use the same file twice.
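
Concretely, the self-test would look something like this (the file names here 
are placeholders):

    # build a .fit from your list of images and values
    wndchrm train mydata.txt mydata.fit
    # self-test: the same file serves as both <train set> and <test set>
    wndchrm classify mydata.fit mydata.fit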

If the results look promising for the self-test, to do a real cross-validation, 
you would set up two .txt files with two separate non-overlapping sets of 
images.  Then use one of the sets as your <train set>, and the other one as 
your <test set>.  Using the "test" command will let you train and test using 
randomly selected subsets of your files (splits, we call them).  Using 
"classify" will use all of the images in <train set> for training, and all of 
the images in <test set> for testing.  If you specify image sub-sets (using -i 
and/or -j), the sub-sets will be picked randomly by "test", but in file-order 
by "classify".

Original comment by i...@cathilya.org on 22 Apr 2011 at 4:55

GoogleCodeExporter commented 9 years ago
Jimmy,

The -C option really should be removed, because in this branch (1.30) there is 
nothing "continuous" about how wndchrm generates an interpolated score. As it 
currently works, each image with its own value assigned to it is placed into 
its own discrete class. If more than one image shares a certain value, those 
images get lumped together into the same class. wndchrm then calculates Fisher 
weights, which emphasize differentiation among image classes, for use in a WND 
(Weighted Neighbor Distance) classifier. However, for a continuous dataset 
such as yours, Jimmy, the correct weights to use are Pearson weights. There is 
code to do this in the 1.30 branch, but it is commented out. And even if you 
uncommented it, you would still be classifying test images against discrete 
image classes and then interpolating a score from the results of the 
classification.

There has been talk in our group for a long time now about the need for pure 
interpolation functionality using some form of linear regression, rather than 
classification over discrete classes. We're not there yet, but we're getting 
there. The functionality you're looking for will not be added to the 1.30 
branch, but soon Ilya will check the Pearson weight functionality into the 
trunk, and you'll be able to check out that source and compile it just as 
easily as you would download an RC tarball. At least that will get you part of 
the way there. We'll let you know when that's been done.
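
For the record, the distinction between the two weighting schemes is roughly 
this (my shorthand, not the exact expressions from the wndchrm source):

    \[
      W_{\mathrm{Fisher}}(f) = \frac{\operatorname{Var}_c(\bar f_c)}
                                    {\tfrac{1}{C}\sum_{c=1}^{C} \sigma_{f,c}^2},
      \qquad
      W_{\mathrm{Pearson}}(f) = \bigl|\, r(f_i, v_i) \,\bigr|
    \]

Here \bar f_c and \sigma_{f,c}^2 are the mean and variance of feature f within 
class c, C is the number of classes, and r is the Pearson correlation between 
the per-image feature values f_i and their numeric values v_i. Fisher weights 
reward features whose class means separate well relative to their within-class 
scatter; Pearson weights reward features that track the continuous value 
directly, which is what a dataset like yours calls for.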

Original comment by christop...@nih.gov on 22 Apr 2011 at 5:30

GoogleCodeExporter commented 9 years ago
To clarify, wndchrm 1.30 in its present form will in fact give you 
continuous-value predictions for a dataset such as yours.  It does this in a 
roundabout way, which is arguably less biased than doing it more directly with 
Pearson weights rather than Fisher weights.

Your images will be assigned to discrete classes for training.  Your test 
images will be classified into these discrete training classes.  Additionally, 
because your class labels are numeric, it will interpolate a continuous value 
for each test image and report it in the HTML report (along with the Pearson 
correlation and P-value of its success in doing so).  This technique of 
interpolation has given us results that we have shown have an underlying 
molecular basis through gene expression, as well as other independent imaging 
assays.  It's probably less sensitive than what you could get with a continuous 
classification approach, but it does have the advantage that it is less 
"forced" to give you the answer you want.  Plus, it's quite sensitive already.

So don't wait for continuous classification to appear in wndchrm - there's 
plenty there right now to explore with interpolating continuous scores for your 
images.

Original comment by i...@cathilya.org on 22 Apr 2011 at 5:53

GoogleCodeExporter commented 9 years ago
That makes sense. For now I think I will split my images into discrete classes 
for training. But the self-correlation trick during testing sounds like an easy 
way to roughly assess how accurate the predictions are.

Thanks for all the help!

Original comment by jimmyst...@gmail.com on 22 Apr 2011 at 11:27

GoogleCodeExporter commented 9 years ago
I'm going to close this issue by commenting out the -C option in the help 
message.

Original comment by i...@cathilya.org on 26 Apr 2011 at 5:24