meijieru / crnn.pytorch

Convolutional recurrent network in pytorch
MIT License

Why preds variable contains negative values? #40

Closed: hellbago closed this issue 6 years ago

hellbago commented 7 years ago

I have looked at the values of the variable preds after executing "preds = model(image)" in demo.py. The values are the following:

Variable containing:
(0 ,.,.) =
-106.3455 -115.3943 -114.5584 ... -115.6788 -110.2145 -112.2794

(1 ,.,.) =
-67.3953 -92.3248 -92.7227 ... -88.7459 -81.5368 -88.8212

(2 ,.,.) =
-56.8008 -89.8197 -92.8852 ... -85.0180 -77.4713 -85.3732
...

(35,.,.) =
-38.8606 -79.6700 -81.3100 ... -71.8229 -57.3992 -68.8093

(36,.,.) =
-39.6410 -75.7699 -75.7648 ... -70.3662 -55.7602 -68.3655

(37,.,.) =
-45.2289 -77.6819 -77.0921 ... -73.9425 -59.1527 -70.6211
[torch.cuda.FloatTensor of size 38x1x37 (GPU 0)]

As far as I understand, these values should represent a sequence of probabilities over the classes, so I'm wondering why they are negative, like the ones reported here. @meijieru

meijieru commented 7 years ago

They are logits instead of probabilities; softmax should be applied to get probabilities.
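(For illustration, a minimal sketch of applying softmax over the class dimension of preds, assuming the 38x1x37 output discussed above and a PyTorch version where F.softmax takes a dim argument:)

    import torch.nn.functional as F

    preds = model(image)              # seqLen x batch x nClasses, e.g. 38 x 1 x 37
    probs = F.softmax(preds, dim=2)   # probabilities over the 37 classes at each step
    # softmax is monotonic, so the argmax (and hence the decoded string) is unchanged:
    # probs.max(2)[1] equals preds.max(2)[1]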

ahmedmazari-dhatim commented 7 years ago

Hey @hellbago,

If I understand correctly, for each prediction, say

a----v--a-i-l-a-bb-l-e-- => available

when we print(preds) we get the following vector:


(0 ,.,.) =
-106.3455 -115.3943 -114.5584 ... -115.6788 -110.2145 -112.2794

(1 ,.,.) =
-67.3953 -92.3248 -92.7227 ... -88.7459 -81.5368 -88.8212

(2 ,.,.) =
-56.8008 -89.8197 -92.8852 ... -85.0180 -77.4713 -85.3732
...

(35,.,.) =
-38.8606 -79.6700 -81.3100 ... -71.8229 -57.3992 -68.8093

(36,.,.) =
-39.6410 -75.7699 -75.7648 ... -70.3662 -55.7602 -68.3655

(37,.,.) =
-45.2289 -77.6819 -77.0921 ... -73.9425 -59.1527 -70.6211
[torch.cuda.FloatTensor of size 38x1x37 (GPU 0)]

where 37 is the length of `alphabet = "0123456789abcdefghijklmnopqrstuvwxyz"` plus the blank.
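(As a small sketch of the index layout this implies, with the blank at index 0 as in this repo's strLabelConverter:)

    alphabet = '0123456789abcdefghijklmnopqrstuvwxyz'
    n_classes = len(alphabet) + 1                 # 36 characters + 1 blank = 37
    char_to_index = {c: i + 1 for i, c in enumerate(alphabet)}
    # e.g. char_to_index['a'] == 11, char_to_index['v'] == 32 (0-based; blank = 0)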

My questions are as follows:

Since the prediction is "available", with length("available") = 9:

1) What do the 38 vectors above represent for the prediction "available"?

2) I don't understand the dimension [torch.cuda.FloatTensor of size 38x1x37 (GPU 0)]. What is 38? How should I read 38x1x37?

Thanks a lot @hellbago for your comment and answer.

ahmedmazari-dhatim commented 7 years ago

Hi @meijieru, the values of the vectors:

(0 ,.,.) = -106.3455 -115.3943 -114.5584 ... -115.6788 -110.2145 -112.2794
.
.
.
(37,.,.) = -45.2289 -77.6819 -77.0921 ... -73.9425 -59.1527 -70.6211

represent the output of the B-LSTM, don't they? If yes, then these values are logits (inverse of sigmoid). The CTC layer takes these values and applies softmax to get probabilities. However, I can't find where I can print these probabilities from the CTC class:

class _CTC(Function):
    def forward(self, acts, labels, act_lens, label_lens):
        is_cuda = True if acts.is_cuda else False
        acts = acts.contiguous()
        loss_func = warp_ctc.gpu_ctc if is_cuda else warp_ctc.cpu_ctc
        grads = torch.zeros(acts.size()).type_as(acts)
        minibatch_size = acts.size(1)
        costs = torch.zeros(minibatch_size)
        loss_func(acts,
                  grads,
                  labels,
                  label_lens,
                  act_lens,
                  minibatch_size,
                  costs)
        self.grads = grads
        self.costs = torch.FloatTensor([costs.sum()])
        return self.costs

    def backward(self, grad_output):
        return self.grads, None, None, None

class CTCLoss(Module):
    def __init__(self):
        super(CTCLoss, self).__init__()

    def forward(self, acts, labels, act_lens, label_lens):
        """
        acts: Tensor of (seqLength x batch x outputDim) containing output from network
        labels: 1 dimensional Tensor containing all the targets of the batch in one sequence
        act_lens: Tensor of size (batch) containing size of each output sequence from the network
        label_lens: Tensor of (batch) containing label length of each example
        """
        _assert_no_grad(labels)
        _assert_no_grad(act_lens)
        _assert_no_grad(label_lens)
        return _CTC()(acts, labels, act_lens, label_lens)
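
(A hypothetical usage sketch of this CTCLoss with toy shapes matching the docstring above; warp-ctc applies softmax to acts internally, so acts are raw network outputs. The label indices follow the blank-at-0 layout discussed earlier.)

    import torch
    from torch.autograd import Variable

    ctc = CTCLoss()
    acts = Variable(torch.randn(26, 1, 37))      # seqLength x batch x outputDim, raw logits
    # "available" encoded with blank = 0, '0'-'9' = 1..10, 'a' = 11, ...:
    labels = Variable(torch.IntTensor([11, 32, 11, 19, 22, 11, 12, 22, 15]))
    act_lens = Variable(torch.IntTensor([26]))   # network output length per batch item
    label_lens = Variable(torch.IntTensor([9]))  # target length per batch item
    cost = ctc(acts, labels, act_lens, label_lens)
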
ahmedmazari-dhatim commented 7 years ago

Hey @hellbago , let me first thank you for your tag.

How did you get these discrete values when doing print(preds)? What do these discrete values represent? Your tensor is of length 26, but it is supposed to be 27: alphabet = 26 + blank.

Cheers

hellbago commented 7 years ago

Hi @ahmedmazari-dhatim. As @meijieru said in a previous comment, the values inside preds represent logits. Since they are negative, they correspond to probabilities less than 0.5. You can see these values by debugging the code in demo.py and inspecting the content of preds after the instruction 'preds = model(image)'. The tensor that I obtain has dimension 38x1x37: 37 is the dimension of the alphabet + 1 (blank), while 38 is the length of the sequence of feature vectors that are input to the recurrent layers. I obtain 38 and not the standard 26 because I rescale the image in a way that keeps the aspect ratio, so the width of the input image can be variable while the height is fixed to 32.
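(A quick, hypothetical way to check this width-to-sequence-length relation empirically; `model` is assumed to be the CRNN loaded in demo.py, and the exact mapping from width to sequence length depends on the conv/pool stack.)

    import torch
    from torch.autograd import Variable

    for width in [100, 148]:                             # 100 is the default demo width
        image = Variable(torch.randn(1, 1, 32, width))   # batch x channel x H(32) x W
        preds = model(image)                             # move to GPU first if the model is on GPU
        print(width, '->', tuple(preds.size()))          # e.g. 100 -> (26, 1, 37)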

ahmedmazari-dhatim commented 7 years ago

Hi @hellbago, thank you for your answer. 1) So 38 represents the output of the CNN and the input of the RNN; what is the dimension of your input image to the CNN?

2) I am stuck on understanding the meaning of the prediction vector. Let's say our model predicts the following:
a----v--a-i-l-a-bb-l-e-- => available

How can I read the values of the 38x1x37 tensor according to the predicted value "available"?

(0 ,.,.) =
-106.3455 -115.3943 -114.5584 ... -115.6788 -110.2145 -112.2794
.
.
(37,.,.) =
-45.2289 -77.6819 -77.0921 ... -73.9425 -59.1527 -70.6211

For instance, how can I read the first value of (0 ,.,.) = -106.3455 and of (37,.,.) = -45.2289 according to "available"? Are they the predicted values for each character?
a v a i l a b l e

Thank you again

random123user commented 7 years ago

Hi @ahmedmazari-dhatim,

I have only a little idea about the workings and implementation of CTCLoss, but from the comments of @meijieru and @hellbago this is what I inferred.

The recognised word is "a-----v--a-i-l-a-bb-l-e--- => available" and the alphabet is "0123456789abcdefghijklmnopqrstuvwxyz", with the first position being the blank "-".

The preds variable stores what looks like random data, but len(preds) is always 26, and so is len("a-----v--a-i-l-a-bb-l-e---"). This is true for every other word: THE RECOGNISED WORD LENGTH IS ALWAYS 26.

1) Now, have a look at preds[0][0]. It gives you 37 different numbers; the important one is the highest. For preds[0][0] the highest number is -88.9130, at the 12th position in this list (counting the blank as the first position). Hence the first character of the recognised word is the 12th character of "0123456789abcdefghijklmnopqrstuvwxyz" with the blank prepended, which is 'a'.

2) Consider the 7th character of "a-----v--a-i-l-a-bb-l-e---", which is 'v'. Have a look at preds[7][0]: the maximum is at index 32 (0-based, with the blank at index 0), i.e. the 32nd character of the alphabet not counting the blank, which is 'v'.
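(A minimal sketch of the greedy decoding just described, i.e. roughly what converter.decode(..., raw=False) does: take the argmax index at each time step, collapse repeats, and drop blanks. The helper name is hypothetical.)

    alphabet = '0123456789abcdefghijklmnopqrstuvwxyz'

    def greedy_decode(indices):
        """indices: argmax class index per time step (0 = blank)."""
        out, prev = [], 0
        for idx in indices:
            if idx != 0 and idx != prev:   # drop blanks and collapse repeats
                out.append(alphabet[idx - 1])
            prev = idx
        return ''.join(out)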

To get the probabilities of each recognised character ("a","v","a","i","l","a","b","l","e") I made the following changes in my demo.py: I calculated the softmax and multiplied it by a large constant to make the small values easier to read. Note that what I print is a scaled probability value, not the true probability.

    m = torch.nn.Softmax()       # old API: softmax over dim 1 of a 2D input

    model.eval()
    preds = model(image)         # seqLen x batch x nClasses, e.g. 26 x 1 x 37
    temp = preds                 # keep the raw scores for the probability pass below
    _, preds = preds.max(2)      # argmax class index at each time step

    preds = preds.squeeze(2)
    preds = preds.transpose(1, 0).contiguous().view(-1)

    preds_size = Variable(torch.IntTensor([preds.size(0)]))
    raw_pred = converter.decode(preds.data, preds_size.data, raw=True)
    sim_pred = converter.decode(preds.data, preds_size.data, raw=False)
    print('%-20s => %-20s' % (raw_pred, sim_pred))

    print('after dict - ' + spellchecker.suggest(sim_pred)[0])

    arr = preds.data.numpy()
    for i in range(0, len(temp)):
        if arr[i] != 0:                            # skip blank positions
            prob = torch.max(m(temp[i]) * 100000)
            print(prob)

But I am getting very high probability values. Can you please tell me whether my approach is correct?

ahmedmazari-dhatim commented 7 years ago

Hi @random123user, thanks a lot for your answer. I'll give your code a try and let you know.

I have a question for you:

We have len("a-----v--a-i-l-a-bb-l-e---") = 26, and preds gives 37 different numbers at each step, the important one being the highest. Then from preds[0, :] up to preds[25, :] it returns the values (the highest in each vector) that map to "a-----v--a-i-l-a-bb-l-e---".
1) What about the remaining preds[26, :] up to preds[36, :]? The length of the word is 26. @meijieru
2) You said that "len(preds) is always 26"; I am not sure about that. I can have a word of length 35, 42, 50, whatever. How do you deal with words of these lengths?

Thanks a lot

random123user commented 7 years ago

Hey @ahmedmazari-dhatim ,

Thanks for the reply. I really have no idea what will happen if the length of the word increases beyond 26.

I am trying to use this code to improve accuracy in ICDAR 2015 recognition dataset. But, the data set contains some vertical and inverted words. So, for inverted text, my initial approach was to get the recognition confidence value in two possible rotations (0 deg and 180 deg), compare them and determine the correct orientation.

But since the confidence values are close, I am not able to find any comparison criterion in them. Sorry for asking a different question than the one discussed, but can you please tell me if there is any way to do this comparison? Or is there any other repository that deals with the problem of vertical/inverted text recognition?
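(A hypothetical sketch of the comparison described above: score the crop at 0 and 180 degrees with the mean per-step max softmax probability and keep the better orientation. Assumes a PyTorch version where F.softmax takes a dim argument.)

    import torch.nn.functional as F

    def confidence(preds):
        """preds: seqLen x 1 x nClasses raw scores -> mean max probability."""
        probs = F.softmax(preds.squeeze(1), dim=1)   # softmax over the classes
        return probs.max(1)[0].mean()

    # conf_0 = confidence(model(image))             # original orientation
    # conf_180 = confidence(model(image_rot180))    # image rotated by 180 degrees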

ahmedmazari-dhatim commented 7 years ago

Hey @random123user @hellbago @meijieru ,

Starting from your question, I think the most trivial way to do that is to apply a rotation so as to get the sequence in horizontal format, respecting the height = 32. Otherwise, you have to adapt the CRNN to a variable height.

Coming back to my first question: even if the length of the word is less than or equal to 26, I'm wondering what the remaining preds[26, :] up to preds[36, :] represent.

My question relates to the following part of your answer:

> Now, have a look at preds[0][0]. It gives you 37 different numbers; the important one is the highest. For preds[0][0] the highest number is -88.9130, at the 12th position (counting the blank as the first position), so the first character of the recognised word is 'a'.
>
> Consider the 7th character of "a-----v--a-i-l-a-bb-l-e---", which is 'v'. For preds[7][0] the maximum is at index 32 (0-based, with the blank at index 0), so the 7th character of the recognised string is 'v'.

I would be very grateful if you could add extra information and correct me if I'm wrong.

Thank you

ahmedmazari-dhatim commented 7 years ago

Hi @meijieru ,

Why are the logits all negative? It would mean that the highest probability value we can get is 0.5, which corresponds to logit = 0. Isn't 0.5 too low for the most probable sequence?

@hellbago, @random123user: for CTCLoss, please look at this tutorial. It explains well how CTC works:

https://github.com/SeanNaren/warp-ctc/blob/pytorch_bindings/torch_binding/TUTORIAL.md

Thank you

ahmedmazari-dhatim commented 7 years ago

Hi @random123user,

To answer your question about why you got high probabilities: I tried your code and got the following error. What's wrong with my model?


    m = torch.nn.Softmax()
    model.eval()
    preds = model(image)
    temps = preds.cpu()
    prob = torch.max(m(temps) * 100)   # fails: temps is 3D (seqLen x batch x nClasses)

The error raised on the prob line:

assert input.dim() == 2, 'Softmax requires a 2D tensor as input'
AssertionError: Softmax requires a 2D tensor as input

Or did you use:

    preds = model(image)
    preds = preds[:, 0, :]   # to get a 2D tensor?
    temps = preds
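
(A sketch of one way around the 2D restriction of the old nn.Softmax, following the slicing above: with the batch dimension removed, each row is one time step and softmax normalizes over the 37 classes.)

    m = torch.nn.Softmax()       # old API: requires a 2D input
    preds = model(image)         # seqLen x 1 x nClasses
    scores = preds[:, 0, :]      # 2D: seqLen x nClasses
    probs = m(scores)            # row-wise softmax over the classes
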
rohun-tripathi commented 6 years ago

By the way, where is the value that controls the number of output labels? For clarification, I understand that the alphabet has 37 labels. I am curious why the system gives 26 outputs in the default configuration.

wanhaipeng commented 6 years ago

@hellbago why is your preds dim 38x1x37? Mine is 26x1x37.