Hi @tinaboya2023, thanks for your reply. Of course, the metric used for evaluation is not correct, since at the time I implemented it I had no clue about metric evaluation.
I have found ANLS and will include it in the scripts shortly; right now, you know, life is just getting in the way. Could you help me with the scripts? I wanted to ask whether the ANLS code below is correct.
```python
## ANLS Calculations
## Ref: https://github.com/huggingface/evaluate/pull/413/files
import datasets
import evaluate
from Levenshtein import ratio  # string similarity in [0, 1] (python-Levenshtein)

_CITATION = """\
@article{levenshtein1966binary,
  title   = {Binary codes capable of correcting deletions, insertions, and reversals},
  journal = {Soviet physics doklady},
  volume  = {10},
  number  = {8},
  pages   = {707--710},
  year    = {1966},
  url     = {https://nymity.ch/sybilhunting/pdf/Levenshtein1966a.pdf},
  author  = {V. I. Levenshtein},
}
"""

_DESCRIPTION = """\
ANLS refers to the Average Normalized Levenshtein Similarity.
"""

_KWARGS_DESCRIPTION = """
Computes Average Normalized Levenshtein Similarity (ANLS).
Args:
    predictions: List of question-answer dictionaries with the following key-values:
        - 'question_id': id of the question-answer pair as given in the references (see below)
        - 'prediction_text': the text of the answer
    references: List of question-answer dictionaries with the following key-values:
        - 'question_id': id of the question-answer pair (see above)
        - 'answers': list of possible texts for the answer, as a list of strings
Returns:
    'anls': The ANLS score of predicted tokens versus the gold answer
Examples:
    >>> predictions = [{'prediction_text': 'Denver Broncos', 'question_id': '56e10a3be3433e1400422b22'}]
    >>> references = [{'answers': ['Denver Broncos', 'Denver R. Broncos'], 'question_id': '56e10a3be3433e1400422b22'}]
    >>> anls_metric = evaluate.load("anls")
    >>> results = anls_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'anls_score': 100.0}
"""


def compute_score(predictions, ground_truths):
    theta = 0.5
    anls_score = 0
    count = 0
    for qid, prediction in predictions.items():
        max_value = 0
        if qid in ground_truths:
            for x in ground_truths[qid]:
                nl = 1 - ratio(prediction, x)
                if nl < theta:
                    score = 1 - nl
                    if score > max_value:
                        max_value = score
            count += 1
            anls_score += max_value
    return anls_score / count


class Anls(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": {
                        "question_id": datasets.Value("string"),
                        "prediction_text": datasets.Value("string"),
                    },
                    "references": {
                        "question_id": datasets.Value("string"),
                        "answers": datasets.features.Sequence(datasets.Value("string")),
                    },
                }
            ),
        )

    def _compute(self, predictions, references):
        ground_truths = {x["question_id"]: x["answers"] for x in references}
        predictions = {x["question_id"]: x["prediction_text"] for x in predictions}
        anls_score = compute_score(predictions=predictions, ground_truths=ground_truths)
        return {"anls_score": anls_score}
```
Hi again, I'm so sorry for the delayed reply. Yes, of course. I think that if you added ANLS accuracy to your source code, your accuracy could be better. About your source, I think you don't account for needing 3 matching answers out of the 10 answers. Maybe I'm wrong.
But I think that, following M4C, you should add `EvalAIAnswerProcessor`, and then you can use this metric through the `TextVQAAccuracy` class. I used this code in your source code, but my problem is that I don't know how to convert the predicted answer and ground-truth answer in your code from tensors to strings (because I think `TextVQAAccuracy` needs them as strings). I wrote my changes to the code below and got some errors. Could you help me correct them? Of course, I may be wrong in general.
```python
from m4c_evaluators import TextVQAAccuracyEvaluator

evaluator = TextVQAAccuracyEvaluator()


class LaTrForVQA(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        answer_vector = self.forward(batch)

        label2id = {}
        for ans in tqdm(answer_vector):
            for word in ans.split(" "):
                if word not in label2id:
                    label2id[word] = {'id': current_word_id, 'count': 1}
                    current_word_id += 1
                else:
                    label2id[word]['count'] += 1
        id2label = ["" for _ in range(current_word_id)]
        convert_token_to_answer(batch['answer'], id2label)

        predictions = []
        for (pred, gt) in zip(answer_vector, batch['answer']):
            predictions.append({
                "pred_answer": pred.detach().cpu(),
                "gt_answers": gt.detach().cpu()
            })

        accuracy = evaluator.eval_pred_list(predictions)
        accuracy = torch.tensor(accuracy).cuda()

        loss = nn.CrossEntropyLoss()(answer_vector.reshape(-1, self.config['classes']), batch['answer'].reshape(-1))
        _, preds = torch.max(answer_vector, dim=-1)
        train_acc = accuracy(preds, batch['answer'])
        train_acc = torch.tensor(train_acc)

        self.log('train_ce_loss', loss, prog_bar=True)
        self.log('train_acc', train_acc, prog_bar=True)
        self.training_losses.append(loss.item())

        return loss
```
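For the tensor-to-string conversion I mean, maybe something like this could work (just a sketch on my side; it assumes a Hugging Face tokenizer such as the repo's T5 tokenizer, and that `batch['answer']` holds token ids):

```python
# Rough sketch (my guess, not tested): decode token-id tensors into strings before
# passing them to TextVQAAccuracyEvaluator.
def ids_to_answers(token_ids, tokenizer):
    """Convert a (batch, seq_len) tensor of token ids into a list of answer strings."""
    return [tokenizer.decode(ids, skip_special_tokens=True)
            for ids in token_ids.detach().cpu().tolist()]

# e.g. inside training_step:
# _, pred_ids = torch.max(answer_vector, dim=-1)
# predictions = [{"pred_answer": p, "gt_answers": [g]}
#                for p, g in zip(ids_to_answers(pred_ids, tokenizer),
#                                ids_to_answers(batch['answer'], tokenizer))]
```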
Hi @tinaboya2023, I tried to make some changes to the evaluation method according to your idea, but the effect was poor. Could you please help me check the problems in my method? Maybe I'm wrong.
```python
from m4c_evaluators import TextVQAAccuracyEvaluator

evaluator = TextVQAAccuracyEvaluator()
predictions = []


def calculate_acc_score(pred, gt):
    ## Function ignores the calculation of padding part
    ## Shape (seq_len, seq_len)
    mask = torch.clamp(gt, min=0, max=1)
    last_non_zero_argument = (mask != 0).nonzero()[1][-1]

    pred = pred[:last_non_zero_argument]
    gt = gt[:last_non_zero_argument]  ## Include all the arguments till the first padding index

    pred_answer = convert_token_to_ques(pred, tokenizer)
    gt_answer = convert_token_to_ques(gt, tokenizer)
    predictions.append({
        "pred_answer": pred_answer,
        "gt_answers": gt_answer,
    })


def calculate_metrics(self, prediction, labels):
    ## Calculate the accuracy score between the prediction and ground label for a batch, with considering the pad sequence
    batch_size = len(prediction)
    ac_score = 0

    for (pred, gt) in zip(prediction, labels):
        calculate_acc_score(pred.detach().cpu(), gt.detach().cpu())
    ac_score = evaluator.eval_pred_list(predictions)
    ac_score = torch.tensor(ac_score).cuda()
    return ac_score
```
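A per-batch variant of the same idea might look like this (only a sketch; it reuses `convert_token_to_ques`, `tokenizer`, and `evaluator` from above, keeps the prediction list local so entries from earlier batches don't accumulate, and wraps `gt_answers` in a list since the evaluator iterates over the ground-truth answers):

```python
# Sketch of a per-batch variant (assumes convert_token_to_ques, tokenizer and evaluator from above).
def calculate_metrics_per_batch(prediction, labels):
    batch_predictions = []  # local list, so earlier batches don't accumulate
    for pred, gt in zip(prediction, labels):
        pred, gt = pred.detach().cpu(), gt.detach().cpu()
        last_non_zero = int((gt != 0).nonzero()[-1])  # drop the padded tail
        batch_predictions.append({
            "pred_answer": convert_token_to_ques(pred[:last_non_zero], tokenizer),
            # the evaluator iterates over gt_answers, so pass a list of answer strings
            "gt_answers": [convert_token_to_ques(gt[:last_non_zero], tokenizer)],
        })
    return evaluator.eval_pred_list(batch_predictions)
```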
Hi @FrankZxShen and @tinaboya2023, sorry for the late reply. I have been a bit busy with some of my work.
I will be starting to modify this repository again, since there are a few things which I felt were missing. Of course, the main reason was that I had just been exploring how to implement the models, and had no idea about the evaluation criteria of TextVQA.
Here is what I am planning now. Currently, the entire approach is abstractive in nature, but I believe the authors took a generative approach, since it removes a lot of data preparation steps as well as the problem of answers not being present in the context:

> Our model is generative in nature and as such alleviates the problem of vocabulary reliance current methods suffer from

Earlier, I wasn't even sure how to use T5 for it, but now I have implemented a similar paper, and after going through T5's implementation from Hugging Face, I think I can improve it.
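To make that concrete, here is a minimal sketch of a generative question-answering step with T5 from Hugging Face (illustrative only, not this repo's code; the checkpoint name and the question/context strings are placeholders):

```python
# Sketch of a generative QA step with T5 (illustrative placeholders, not the repo's pipeline).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Encode the question and the OCR context in a single input sequence; the answer is the target text.
inputs = tokenizer("question: what team is shown? context: Denver Broncos 1998",
                   return_tensors="pt")
labels = tokenizer("Denver Broncos", return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss            # teacher-forced training loss
generated = model.generate(**inputs, max_length=16)   # free-running decoding at inference
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

The nice part of a generative formulation is that the answer does not need to come from a fixed vocabulary or appear verbatim in the context.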
I will also shortly try to add the pre-training script, as I was able to prepare a subset of the IDL dataset for pre-training. I know it is a bit of a big task, but I will try to add it.
And as mentioned, I am trying to learn along the way, and would really love to hear your suggestions and comments.
For the remark made by @tinaboya2023, I will have a look at the metric and let you know soon. I have been busy learning and studying, so everything got delayed a lot. Again, apologies, and looking forward to implementing the repo.
Hi @FrankZxShen, could I ask 2 questions?

1) When you run with these changes, do you encounter the error below? If yes, how did you manage it?
I changed `len(answers)` in m4c_evaluator.py but couldn't solve it.

```
AssertionError                            Traceback (most recent call last)
<ipython-input-38-324898bb4d9a> in <cell line: 1>()
----> 1 trainer.fit(model, datamodule)

14 frames
/content/m4c_evaluators.py in _compute_answer_scores(self, raw_answers)
    224
    225         answers = [self.answer_processor(a) for a in raw_answers]
--> 226         assert len(answers) == 10
    227         gt_answers = list(enumerate(answers))
    228         unique_answers = set(answers)

AssertionError:
```

2) And why do you use `convert_token_to_ques` instead of `convert_token_to_answer`, while they are different from each other? Maybe you would reach a better answer.
Hi @uakarsh
Thank you for your follow-up. I am very eager to see the modifications soon. It seems that if these changes are applied, the accuracy will be much higher.
Hi @tinaboya2023, for your 2nd question, I believe using `convert_token_to_ques` is the correct approach. The function `convert_token_to_answer` was actually used because I treated it as a classification problem and hence assigned a class to each word, which is incorrect. This happened because of my mistake in framing the problem, especially in the decoding section. Once I asked the authors, I got to know my mistakes.
Hey @tinaboya2023
For the first question, I commented out `# assert len(answers) == 10`, but I'm not sure if what I'm doing is correct. I'm still learning about TextVQA accuracy; there's something I don't understand.
Thank you for your work! @uakarsh I hope the problem can be resolved soon.
Hi @FrankZxShen
Thank you for your reply.
You know, for my first question, I encountered the error above. Do you still encounter this error too, or not?
For my second question, we have to keep in mind that we are dealing with 10 separate answers, not one answer or one question. Of course, I may be wrong.
Hi again @tinaboya2023
This is my change to `_compute_answer_scores`:
```python
def _compute_answer_scores(self, raw_answers):
    """
    compute the accuracy (soft score) of human answers
    """
    answers = [self.answer_processor(a) for a in raw_answers]
    # assert len(answers) == 10
    gt_answers = list(enumerate(answers))
    unique_answers = set(answers)
    unique_answer_scores = {}

    for unique_answer in unique_answers:
        accs = []
        for gt_answer in gt_answers:
            other_answers = [
                item for item in gt_answers if item != gt_answer
            ]
            matching_answers = [
                item for item in other_answers if item[1] == unique_answer
            ]
            acc = min(1, float(len(matching_answers)) / 3)
            accs.append(acc)
        unique_answer_scores[unique_answer] = sum(accs) / len(accs)

    return unique_answer_scores
```
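For what it's worth, here is a toy call showing the input this function expects (my illustration only, treating the snippet above as a standalone function and mimicking `answer_processor` with `str.lower`). `raw_answers` should be the list of 10 human answers for one question, which is why the original `assert len(answers) == 10` is there; if a single string slips in instead, the list comprehension iterates over its characters and the assert fails.

```python
from types import SimpleNamespace

# Toy illustration only: answer_processor is replaced by str.lower for the demo.
demo_self = SimpleNamespace(answer_processor=str.lower)
raw_answers = ["Denver Broncos"] * 8 + ["broncos"] * 2  # the 10 human answers for one question

print(_compute_answer_scores(demo_self, raw_answers))
# -> roughly {'denver broncos': 1.0, 'broncos': 0.6}
```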
Of course, we can wait for changes from the repo author. I'm not sure this is the right answer to your questions.
Hi Mr @FrankZxShen
I'm so sorry for the delayed reply; I had some problems.
Could I ask a question?
Why did you remove `len(answers) == 10`, while it is crucial information for us?
Hi @tinaboya2023 Thank you for your reply. I'm very sorry. It should be my mistake. I don't know how to modify it :(
Hi @FrankZxShen
You know, I ran this code with your changes and the accuracy became better than before.
But I think we are making a mistake and ignoring one fact: an answer only counts as fully correct when at least 3 of the 10 ground-truth answers match it, and we have no function in the code to count how many ground-truth answers are matched.
I think we should have something like `model_output["scores"]` in metrics.py to find the number of correct answers.
Maybe Mr @uakarsh understands better what I mean and can explain it to us.
I had the same question. As you can see here, in the last part of the notebook there are multiple answers, and I am not sure how to handle them. I figured out encoding the context and question in a single sentence (since there is a need to include both the question and the context); however, I am still not sure how to encode the answer.
By the way, I realized that there is a need to rewrite a part of the codebase (since I complicated stuff earlier); however, I will complete it. I found some references here; maybe they can help with the evaluation part. Could you suggest any way in which we can encode the answer and evaluate it? Once that is clear, I can handle the other part.
Here is what I am thinking: take one of the answers from the given answers, as done here. Maybe that is good to go? I would be going through this repo to know more.
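For instance, a rough sketch of that idea (my illustration only):

```python
# Sketch: choose one training target from the list of human answers (illustrative only).
from collections import Counter

def pick_target_answer(answers):
    """Return the most common answer string, falling back to the first one."""
    if not answers:
        return ""
    most_common, count = Counter(answers).most_common(1)[0]
    return most_common if count > 1 else answers[0]

print(pick_target_answer(["denver broncos"] * 7 + ["broncos"] * 3))  # 'denver broncos'
```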
Any comments @tinaboya2023 @FrankZxShen?
Hi @FrankZxShen @tinaboya2023, I tried implementing the metrics (since the other parts have been completed). It is here. As @FrankZxShen mentioned, I commented out the `assert` part, but I am still not getting 100% accuracy (I tried giving the same label as both the ground truth and the predicted answer). The code is taken from here, and although it is the same as that of mmf, I believe there is some mistake in the implementation of the metric, which I will be looking into. Once this is done, I guess all the parts can be integrated easily to train and see the results.
Hi @uakarsh, @FrankZxShen, I'm so sorry for the delayed reply. Thank you so much for updating your source code; it makes more sense. But it seems the accuracy doesn't get better. I think it is because the vocab is not used. How could we use vocab files, like THIS function of this code does?
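For example, something like this is what I imagine (only a sketch; I'm assuming the vocab file has one answer per line, and the file name is a placeholder):

```python
# Sketch: load a fixed answer vocabulary (assumes one answer per line; file name is a placeholder).
def load_answer_vocab(path="textvqa_answer_vocab.txt"):
    with open(path, encoding="utf-8") as f:
        answers = [line.strip() for line in f if line.strip()]
    answer2id = {ans: i for i, ans in enumerate(answers)}
    return answers, answer2id

# id2answer, answer2id = load_answer_vocab()
```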
I have another question too, about following, step by step, what you wrote in `class LaTrForVQA(pl.LightningModule):` with `training_step` and `optimizer_step` in your previous code:
```python
## https://stackoverflow.com/questions/69899602/linear-decay-as-learning-rate-scheduler-pytorch
def polynomial(base_lr, iter, max_iter=1e5, power=1):
    return base_lr * ((1 - float(iter) / max_iter) ** power)


class LaTrForVQA(pl.LightningModule):
    def __init__(self, config, learning_rate=1e-4, max_steps=100000 // 2):
        super(LaTrForVQA, self).__init__()

        self.config = config
        self.save_hyperparameters()
        self.latr = LaTr_for_finetuning(config)
        self.training_losses = []
        self.validation_losses = []
        self.max_steps = max_steps

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.hparams['learning_rate'])

    def forward(self, batch_dict):
        boxes = batch_dict['boxes']
        img = batch_dict['img']
        question = batch_dict['question']
        words = batch_dict['tokenized_words']
        answer_vector = self.latr(lang_vect=words,
                                  spatial_vect=boxes,
                                  img_vect=img,
                                  quest_vect=question
                                  )
        return answer_vector

    def calculate_metrics(self, prediction, labels):
        ## Calculate the accuracy score between the prediction and ground label for a batch, with considering the pad sequence
        batch_size = len(prediction)
        ac_score = 0

        for (pred, gt) in zip(prediction, labels):
            ac_score += calculate_acc_score(pred.detach().cpu(), gt.detach().cpu())
        ac_score = ac_score / batch_size
        return ac_score

    def training_step(self, batch, batch_idx):
        answer_vector = self.forward(batch)

        ## https://discuss.huggingface.co/t/bertformaskedlm-s-loss-and-scores-how-the-loss-is-computed/607/2
        loss = nn.CrossEntropyLoss()(answer_vector.reshape(-1, self.config['classes']), batch['answer'].reshape(-1))
        _, preds = torch.max(answer_vector, dim=-1)

        ## Calculating the accuracy score
        train_acc = self.calculate_metrics(preds, batch['answer'])
        train_acc = torch.tensor(train_acc)

        ## Logging
        self.log('train_ce_loss', loss, prog_bar=True)
        self.log('train_acc', train_acc, prog_bar=True)
        self.training_losses.append(loss.item())

        return loss

    def validation_step(self, batch, batch_idx):
        logits = self.forward(batch)
        loss = nn.CrossEntropyLoss()(logits.reshape(-1, self.config['classes']), batch['answer'].reshape(-1))
        _, preds = torch.max(logits, dim=-1)

        ## Validation Accuracy
        val_acc = self.calculate_metrics(preds.cpu(), batch['answer'].cpu())
        val_acc = torch.tensor(val_acc)

        ## Logging
        self.log('val_ce_loss', loss, prog_bar=True)
        self.log('val_acc', val_acc, prog_bar=True)

        return {'val_loss': loss, 'val_acc': val_acc}

    ## For the fine-tuning stage, the warm-up period is set to 1,000 steps and then linearly decayed to zero, pg. 12 of the paper
    ## Refer here: https://github.com/Lightning-AI/lightning/issues/328#issuecomment-550114178
    def optimizer_step(self, epoch_nb, batch_nb, optimizer, optimizer_i, opt_closure=None, on_tpu=False,
                       using_native_amp=False, using_lbfgs=False):

        ## Warmup for 1000 steps
        if self.trainer.global_step < 1000:
            lr_scale = min(1., float(self.trainer.global_step + 1) / 1000.)
            for pg in optimizer.param_groups:
                pg['lr'] = lr_scale * self.hparams.learning_rate

        ## Linear Decay
        else:
            for pg in optimizer.param_groups:
                pg['lr'] = polynomial(self.hparams.learning_rate, self.trainer.global_step, max_iter=self.max_steps)

        optimizer.step(opt_closure)
        optimizer.zero_grad()
```
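(For context, this is roughly how I run it; `config` and `datamodule` here are placeholders for my own setup:)

```python
# Sketch: wiring the LightningModule into a trainer (config/datamodule are placeholders).
import pytorch_lightning as pl

model = LaTrForVQA(config, learning_rate=1e-4, max_steps=50000)
trainer = pl.Trainer(max_steps=model.max_steps)
trainer.fit(model, datamodule)  # the same call that appears in the traceback above
```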
So now some parts of this class probably need to change, for example `self.latr = LaTr_for_finetuning(config)`.
What change can be made in this part to see the result?
Hi @tinaboya2023, I did try to make a set of notebooks here: https://github.com/uakarsh/latr/tree/main/examples/new_textvqa
I think it should be helpful.
Hi again @uakarsh,
Thank you for your reply.
I think none of the files mention a vocab file, and none of them run `class LaTrForVQA(pl.LightningModule)` like your previous code did (in order to run TensorBoard); they only calculate accuracy. For example, how can I change the following line of code? (Or maybe I'm wrong.)
`self.latr = LaTr_for_finetuning(config)`
Hi, in the `calculate_acc_score` function it seems you calculate the evaluation with only a sum and an average, like `accuracy_score` in Python, but in fact for TextVQA maybe you should calculate the evaluation with the function below. Of course, maybe I'm wrong.

`Acc(ans) = min(ha / 3, 1)`, where `ha` is the number of the 10 human answers that match `ans`.
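A small sketch of that formula (my illustration, without the leave-one-out averaging that the mmf evaluator above adds):

```python
# Sketch of the TextVQA soft accuracy: Acc(ans) = min(ha / 3, 1),
# where ha is how many of the 10 human answers match the predicted answer.
def soft_accuracy(pred_answer, gt_answers):
    pred_answer = pred_answer.strip().lower()
    matches = sum(1 for a in gt_answers if a.strip().lower() == pred_answer)
    return min(matches / 3.0, 1.0)

gt = ["Denver Broncos"] * 8 + ["broncos"] * 2
print(soft_accuracy("denver broncos", gt))  # 1.0  (8 matches -> min(8/3, 1))
print(soft_accuracy("broncos", gt))         # 0.666... (2 matches -> 2/3)
```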