Hi @tinaboya2023, thanks for your reply. Of course, the metric used for evaluation is not correct, since at the time I implemented it I had no clue about metric evaluation.
I have found ANLS and will include it in the scripts shortly; right now, you know, life is just getting in the way. Could you help me with the scripts? I wanted to ask whether the ANLS code below is correct.
```python
## ANLS Calculations
## Ref: https://github.com/huggingface/evaluate/pull/413/files
import datasets
import evaluate
from Levenshtein import ratio  # string similarity in [0, 1] (python-Levenshtein)

_CITATION = """\
@article{levenshtein1966binary,
  title   = {Binary codes capable of correcting deletions, insertions, and reversals},
  journal = {Soviet physics doklady},
  volume  = {10},
  number  = {8},
  pages   = {707--710},
  year    = {1966},
  url     = {https://nymity.ch/sybilhunting/pdf/Levenshtein1966a.pdf},
  author  = {V. I. Levenshtein},
}
"""

_DESCRIPTION = """\
ANLS refers to the Average Normalized Levenshtein Similarity.
"""

_KWARGS_DESCRIPTION = """
Computes Average Normalized Levenshtein Similarity (ANLS).
Args:
    predictions: List of question-answer dictionaries with the following key-values:
        - 'question_id': id of the question-answer pair as given in the references (see below)
        - 'prediction_text': the text of the answer
    references: List of question-answer dictionaries with the following key-values:
        - 'question_id': id of the question-answer pair (see above)
        - 'answers': list of possible texts for the answer, as a list of strings
Returns:
    'anls': The ANLS score of predicted tokens versus the gold answer
Examples:
    >>> predictions = [{'prediction_text': 'Denver Broncos', 'question_id': '56e10a3be3433e1400422b22'}]
    >>> references = [{'answers': ['Denver Broncos', 'Denver R. Broncos'], 'question_id': '56e10a3be3433e1400422b22'}]
    >>> anls_metric = evaluate.load("anls")
    >>> results = anls_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'anls_score': 100.0}
"""


def compute_score(predictions, ground_truths):
    theta = 0.5
    anls_score = 0
    count = 0
    for qid, prediction in predictions.items():
        max_value = 0
        if qid in ground_truths:
            for x in ground_truths[qid]:
                nl = 1 - ratio(prediction, x)
                if nl < theta:
                    score = 1 - nl
                    if score > max_value:
                        max_value = score
            count += 1
            anls_score += max_value
    return anls_score / count


class Anls(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": {
                        "question_id": datasets.Value("string"),
                        "prediction_text": datasets.Value("string"),
                    },
                    "references": {
                        "question_id": datasets.Value("string"),
                        "answers": datasets.features.Sequence(datasets.Value("string")),
                    },
                }
            ),
        )

    def _compute(self, predictions, references):
        ground_truths = {x["question_id"]: x["answers"] for x in references}
        predictions = {x["question_id"]: x["prediction_text"] for x in predictions}
        anls_score = compute_score(predictions=predictions, ground_truths=ground_truths)
        return {"anls_score": anls_score}
```
Hi again, I'm so sorry for the delayed reply. Yes, of course. I think that if you added ANLS accuracy to your source code, your accuracy could be better. About your source, I think you don't account for needing 3 matching answers out of the 10 answers. Maybe I'm wrong.
But I think that, following M4C, you should add `EvalAIAnswerProcessor`, and then you can use this metric through the `TextVQAAccuracy` class. I used this code in your source code, but my problem is that I don't know how to convert the predicted answer and ground-truth answer in your code from tensors to strings (because I think `TextVQAAccuracy` needs them as strings). I wrote my changes to the code below and got some errors. Could you help me correct them? Of course, I may be wrong in general.
```python
from m4c_evaluators import TextVQAAccuracyEvaluator

evaluator = TextVQAAccuracyEvaluator()


class LaTrForVQA(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        answer_vector = self.forward(batch)

        label2id = {}
        for ans in tqdm(answer_vector):
            for word in ans.split(" "):
                if word not in label2id:
                    label2id[word] = {'id': current_word_id, 'count': 1}
                    current_word_id += 1
                else:
                    label2id[word]['count'] += 1
        id2label = ["" for _ in range(current_word_id)]
        convert_token_to_answer(batch['answer'], id2label)

        predictions = []
        for (pred, gt) in zip(answer_vector, batch['answer']):
            predictions.append({
                "pred_answer": pred.detach().cpu(),
                "gt_answers": gt.detach().cpu()
            })

        accuracy = evaluator.eval_pred_list(predictions)
        accuracy = torch.tensor(accuracy).cuda()

        loss = nn.CrossEntropyLoss()(answer_vector.reshape(-1, self.config['classes']), batch['answer'].reshape(-1))
        _, preds = torch.max(answer_vector, dim=-1)
        train_acc = accuracy(preds, batch['answer'])
        train_acc = torch.tensor(train_acc)

        self.log('train_ce_loss', loss, prog_bar=True)
        self.log('train_acc', train_acc, prog_bar=True)
        self.training_losses.append(loss.item())

        return loss
```
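For the tensor-to-string conversion I mean, maybe something like this could work (just a sketch on my side; it assumes a Hugging Face tokenizer such as the repo's T5 tokenizer, and that `batch['answer']` holds token ids):

```python
# Rough sketch (my guess, not tested): decode token-id tensors into strings before
# passing them to TextVQAAccuracyEvaluator.
def ids_to_answers(token_ids, tokenizer):
    """Convert a (batch, seq_len) tensor of token ids into a list of answer strings."""
    return [tokenizer.decode(ids, skip_special_tokens=True)
            for ids in token_ids.detach().cpu().tolist()]

# e.g. inside training_step:
# _, pred_ids = torch.max(answer_vector, dim=-1)
# predictions = [{"pred_answer": p, "gt_answers": [g]}
#                for p, g in zip(ids_to_answers(pred_ids, tokenizer),
#                                ids_to_answers(batch['answer'], tokenizer))]
```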
Hi @tinaboya2023, I tried to make some changes to the evaluation method according to your idea, but the effect was poor. Could you please help me check the problems in my method? Maybe I'm wrong.
```python
from m4c_evaluators import TextVQAAccuracyEvaluator

evaluator = TextVQAAccuracyEvaluator()
predictions = []


def calculate_acc_score(pred, gt):
    ## Function ignores the calculation of padding part
    ## Shape (seq_len, seq_len)
    mask = torch.clamp(gt, min=0, max=1)
    last_non_zero_argument = (mask != 0).nonzero()[1][-1]

    pred = pred[:last_non_zero_argument]
    gt = gt[:last_non_zero_argument]  ## Include all the arguments till the first padding index

    pred_answer = convert_token_to_ques(pred, tokenizer)
    gt_answer = convert_token_to_ques(gt, tokenizer)
    predictions.append({
        "pred_answer": pred_answer,
        "gt_answers": gt_answer,
    })


def calculate_metrics(self, prediction, labels):
    ## Calculate the accuracy score between the prediction and ground label for a batch, with considering the pad sequence
    batch_size = len(prediction)
    ac_score = 0

    for (pred, gt) in zip(prediction, labels):
        calculate_acc_score(pred.detach().cpu(), gt.detach().cpu())
    ac_score = evaluator.eval_pred_list(predictions)
    ac_score = torch.tensor(ac_score).cuda()
    return ac_score
```
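A per-batch variant of the same idea might look like this (only a sketch; it reuses `convert_token_to_ques`, `tokenizer`, and `evaluator` from above, keeps the prediction list local so entries from earlier batches don't accumulate, and wraps `gt_answers` in a list since the evaluator iterates over the ground-truth answers):

```python
# Sketch of a per-batch variant (assumes convert_token_to_ques, tokenizer and evaluator from above).
def calculate_metrics_per_batch(prediction, labels):
    batch_predictions = []  # local list, so earlier batches don't accumulate
    for pred, gt in zip(prediction, labels):
        pred, gt = pred.detach().cpu(), gt.detach().cpu()
        last_non_zero = int((gt != 0).nonzero()[-1])  # drop the padded tail
        batch_predictions.append({
            "pred_answer": convert_token_to_ques(pred[:last_non_zero], tokenizer),
            # the evaluator iterates over gt_answers, so pass a list of answer strings
            "gt_answers": [convert_token_to_ques(gt[:last_non_zero], tokenizer)],
        })
    return evaluator.eval_pred_list(batch_predictions)
```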
Hi @FrankZxShen and @tinaboya2023, sorry for the late reply. I have been a bit busy with some of my work.
I will be starting to modify this repository again, since there are a few things which I felt were missing. Of course, the main reason was that I had just been exploring how to implement the models, and had no idea about the evaluation criteria of TextVQA.
Here is what I am planning now. Currently, the entire approach is abstractive in nature, but I believe the authors took a generative approach, since it removes a lot of data preparation steps as well as the problem of answers not being present in the context:

> Our model is generative in nature and as such alleviates the problem of vocabulary reliance current methods suffer from

Earlier, I wasn't even sure how to use T5 for it, but now I have implemented a similar paper, and after going through T5's implementation from Hugging Face, I think I can improve it.
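To make that concrete, here is a minimal sketch of a generative question-answering step with T5 from Hugging Face (illustrative only, not this repo's code; the checkpoint name and the question/context strings are placeholders):

```python
# Sketch of a generative QA step with T5 (illustrative placeholders, not the repo's pipeline).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Encode the question and the OCR context in a single input sequence; the answer is the target text.
inputs = tokenizer("question: what team is shown? context: Denver Broncos 1998",
                   return_tensors="pt")
labels = tokenizer("Denver Broncos", return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss            # teacher-forced training loss
generated = model.generate(**inputs, max_length=16)   # free-running decoding at inference
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

The nice part of a generative formulation is that the answer does not need to come from a fixed vocabulary or appear verbatim in the context.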
I will also shortly try to add the pre-training script, as I was able to prepare a subset of the IDL dataset for pre-training. I know it is a bit of a big task, but I will try to add it.
And as mentioned, I am trying to learn along the way, and would really love to hear your suggestions and comments.
For the remark made by @tinaboya2023, I will have a look at the metric and let you know soon. I have been busy learning and studying, so everything got delayed a lot. Again, apologies, and looking forward to implementing the repo.
Hi @FrankZxShen, could I ask 2 questions?

1) When you run with these changes, do you encounter the error below? If yes, how did you manage it?
I changed `len(answers)` in m4c_evaluator.py but couldn't solve it.

```
AssertionError                            Traceback (most recent call last)
<ipython-input-38-324898bb4d9a> in <cell line: 1>()
----> 1 trainer.fit(model, datamodule)

14 frames
/content/m4c_evaluators.py in _compute_answer_scores(self, raw_answers)
    224
    225         answers = [self.answer_processor(a) for a in raw_answers]
--> 226         assert len(answers) == 10
    227         gt_answers = list(enumerate(answers))
    228         unique_answers = set(answers)

AssertionError:
```

2) And why do you use `convert_token_to_ques` instead of `convert_token_to_answer`, while they are different from each other? Maybe you would reach a better answer.
Hi @uakarsh
Thank you for your follow-up. I am very eager to see the modifications soon. It seems that if these changes are applied, the accuracy will be much higher.
Hi @tinaboya2023, for your 2nd question, I believe using `convert_token_to_ques` is the correct approach. The function `convert_token_to_answer` was actually used because I treated it as a classification problem and hence assigned a class to each word, which is incorrect. This happened because of my mistake in framing the problem, especially in the decoding section. Once I asked the authors, I got to know my mistakes.
Hey @tinaboya2023
For the first question, I commented out `# assert len(answers) == 10`, but I'm not sure if what I'm doing is correct. I'm still learning about TextVQA accuracy; there's something I don't understand.
Thank you for your work! @uakarsh I hope the problem can be resolved soon.
Hi @FrankZxShen
Thank you for your reply.
You know, for my first question, I encountered the error above. Do you still encounter this error too, or not?
For my second question, we have to keep in mind that we are dealing with 10 separate answers, not one answer or one question. Of course, I may be wrong.
Hi again @tinaboya2023
This is my change to `_compute_answer_scores`:
```python
def _compute_answer_scores(self, raw_answers):
    """
    compute the accuracy (soft score) of human answers
    """
    answers = [self.answer_processor(a) for a in raw_answers]
    # assert len(answers) == 10
    gt_answers = list(enumerate(answers))
    unique_answers = set(answers)
    unique_answer_scores = {}

    for unique_answer in unique_answers:
        accs = []
        for gt_answer in gt_answers:
            other_answers = [
                item for item in gt_answers if item != gt_answer
            ]
            matching_answers = [
                item for item in other_answers if item[1] == unique_answer
            ]
            acc = min(1, float(len(matching_answers)) / 3)
            accs.append(acc)
        unique_answer_scores[unique_answer] = sum(accs) / len(accs)

    return unique_answer_scores
```
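For what it's worth, here is a toy call showing the input this function expects (my illustration only, treating the snippet above as a standalone function and mimicking `answer_processor` with `str.lower`). `raw_answers` should be the list of 10 human answers for one question, which is why the original `assert len(answers) == 10` is there; if a single string slips in instead, the list comprehension iterates over its characters and the assert fails.

```python
from types import SimpleNamespace

# Toy illustration only: answer_processor is replaced by str.lower for the demo.
demo_self = SimpleNamespace(answer_processor=str.lower)
raw_answers = ["Denver Broncos"] * 8 + ["broncos"] * 2  # the 10 human answers for one question

print(_compute_answer_scores(demo_self, raw_answers))
# -> roughly {'denver broncos': 1.0, 'broncos': 0.6}
```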
Of course, we can wait for changes from the repo author. I'm not sure this is the right answer to your questions.
Hi Mr @FrankZxShen
I'm so sorry for the delayed reply; I had some problems.
Could I ask a question?
Why did you remove `len(answers) == 10`, while it is crucial information for us?
Hi @tinaboya2023 Thank you for your reply. I'm very sorry. It should be my mistake. I don't know how to modify it :(
Hi @FrankZxShen
You know, I ran this code with your changes and the accuracy became better than before.
But I think we are making a mistake and ignoring one fact: an answer only counts as fully correct when at least 3 of the 10 ground-truth answers match it, and we have no function in the code to count how many ground-truth answers are matched.
I think we should have something like `model_output["scores"]` in metrics.py to find the number of correct answers.
Maybe Mr @uakarsh understands better what I mean and can explain it to us.
I had the same question. As you can see here, in the last part of the notebook there are multiple answers, and I am not sure how to handle them. I figured out encoding the context and question in a single sentence (since there is a need to include both the question and the context); however, I am still not sure how to encode the answer.
By the way, I realized that there is a need to rewrite a part of the codebase (since I complicated stuff earlier); however, I will complete it. I found some references here; maybe they can help with the evaluation part. Could you suggest any way in which we can encode the answer and evaluate it? Once that is clear, I can handle the other part.
Here is what I am thinking: take one of the answers from the given answers, as done here. Maybe that is good to go? I would be going through this repo to know more.
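For instance, a rough sketch of that idea (my illustration only):

```python
# Sketch: choose one training target from the list of human answers (illustrative only).
from collections import Counter

def pick_target_answer(answers):
    """Return the most common answer string, falling back to the first one."""
    if not answers:
        return ""
    most_common, count = Counter(answers).most_common(1)[0]
    return most_common if count > 1 else answers[0]

print(pick_target_answer(["denver broncos"] * 7 + ["broncos"] * 3))  # 'denver broncos'
```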
Any comments @tinaboya2023 @FrankZxShen?
Hi @FrankZxShen @tinaboya2023, I tried implementing the metrics (since the other parts have been completed). It is here. As @FrankZxShen mentioned, I commented out the `assert` part, but I am still not getting 100% accuracy (I tried giving the same label as both the ground truth and the predicted answer). The code is taken from here, and although it is the same as that of mmf, I believe there is some mistake in the implementation of the metric, which I will be looking into. Once this is done, I guess all the parts can be integrated easily to train and see the results.
Hi @uakarsh, @FrankZxShen, I'm so sorry for the delayed reply. Thank you so much for updating your source code; it makes more sense. But it seems the accuracy doesn't get better. I think it is because the vocab is not used. How could we use vocab files, like THIS function of this code does?
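For example, something like this is what I imagine (only a sketch; I'm assuming the vocab file has one answer per line, and the file name is a placeholder):

```python
# Sketch: load a fixed answer vocabulary (assumes one answer per line; file name is a placeholder).
def load_answer_vocab(path="textvqa_answer_vocab.txt"):
    with open(path, encoding="utf-8") as f:
        answers = [line.strip() for line in f if line.strip()]
    answer2id = {ans: i for i, ans in enumerate(answers)}
    return answers, answer2id

# id2answer, answer2id = load_answer_vocab()
```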
I have another question too, about following, step by step, what you wrote in `class LaTrForVQA(pl.LightningModule):` with `training_step` and `optimizer_step` in your previous code:
```python
## https://stackoverflow.com/questions/69899602/linear-decay-as-learning-rate-scheduler-pytorch
def polynomial(base_lr, iter, max_iter=1e5, power=1):
    return base_lr * ((1 - float(iter) / max_iter) ** power)


class LaTrForVQA(pl.LightningModule):
    def __init__(self, config, learning_rate=1e-4, max_steps=100000 // 2):
        super(LaTrForVQA, self).__init__()

        self.config = config
        self.save_hyperparameters()
        self.latr = LaTr_for_finetuning(config)
        self.training_losses = []
        self.validation_losses = []
        self.max_steps = max_steps

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.hparams['learning_rate'])

    def forward(self, batch_dict):
        boxes = batch_dict['boxes']
        img = batch_dict['img']
        question = batch_dict['question']
        words = batch_dict['tokenized_words']
        answer_vector = self.latr(lang_vect=words,
                                  spatial_vect=boxes,
                                  img_vect=img,
                                  quest_vect=question
                                  )
        return answer_vector

    def calculate_metrics(self, prediction, labels):
        ## Calculate the accuracy score between the prediction and ground label for a batch, with considering the pad sequence
        batch_size = len(prediction)
        ac_score = 0

        for (pred, gt) in zip(prediction, labels):
            ac_score += calculate_acc_score(pred.detach().cpu(), gt.detach().cpu())
        ac_score = ac_score / batch_size
        return ac_score

    def training_step(self, batch, batch_idx):
        answer_vector = self.forward(batch)

        ## https://discuss.huggingface.co/t/bertformaskedlm-s-loss-and-scores-how-the-loss-is-computed/607/2
        loss = nn.CrossEntropyLoss()(answer_vector.reshape(-1, self.config['classes']), batch['answer'].reshape(-1))
        _, preds = torch.max(answer_vector, dim=-1)

        ## Calculating the accuracy score
        train_acc = self.calculate_metrics(preds, batch['answer'])
        train_acc = torch.tensor(train_acc)

        ## Logging
        self.log('train_ce_loss', loss, prog_bar=True)
        self.log('train_acc', train_acc, prog_bar=True)
        self.training_losses.append(loss.item())

        return loss

    def validation_step(self, batch, batch_idx):
        logits = self.forward(batch)
        loss = nn.CrossEntropyLoss()(logits.reshape(-1, self.config['classes']), batch['answer'].reshape(-1))
        _, preds = torch.max(logits, dim=-1)

        ## Validation Accuracy
        val_acc = self.calculate_metrics(preds.cpu(), batch['answer'].cpu())
        val_acc = torch.tensor(val_acc)

        ## Logging
        self.log('val_ce_loss', loss, prog_bar=True)
        self.log('val_acc', val_acc, prog_bar=True)

        return {'val_loss': loss, 'val_acc': val_acc}

    ## For the fine-tuning stage, the warm-up period is set to 1,000 steps and then linearly decayed to zero, pg. 12 of the paper
    ## Refer here: https://github.com/Lightning-AI/lightning/issues/328#issuecomment-550114178
    def optimizer_step(self, epoch_nb, batch_nb, optimizer, optimizer_i, opt_closure=None, on_tpu=False,
                       using_native_amp=False, using_lbfgs=False):

        ## Warmup for 1000 steps
        if self.trainer.global_step < 1000:
            lr_scale = min(1., float(self.trainer.global_step + 1) / 1000.)
            for pg in optimizer.param_groups:
                pg['lr'] = lr_scale * self.hparams.learning_rate

        ## Linear Decay
        else:
            for pg in optimizer.param_groups:
                pg['lr'] = polynomial(self.hparams.learning_rate, self.trainer.global_step, max_iter=self.max_steps)

        optimizer.step(opt_closure)
        optimizer.zero_grad()
```
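(For context, this is roughly how I run it; `config` and `datamodule` here are placeholders for my own setup:)

```python
# Sketch: wiring the LightningModule into a trainer (config/datamodule are placeholders).
import pytorch_lightning as pl

model = LaTrForVQA(config, learning_rate=1e-4, max_steps=50000)
trainer = pl.Trainer(max_steps=model.max_steps)
trainer.fit(model, datamodule)  # the same call that appears in the traceback above
```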
So now some parts of this class probably need to change, for example `self.latr = LaTr_for_finetuning(config)`.
What change can be made in this part to see the result?
Hi @tinaboya2023, I did try to make a set of notebooks here: https://github.com/uakarsh/latr/tree/main/examples/new_textvqa
I think it should be helpful.
Hi again @uakarsh,
Thank you for your reply.
I think none of the files mention a vocab file, and none of them run `class LaTrForVQA(pl.LightningModule)` like your previous code did (in order to run TensorBoard); they only calculate accuracy. For example, how can I change the following line of code? (Or maybe I'm wrong.)
`self.latr = LaTr_for_finetuning(config)`
Hi, in the `calculate_acc_score` function it seems you calculate the evaluation with only a sum and an average, like `accuracy_score` in Python, but in fact for TextVQA maybe you should calculate the evaluation with the function below. Of course, maybe I'm wrong.

`Acc(ans) = min(ha / 3, 1)`, where `ha` is the number of the 10 human answers that match `ans`.
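A small sketch of that formula (my illustration, without the leave-one-out averaging that the mmf evaluator above adds):

```python
# Sketch of the TextVQA soft accuracy: Acc(ans) = min(ha / 3, 1),
# where ha is how many of the 10 human answers match the predicted answer.
def soft_accuracy(pred_answer, gt_answers):
    pred_answer = pred_answer.strip().lower()
    matches = sum(1 for a in gt_answers if a.strip().lower() == pred_answer)
    return min(matches / 3.0, 1.0)

gt = ["Denver Broncos"] * 8 + ["broncos"] * 2
print(soft_accuracy("denver broncos", gt))  # 1.0  (8 matches -> min(8/3, 1))
print(soft_accuracy("broncos", gt))         # 0.666... (2 matches -> 2/3)
```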