Hi @lorenzkuhn,
I wanted to bring to your attention a potential error in the computation of the P(True) baseline, unless I have misunderstood something here.
Currently, in the code snippet, only the first `len(tokenized_base_prompt)` targets are set to -100: https://github.com/lorenzkuhn/semantic_uncertainty/blob/27adbf0dc1bf056c771c205d89c2a79cbd82dc3a/code/get_prompting_based_uncertainty.py#L108-L113

However, this does not ignore the entire context when computing the NLL loss, since `prompt_true` also includes the `few_shot_prompt` before the `base_prompt`: https://github.com/lorenzkuhn/semantic_uncertainty/blob/27adbf0dc1bf056c771c205d89c2a79cbd82dc3a/code/get_prompting_based_uncertainty.py#L105-L106

Not masking the entire context in the NLL loss computation could therefore lead to inaccurate P(True) scores. Could you please provide some insights or clarification on this matter?
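For concreteness, here is a minimal sketch of the masking I would have expected, assuming `prompt_true` is built as `few_shot_prompt + base_prompt` followed by the answer token, and that the context's tokenization is a prefix of the full prompt's tokenization (variable names follow the linked snippet; the surrounding code is paraphrased rather than copied, and the prompts and model below are toy stand-ins):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Toy stand-ins for the script's prompts, for illustration only.
few_shot_prompt = 'Q: What is 2 + 2? Proposed answer: 4. Is the proposed answer true? True\n'
base_prompt = 'Q: What is the capital of France? Proposed answer: Paris. Is the proposed answer true?'
prompt_true = few_shot_prompt + base_prompt + ' True'

tokenizer = AutoTokenizer.from_pretrained('gpt2')  # placeholder model
model = AutoModelForCausalLM.from_pretrained('gpt2')

tokenized_prompt_true = torch.tensor(tokenizer(prompt_true)['input_ids'])
target_ids_true = tokenized_prompt_true.clone()

# Mask the *entire* context (few-shot examples + base prompt), not just the
# first len(tokenizer(base_prompt)['input_ids']) positions, so that only the
# trailing ' True' token contributes to the NLL (HF ignores -100 labels).
n_context_tokens = len(tokenizer(few_shot_prompt + base_prompt)['input_ids'])
target_ids_true[:n_context_tokens] = -100

with torch.no_grad():
    loss_true = model(tokenized_prompt_true.unsqueeze(0),
                      labels=target_ids_true.unsqueeze(0)).loss
print(loss_true.item())  # NLL of ' True' given the full context
```

The key difference from the current snippet is that the mask length is computed from `few_shot_prompt + base_prompt` rather than from `base_prompt` alone.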
In addition, the current code only uses `n_samples_to_use = 2000` samples for the P(True) baseline. Are the experimental settings for P(True) different from those of the other methods? I don't recall reading any explanation of this in the paper.