vnik18 opened this issue 3 years ago
Hi,
Is it possible to run the masker-corrector module of this code without using the FEVER sqlite3 database file in src/error_correction/modelling/error_correction_module.py? I have my own dataset with the evidence text already retrieved, so I am hoping to avoid the step of retrieving information from the FEVER database. By any chance, are any intermediate output files generated after text has been retrieved from the FEVER database that I could look at?
Thank you!
Hi, the intermediate outputs from the maskers (with IR-selected evidence) have been added to the Google Drive folder. With IR evidence, they don't actually need the FEVER database, but the dataset loader opens the database connection anyway. An easy fix is to comment out line 27 of the mask_based_correction_reader file. I'll make a change soon so that it is only loaded when needed.
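A minimal sketch of making that load lazy, assuming the reader holds a plain sqlite3 connection; the class and attribute names here are illustrative, not the repo's actual API:

import sqlite3

# Illustrative sketch only: defer opening the FEVER sqlite3 database until
# something actually asks for it. Instances whose evidence already sits in
# pipeline_text would then never touch the database file.
class LazyDatabaseReader:
    def __init__(self, database_path=None):
        self._database_path = database_path
        self._connection = None  # connection not opened yet

    @property
    def connection(self):
        if self._connection is None:
            self._connection = sqlite3.connect(self._database_path)
        return self._connection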
@j6mes Okay. Considering this example from the file heuristic_gold_dev_genre_50_2.jsonl:
{"mutated": "Exercise is bad for heart health.", "original": "Exercise is good for heart health.", "mutation": "substitute_similar", "claim_id": 3518, "original_id": 3517, "sentence_id": 1542, "verdict": "REFUTES", "evidence": [{"annotation_id": 11271, "verdict_id": 14203, "page": "Heart", "line": 19}], "pipeline_text": [["Physical exercise", "non-pharmaceutical sleep aid to treat diseases such as insomnia , help promote or maintain positive self-esteem , improve mental health , maintain steady digestion and treat constipation and gas , regulate fertility health , and augment an individual 's sex appeal or body image , which has been found to"], ["Physical exercise", "be linked with higher levels of self-esteem . Childhood obesity is a growing global concern , and physical exercise may help decrease some of the effects of childhood and adult obesity . Some care providers call exercise the `` miracle '' or `` wonder '' drug -- alluding to the"]], "original_claim": "Exercise is bad for heart health .", "master_explanation": [2, 3, 4]}
Is the evidence text stored in the pipeline_text field of the intermediate masker output file? If yes, does this mean that the fields claim_id, original_id, sentence_id, and evidence (which includes annotation_id, verdict_id, page, and line) are not useful once the evidence text has been extracted and written into the pipeline_text field? Can these fields be removed or replaced with dummy values?
If pipeline_evidence is set (a list of 2-tuples of page name and text), the evidence field isn't used by the dataset loader.
@j6mes Do you mean pipeline_text and not pipeline_evidence? In the above example, it is already in the form of a list of lists with 2 elements each (page name and text), so I will try using the same format for my own data.
Also, what about the fields claim_id, original_id, and sentence_id? Are they used by the dataset loader?
Yes, I meant pipeline_text. I think any extra values are just passed through to the metadata field and are ignored by the model.
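So a record for your own data might look something like the sketch below; the IDs are dummy placeholders, assuming only the textual fields and master_explanation are actually consumed:

{
  "mutated": "Your mutated claim text .",
  "original": "Your original claim text .",
  "mutation": "substitute_similar",
  "claim_id": 0,
  "original_id": 0,
  "sentence_id": 0,
  "verdict": "REFUTES",
  "evidence": [],
  "pipeline_text": [["Page title", "Retrieved evidence text for this claim ."]],
  "original_claim": "Your mutated claim text .",
  "master_explanation": [1]
}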
Thank you!
@j6mes Hi, I have a couple of questions about the data format of your model. In the following example,
{"prediction": "correction: Penguin Books revolutionized publishing in the 1940s.", "actual": "correction: Penguin Books revolutionized publishing in the 1920s.",
"metadata": {"source": "Penguin Books [MASK] publishing in the [MASK] .", "target": "Penguin Books revolutionized publishing in the 1920s .",
"evidence": "title: Penguin Books context: Penguin Books is a British publishing house . Penguin revolutionised publishing in the 1930s through its inexpensive paperbacks , sold through Woolworths ### title: Penguin Books context: '' , now the `` Big Five '' . Penguin Books is a British publishing house . It was founded in 1935 by Sir Allen Lane as a line of the publishers The Bodley Head , only becoming a separate company the following year . Penguin revolutionised publishing in the",
"mutation_type": "substitute_similar", "veracity": "REFUTES"}}
What does the "actual" field mean? I assumed it would be the correct statement that the model should have generated, but instead it holds the mutated, incorrect version of the correct statement.
Also, the 'source' field contains the masked sentence that is input to the model. But the 'target' field contains the incorrect, mutated sentence and not the correct sentence that the model is supposed to learn to generate. In this case, does the model never see the correct version of the mutated/masked statement, except in the evidence?
Thank you.
There are a few caveats to this. For the distant-supervision objective, it's assumed that the model doesn't have access to the reference correction; instead, it's trying to recover the input sentence as an auto-encoder. For scoring, we have to use the info in the metadata to compare what was predicted against what the claim was before correction. I'll see if I can make this clearer in the documentation. I had to do a lot of cleaning before making the repo public, and perhaps there's an easier way I can present all this info and ensure that it's consistent.
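As an illustration only (not the repo's actual evaluation code), recovering those comparison pairs from a prediction file like the one above might look like:

import json

# Illustrative sketch: pair each model prediction with the pre-correction
# claim and the evidence kept in the metadata field. Field names follow
# the example record above; the file path is whatever you saved.
def iter_scoring_pairs(path):
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            predicted = record["prediction"]
            before_correction = record["metadata"]["target"]
            evidence = record["metadata"]["evidence"]
            yield predicted, before_correction, evidence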
@j6mes I see. So does this mean that both the "actual" field and the "target" field from the above example contain the incorrect, mutated version of the input statement? If I have access to the correct reference statement, can I provide it as input to the model as part of training? If yes, how could I do that?
There's a supervised version as well which doesn't use any masking (see finetune_supervised.sh and finetune_supervised_pipeline.sh). If you want to mix supervision and masks, you could either train a supervised model first and then fine-tune on masks, or combine the supervised and mask_based readers from this folder to make a reader that understands both tasks: https://github.com/j6mes/2021-acl-factual-error-correction/blob/main/src/error_correction/modelling/reader/supervised_correction_reader.py
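A rough sketch of that second option; the class names and the callable interface are assumptions based on the repo's file names, not its exact API:

# Hypothetical sketch of a reader that understands both tasks. The routing
# rule (mask-based instances carry master_explanation) follows the
# discussion above; the wrapped readers stand in for the repo's classes.
class MixedCorrectionReader:
    def __init__(self, supervised_reader, mask_reader):
        self.supervised_reader = supervised_reader
        self.mask_reader = mask_reader

    def __call__(self, instance):
        if "master_explanation" in instance:
            return self.mask_reader(instance)    # mask-based example
        return self.supervised_reader(instance)  # fully supervised example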
@j6mes Thank you for replying. I will look into the supervised version. Regarding the mask_based_correction_reader.py file, I have a question about the code snippet below:
claim_tokens = instance["original_claim"].split()
masked_claim = (
    instance["master_explanation"]
    if "master_explanation" in instance
    else instance["claim_tokens"]
)
a = {
    "source": " ".join(
        [
            token if idx not in masked_claim else "[MASK]"
            for idx, token in enumerate(claim_tokens)
        ]
    ),
    "target": " ".join(claim_tokens),
}
Both the source and the target fields of the training data come from instance['original_claim'], which contains the mutated version of the input sentence.
So it seems that the model being trained in the masked version never has access to the correct reference sentence. In such a case, could you please clarify how it could make a correction to a masked input sentence at test time? Would it just use information from the evidence for this?