j6mes / acl2021-factual-error-correction


Running the code on custom dataset without the FEVER DB file #2

Open vnik18 opened 3 years ago

vnik18 commented 3 years ago

Hi,

Is it possible to run the masker-corrector module of this code (src/error_correction/modelling/error_correction_module.py) without using the FEVER sqlite3 database file?

I have my own dataset with the evidence text already retrieved, so I am hoping to skip the step of retrieving information from the FEVER database. By any chance, are there any intermediate output files generated after text has been retrieved from the FEVER database that I could look at?

Thank you!

j6mes commented 3 years ago

Hi, the intermediate outputs from the maskers (with IR-selected evidence) have been added to the Google Drive folder. With IR evidence they don't actually need the FEVER database, but the dataset loader opens the database connection anyway. An easy fix is to comment out line 27 in the mask_based_correction_reader file. I'll make a change soon so that the database is only loaded if needed.
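
For illustration only (the class and attribute names below are invented, not the actual code in the reader), the "only load it if needed" change amounts to opening the connection lazily:

import sqlite3


class LazyFeverDB:
    """Opens the FEVER sqlite database only if something actually queries it.

    Sketch only: names and structure do not match mask_based_correction_reader.py exactly.
    """

    def __init__(self, db_path):
        self.db_path = db_path
        self._conn = None  # connection is not opened eagerly

    def connection(self):
        # Runs that already carry their evidence in pipeline_text never call this,
        # so the sqlite file is never touched.
        if self._conn is None:
            self._conn = sqlite3.connect(self.db_path)
        return self._conn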

vnik18 commented 3 years ago

@j6mes Okay. Considering this example from the file heuristic_gold_dev_genre_50_2.jsonl:

{
  "mutated": "Exercise is bad for heart health.",
  "original": "Exercise is good for heart health.",
  "mutation": "substitute_similar",
  "claim_id": 3518,
  "original_id": 3517,
  "sentence_id": 1542,
  "verdict": "REFUTES",
  "evidence": [{"annotation_id": 11271, "verdict_id": 14203, "page": "Heart", "line": 19}],
  "pipeline_text": [
    ["Physical exercise", "non-pharmaceutical sleep aid to treat diseases such as insomnia , help promote or maintain positive self-esteem , improve mental health , maintain steady digestion and treat constipation and gas , regulate fertility health , and augment an individual 's sex appeal or body image , which has been found to"],
    ["Physical exercise", "be linked with higher levels of self-esteem . Childhood obesity is a growing global concern , and physical exercise may help decrease some of the effects of childhood and adult obesity . Some care providers call exercise the `` miracle '' or `` wonder '' drug -- alluding to the"]
  ],
  "original_claim": "Exercise is bad for heart health .",
  "master_explanation": [2, 3, 4]
}

Is the evidence text stored in the field pipeline_text of the intermediate masker output file?

If yes, does this mean that the fields claim_id, original_id, sentence_id and evidence (which includes annotation_id, verdict_id, page, line) are not useful once the evidence text has been extracted and written into the pipeline_text field? Can these fields be removed or replaced with dummy values?

j6mes commented 3 years ago

If pipeline_evidence is set (a list of 2-tuples of (page name, text)), the evidence field isn't used by the dataset loader.

vnik18 commented 3 years ago

@j6mes Do you mean pipeline_text rather than pipeline_evidence? In the example above, pipeline_text is already a list of two-element lists (page name and text), so I will try using the same format for my own data.

Also, what about the fields claim_id, original_id and sentence_id? Are they used by the dataset loader?
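
For reference, this is roughly the kind of record I am planning to write out, one JSON object per line. The contents below are invented, and I am putting dummy values in the ID fields on the assumption that they are only passed through:

import json

# Hypothetical custom record: everything here is invented example content.
record = {
    "mutated": "The Eiffel Tower is located in Berlin.",
    "original_claim": "The Eiffel Tower is located in Berlin .",
    "master_explanation": [6],   # token indices to replace with [MASK]
    "verdict": "REFUTES",
    "claim_id": 0,               # dummy value (assumed unused)
    "original_id": 0,            # dummy value (assumed unused)
    "sentence_id": 0,            # dummy value (assumed unused)
    "evidence": [],              # assumed ignored when pipeline_text is present
    "pipeline_text": [
        ["Eiffel Tower",
         "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris , France ."],
    ],
}

with open("my_dataset.jsonl", "a") as f:  # hypothetical file name
    f.write(json.dumps(record) + "\n")    # one record per line, JSONL style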

j6mes commented 3 years ago

Yes, I meant pipeline_text. I think any extra values are just passed through to the metadata field and are ignored by the model.

vnik18 commented 3 years ago

Thank you!

vnik18 commented 3 years ago

@j6mes Hi, I have a couple of questions about the data format of your model. In the following example:

{
  "prediction": "correction: Penguin Books revolutionized publishing in the 1940s.",
  "actual": "correction: Penguin Books revolutionized publishing in the 1920s.",
  "metadata": {
    "source": "Penguin Books [MASK] publishing in the [MASK] .",
    "target": "Penguin Books revolutionized publishing in the 1920s .",
    "evidence": "title: Penguin Books context: Penguin Books is a British publishing house . Penguin revolutionised publishing in the 1930s through its inexpensive paperbacks , sold through Woolworths ### title: Penguin Books context: '' , now the `` Big Five '' . Penguin Books is a British publishing house . It was founded in 1935 by Sir Allen Lane as a line of the publishers The Bodley Head , only becoming a separate company the following year . Penguin revolutionised publishing in the",
    "mutation_type": "substitute_similar",
    "veracity": "REFUTES"
  }
}

What does the "actual" field (whose value starts with "correction:") mean? I assumed it would be the correct statement that the model should have generated, but instead it contains the mutated, incorrect version of the correct statement.

Also, the 'source' field contains the masked sentence that is input to the model. But the 'target' field contains the incorrect, mutated sentence and not the correct sentence that the model is supposed to learn to generate. In this case, does the model never see the correct version of the mutated/masked statement, except in the evidence?

Thank you.

j6mes commented 3 years ago

There are a few caveats to this. For the distant-supervision objective, it's assumed that the model doesn't have access to the reference correction; instead, it's trying to recover the input sentence as an auto-encoder. For scoring, we have to use the info in the metadata to compare what was predicted against what the claim was before correction. I'll see if I can make this clearer in the documentation. I had to do a lot of cleaning before making the repo public, and perhaps there's an easier way I can present all this info and ensure that it's consistent.
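
As a rough illustration only (this is not the repo's evaluation code; the field names are simply taken from the example record above, and predictions.jsonl is a hypothetical file name), the prediction file can be lined up for scoring like this:

import json

with open("predictions.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # strip the "correction:" prefix shown in the example output
        predicted = record["prediction"].replace("correction:", "", 1).strip()
        masked_input = record["metadata"]["source"]    # claim with [MASK] tokens
        pre_correction = record["metadata"]["target"]  # the claim before correction
        evidence = record["metadata"]["evidence"]
        # compare `predicted` against whichever reference you have access to
        # (e.g. your own gold correction), using the metadata for context
        print(masked_input, "->", predicted)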

vnik18 commented 3 years ago

@j6mes I see. So does this mean that both the "actual" field and the "target" field in the above example contain the incorrect, mutated version of the input statement? If I have access to the correct reference statement, can I provide it to the model as part of training? If so, how could I do that?

j6mes commented 3 years ago

There's a supervised version as well, which doesn't use any masking (see finetune_supervised.sh and finetune_supervised_pipeline.sh). If you want to mix supervision and masks, you could either train a supervised model first and then fine-tune on masks, or combine the supervised and mask-based readers from this folder to make a reader that understands both tasks: https://github.com/j6mes/2021-acl-factual-error-correction/blob/main/src/error_correction/modelling/reader/supervised_correction_reader.py
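
Very roughly, such a combined reader could look like the sketch below. The class and method names here are placeholders, not the repo's actual interfaces:

class MixedCorrectionReader:
    """Sketch of a reader that handles both the supervised and the masked task.

    Placeholder interface: the real readers in the repo may expose different
    method names and signatures.
    """

    def __init__(self, supervised_reader, mask_reader):
        self.supervised_reader = supervised_reader
        self.mask_reader = mask_reader

    def generate(self, instance):
        if instance.get("original"):
            # gold correction available: emit the supervised source/target pair
            yield from self.supervised_reader.generate(instance)
        else:
            # no gold correction: fall back to the masked auto-encoding pair
            yield from self.mask_reader.generate(instance)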

vnik18 commented 3 years ago

@j6mes Thank you for replying. I will look into the supervised version. Regarding the mask_based_correction_reader.py file, I have a question about the code snippet below:

claim_tokens = instance["original_claim"].split()
masked_claim = (
    instance["master_explanation"]
    if "master_explanation" in instance
    else instance["claim_tokens"]
)
a = {
    "source": " ".join(
        [
            token if idx not in masked_claim else "[MASK]"
            for idx, token in enumerate(claim_tokens)
        ]
    ),
    "target": " ".join(claim_tokens),
}

Both the source and the target fields of the training data are coming from the variable instance['original_claim'], which in turn contains the mutated version of the input sentence.

So it seems that the model being trained in the masked version never has access to the correct reference sentence. In such a case, could you please clarify how it could make a correction to a masked input sentence at test time? Would it just use information from the evidence for this?
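
To make the question concrete, here is that logic applied to the "Exercise" record quoted earlier in this thread:

# Walking the quoted reader logic through the record shown earlier.
instance = {
    "original_claim": "Exercise is bad for heart health .",
    "master_explanation": [2, 3, 4],
}

claim_tokens = instance["original_claim"].split()
masked_claim = instance["master_explanation"]

source = " ".join(
    token if idx not in masked_claim else "[MASK]"
    for idx, token in enumerate(claim_tokens)
)
target = " ".join(claim_tokens)

print(source)  # Exercise is [MASK] [MASK] [MASK] health .
print(target)  # Exercise is bad for heart health .  (still the mutated claim)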