gaozhiguang opened 3 years ago
The instructions are included in the 'Running pre-trained factuality models' section of the readme. Set $MODEL_TYPE to 'electra_sentence'. $INPUT_DIR should point to the location of the model.
For CNN specifically, lowercase the input article and the summary, and run them through a PTB tokenizer.
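For reference, here is a minimal, illustrative sketch of the "lowercase + PTB tokenize" step. This is NOT the repo's actual tokenizer (the code uses Stanford CoreNLP); it is a simplified stdlib-only approximation of a few Penn Treebank tokenization rules:

```python
import re

# Illustrative PTB-style tokenizer sketch (an approximation, not the repo's
# CoreNLP-based tokenizer). Lowercases the text and splits off common
# punctuation the way Penn Treebank tokenization does.
def ptb_tokenize(text):
    text = text.lower()
    text = re.sub(r"([;@#$%&?!,])", r" \1 ", text)                 # split symbols
    text = re.sub(r"([^.])(\.)(\s*$)", r"\1 \2", text)             # sentence-final period
    text = re.sub(r"(n't|'ve|'re|'ll|'s|'d|'m)\b", r" \1", text)   # split contractions
    return text.split()

print(" ".join(ptb_tokenize("Teams covered 60% of the zone.")))
# -> teams covered 60 % of the zone .
```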
Hi, thanks, but I am still a little confused: what is a PTB tokenizer? The code in train_utils.py uses Stanford CoreNLP for tokenization. In addition, in some cases the function get_tokenized_text(input_text, nlp) raises an error. I believe the reason is that the call

tokenized_json = nlp.annotate(input_text, properties={'annotators': 'tokenize', 'outputFormat': 'json', 'ssplit.isOneSentence': True})

cannot run properly. Do you know why? Thanks again.
Ah yes, you are right! There's no need to tokenize if you use that script.
Can you send specific examples on which you get that error?
Hi, this will cause an error:

from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')
line = "teams are scouring the depths of a remote part of the southern indian ocean . so far , they 've covered about 60 % of the priority search zone without reporting any trace of the airliner . families of passengers and crew members still have no answers about what happened to their loved ones"
parse = nlp.annotate(line, properties={'annotators': 'tokenize,ssplit,pos,depparse', 'outputFormat': 'json', 'ssplit.isOneSentence': True})
print(parse)
Hi, I made another try. When I set

line = "teams are scouring the depths of a remote part of the southern indian ocean . so far , they 've covered about 60 of the priority search zone without reporting any trace of the airliner . families of passengers and crew members still have no answers about what happened to their loved ones"

it works well. The difference is the symbol "%" in "60 %": when I change it to "60", it works.
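One plausible explanation (an assumption, not confirmed from this repo): the CoreNLP server URL-decodes the text it receives, so a bare '%' in the input can be misread as the start of a percent-escape. A common workaround is to percent-encode the text before sending it:

```python
from urllib.parse import quote

# Hypothetical workaround: percent-encode the text before handing it to the
# CoreNLP server, so a literal '%' arrives as '%25' and survives URL-decoding.
line = "they 've covered about 60 % of the priority search zone"
safe_line = quote(line)
print(safe_line)  # spaces become %20, '%' becomes %25
# then e.g. nlp.annotate(safe_line, properties={...})  # annotate call from above
```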
Hi, the instructions in the 'Running pre-trained factuality models' section are for preprocessed dev files; I don't know how to process my data into that format. Thanks.
Hi, there are detailed instructions further down in the README for how to run on non-preprocessed data.
But, very briefly, to evaluate non-preprocessed summaries you can run:
python3 evaluate_generated_outputs.py \
--model_type electra_dae \
--model_dir $MODEL_DIR \
--input_file sample_test.txt
The format of sample_test.txt is described in the README, along with additional information such as lowercasing and tokenization requirements.
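As a small illustration of the lowercasing requirement (the exact sample_test.txt format is in the README; the sentence here is just an example):

```shell
# Hypothetical preprocessing: lowercase the summary text before building
# the input file (tokenization requirements are described in the README).
printf '%s\n' "Teams covered 60% of the zone." | tr '[:upper:]' '[:lower:]'
# prints: teams covered 60% of the zone.
```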
Hi, thanks for the nice work. How can I use the ENT-C_sent-factuality model that was trained on data synthesized from CNN/DM on non-preprocessed (input, summary) pairs? Thanks again.