gaozhiguang opened 3 years ago
The instructions are included in the 'Running pre-trained factuality models' section of the readme. Set $MODEL_TYPE to 'electra_sentence'. $INPUT_DIR should point to the location of the model.
For CNN specifically, lowercase the input article and the summary, and run them through a PTB tokenizer.
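For reference, here is a minimal, illustrative sketch of the "lowercase + PTB tokenize" step. This is NOT the repo's actual tokenizer (the code uses Stanford CoreNLP); it is a simplified stdlib-only approximation of a few Penn Treebank tokenization rules:

```python
import re

# Illustrative PTB-style tokenizer sketch (an approximation, not the repo's
# CoreNLP-based tokenizer). Lowercases the text and splits off common
# punctuation the way Penn Treebank tokenization does.
def ptb_tokenize(text):
    text = text.lower()
    text = re.sub(r"([;@#$%&?!,])", r" \1 ", text)                 # split symbols
    text = re.sub(r"([^.])(\.)(\s*$)", r"\1 \2", text)             # sentence-final period
    text = re.sub(r"(n't|'ve|'re|'ll|'s|'d|'m)\b", r" \1", text)   # split contractions
    return text.split()

print(" ".join(ptb_tokenize("Teams covered 60% of the zone.")))
# -> teams covered 60 % of the zone .
```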
Hi, thanks, but I am still a little confused: what is a PTB tokenizer? The code in train_utils.py uses Stanford CoreNLP for tokenization. In addition, in some cases the function get_tokenized_text(input_text, nlp) raises an error. I believe the reason is that the call

tokenized_json = nlp.annotate(input_text, properties={'annotators': 'tokenize', 'outputFormat': 'json', 'ssplit.isOneSentence': True})

cannot run properly. Do you know why? Thanks again.
Ah yes, you are right! There's no need to tokenize if you use that script.
Can you send specific examples on which you get that error?
Hi, this will cause an error:

from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')
line = "teams are scouring the depths of a remote part of the southern indian ocean . so far , they 've covered about 60 % of the priority search zone without reporting any trace of the airliner . families of passengers and crew members still have no answers about what happened to their loved ones"
parse = nlp.annotate(line, properties={'annotators': 'tokenize,ssplit,pos,depparse', 'outputFormat': 'json', 'ssplit.isOneSentence': True})
print(parse)
Hi, I made another try. When I set

line = "teams are scouring the depths of a remote part of the southern indian ocean . so far , they 've covered about 60 of the priority search zone without reporting any trace of the airliner . families of passengers and crew members still have no answers about what happened to their loved ones"

it works well. The difference is the symbol "%" in "60 %": when I change it to "60", it works.
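One plausible explanation (an assumption, not confirmed from this repo): the CoreNLP server URL-decodes the text it receives, so a bare '%' in the input can be misread as the start of a percent-escape. A common workaround is to percent-encode the text before sending it:

```python
from urllib.parse import quote

# Hypothetical workaround: percent-encode the text before handing it to the
# CoreNLP server, so a literal '%' arrives as '%25' and survives URL-decoding.
line = "they 've covered about 60 % of the priority search zone"
safe_line = quote(line)
print(safe_line)  # spaces become %20, '%' becomes %25
# then e.g. nlp.annotate(safe_line, properties={...})  # annotate call from above
```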
Hi, the instructions in the 'Running pre-trained factuality models' section are for preprocessed dev files; I don't know how to process my data into that format. Thanks.
Hi, there are detailed instructions further down in the README for how to run on non-preprocessed data.
But, very briefly, to evaluate non-preprocessed summaries you can run:
python3 evaluate_generated_outputs.py \
--model_type electra_dae \
--model_dir $MODEL_DIR \
--input_file sample_test.txt
The format of sample_test.txt is described in the README, along with additional information such as lowercasing and tokenization requirements.
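As a small illustration of the lowercasing requirement (the exact sample_test.txt format is in the README; the sentence here is just an example):

```shell
# Hypothetical preprocessing: lowercase the summary text before building
# the input file (tokenization requirements are described in the README).
printf '%s\n' "Teams covered 60% of the zone." | tr '[:upper:]' '[:lower:]'
# prints: teams covered 60% of the zone.
```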
Hi, thanks for the nice work. How can I use the ENT-C_sent-factuality model that was trained on data synthesized from CNN/DM on non-preprocessed (input, summary) pairs? Thanks again.