thunlp / SDLM-pytorch

Code accompanying the EMNLP 2018 paper "Language Modeling with Sparse Product of Sememe Experts"

Could you open-source the data preprocessing code for Headline Generation? Many thanks! #10

Closed n9705 closed 4 years ago

n9705 commented 4 years ago

Could you share the jieba user dictionary you extracted?

junyann commented 4 years ago

Hi @n9705! Thanks for your interest! Unfortunately, I don't have a script for the preprocessing pipeline, since the process is simple and I did most steps manually. The README for the headline generation experiment provides details about how to preprocess the dataset. To build the user dictionary, I just count the word frequencies of our processed People's Daily corpus and output the counter to a text file in the user dictionary format: each line contains a word and its frequency, separated by a space. Here is an example of the user dictionary provided by jieba. I also suggest using our preprocessed LCSTS dataset so that you don't need to build the user dictionary and tokenize the raw data yourself. Feel free to let me know if anything is unclear or further help is needed.
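For reference, the counting step can be as small as this (a minimal sketch; the file names are illustrative, not from the repo):

from collections import Counter

# Count word frequencies over the already-tokenized (space-separated) corpus.
counter = Counter()
with open('people_daily_tokenized.txt', encoding='utf-8') as f:
    for line in f:
        counter.update(line.split())

# Write jieba's user-dictionary format: one "word frequency" pair per line.
with open('user_dict.txt', 'w', encoding='utf-8') as f:
    for word, freq in counter.most_common():
        f.write(f'{word} {freq}\n')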

n9705 commented 4 years ago

Thank you for your kind reply. @junyann I need to use the model on other text that is not in the LCSTS dataset, so I have to segment the content myself. I have already built the user dictionary for jieba using the word frequencies of all the text here, and I use jieba.cut(content, HMM=False) to segment the LCSTS dataset, but the result differs from your preprocessed LCSTS dataset. Here is an example: the first line is the result of my segmentation method described above, and the second line is the corresponding line from your preprocessed LCSTS test.article.txt.

本文 总结 了 十个 可 穿戴 产品 的 设计 原则 , 而 这些 原则 , 同样 也 是 笔者 认为 是 这个 行业 最 吸引 人 的 地方 : 1 . 为 人们 解决 重复性 问题 ; 2 . 从 人 开始 , 而 不 是从 机器 开始 ; 3 . 要 引起 注意 , 但 不要 刻意 ; 4 . 提升 用户 能力 , 而 不是 取代 人

本文 总结 了 十 个 可 穿 戴 产品 的 设计 原则 , 而 这些 原则 , 同样 也 是 笔者 认为 是 这个 行业 最 吸引 人 的 地方 : 1 . 为 人们 解决 重复性 问题 ; 2 . 从 人 开始 , 而 不是 从 机器 开始 ; 3 . 要 引起 注意 , 但 不 要 刻意 ; 4 . 提升 用户 能力 , 而 不是 取代 人
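Concretely, my pipeline is roughly the following (a minimal sketch; the file name is illustrative, and registering the dictionary via jieba.load_userdict is my reading of the setup, not spelled out above):

import jieba

# Merge the extracted user dictionary into jieba's built-in dictionary
# (assumed call; the dictionary could also have been registered differently).
jieba.load_userdict('user_dict.txt')

# Segment with the HMM-based discovery of unseen words turned off.
content = '本文总结了十个可穿戴产品的设计原则'
print(' '.join(jieba.cut(content, HMM=False)))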

n9705 commented 4 years ago

@junyann I would appreciate it very much if you could give me more advice.

junyann commented 4 years ago

Hi @n9705! Thanks for providing the details, and sorry for the late reply. I think the issue is how you are using the user dictionary. jieba has its own built-in dictionary, so if you want to ensure that all segmentations come from your user dictionary, you need to (1) replace the built-in dictionary with your user dictionary and (2) turn off the HMM. Could you try running jieba.set_dictionary('user_dict.txt') before jieba.cut(content, HMM=False) and see if it works? Here are my results:

>>> jieba.set_dictionary('user_dict.txt')
>>> jieba.lcut('本文总结了十个可穿戴产品的设计原则,而这些原则,同样也是笔者认为是这个行业最吸引人的地方:1.为人们解决重复性问题;2.从人开始,而不是从机器开始;3.要引起注意,但不要刻意;4.提升用户能力,而不是取代人', HMM=False)
['本文', '总结', '了', '十', '个', '可', '穿', '戴', '产品', '的', '设计', '原则', ',', '而', '这些', '原则', ',', '同样', '也', '是', '笔者', '认为', '是', '这个', '行业', '最', '吸引', '人', '的', '地方', ':', '1', '.', '为', '人们', '解决', '重复性', '问题', ';', '2', '.', '从', '人', '开始', ',', '而', '不是', '从', '机器', '开始', ';', '3', '.', '要', '引起', '注意', ',', '但', '不', '要', '刻意', ';', '4', '.', '提升', '用户', '能力', ',', '而', '不是', '取代', '人']

n9705 commented 4 years ago

It works! Thanks for your kind help, sincerely!