Training dataset for the baseline segmenter

vatile / CWS-NAACL2019

Code and data for the NAACL 2019 paper "Improving Cross-Domain Chinese Word Segmentation with Word Embeddings"

Mozilla Public License 2.0


Training dataset for the baseline segmenter #1

Open countback opened 5 years ago

countback commented 5 years ago

Sorry to bother you. I read your NAACL 2019 paper and am very interested in this work. The improvements on cross-domain Chinese word segmentation (CWS) are exciting. I am currently working on related problems, and I wonder whether you could release the People's Daily January 2000 dataset used to pre-train the baseline segmenter. It would be a great help in reproducing the results reported in the paper and in understanding your algorithm. I would appreciate your reply. Thank you very much.

vatile commented 5 years ago

@countback Sorry, I'm afraid I cannot release the People's Daily dataset due to licensing issues. I suggest contacting the Institute of Computational Linguistics at Peking University for that kind of data.

If you just want to reproduce the results reported in the paper, you can use the data already segmented by the baseline segmenter, which is provided in the repo.