musyoku / python-npylm

Unsupervised morphological analysis with a Bayesian hierarchical language model

Does this code have a new version? #2

Open timeconscious opened 3 years ago

timeconscious commented 3 years ago

Is there a newer version of this code? We have tried the current version several times, but it does not run successfully. Are there any specific settings for make install? We have tried this code on Mac and Ubuntu, using python3, boost_python3, and gcc5, but compilation fails. The following are the errors from the terminal. Could you help us figure out these problems?

First, on Mac we ran make install; the terminal shows "ld: library not found for -libboost_python38, clang: error: linker command failed with exit code 1 (use -v to see invocation), make: *** [install] Error 1". We have tried revising the environment variables and the BOOST and libboost paths.

Second, on another Mac, we ran make install; the terminal shows "ld: symbol(s) not found for architecture x86_64, clang: error: linker command failed with exit code 1 (use -v to see invocation), make: *** [install] Error 1".

Third, on Ubuntu, we ran python3 train.py -split 0.9 -l 8 -file YOUR_TEXTFILE; the terminal shows "Failed initializing Mecab. Please see the Readme for possible solutions" and "/usr/bin/ld: cannot find -libboost_serialization, cannot find -libboost_python35, collect2: error: ld returned 1 exit status, makefile:10: recipe for target 'install' failed".

We look forward to your reply. Thank you very much.

musyoku commented 3 years ago

I tried make install on my current MacBook (Mojave)

$ brew upgrade python3
$ brew upgrade boost-python3
$ python3 --version
Python 3.8.5

I edited the makefile as follows:

BOOST = /usr/local/Cellar/boost/1.73.0
LDFLAGS = `python3-config --ldflags --embed` -lboost_serialization -lboost_python38 -L$(BOOST)/lib

I changed the boost path and added the --embed option to python3-config.

make install worked for me.

$ make install
g++ `python3-config --includes` -std=c++14 -I/usr/local/Cellar/boost/1.73.0/include -shared -fPIC -march=native src/python.cpp src/python/*.cpp src/npylm/*.cpp src/npylm/lm/*.cpp `python3-config --ldflags --embed` -lboost_serialization -lboost_python38 -L/usr/local/Cellar/boost/1.73.0/lib -o run/npylm.so -O3
timeconscious commented 3 years ago

Thank you very much for your timely reply. We still have several problems.

First, make install now succeeds and produces 'npylm.so'. However, when we run "python3 train.py -split 0.9 -l 8 -file Our_TEXTFILE", a new error occurs on Ubuntu 16.04:

#train 17

dev 2

Traceback (most recent call last):
  File "train.py", line 151, in <module>
    main()
  File "train.py", line 107, in main
    model = npylm.model(dataset, args.max_word_length)  # specify the maximum possible word length
RuntimeError: locale::facet::_S_create_c_locale name not valid

We have tried changing the locale, but it did not help. Could you give us some suggestions?

Second, can this code output labels such as "B", "O", "E"? How can we output these labels? We need both the labels and the segmentation results.

musyoku commented 3 years ago

RuntimeError: locale::facet::_S_create_c_locale name not valid".

Compiling with the -O0 -g flags will help locate the error. In the makefile, change the install rule from

install: ## generate npylm.so
    $(CC) $(INCLUDE) $(SOFLAGS) src/python.cpp $(SOURCES) $(LDFLAGS) -o run/npylm.so -O3

to

install: ## generate npylm.so
    $(CC) $(INCLUDE) $(SOFLAGS) src/python.cpp $(SOURCES) $(LDFLAGS) -o run/npylm.so -O0 -g

then run make install.
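For reference, this particular error usually means the locale requested by the environment (LC_ALL / LC_CTYPE / LANG) is not actually installed on the machine, so std::locale("") fails inside the extension. A quick diagnostic sketch from Python (not part of this repo, just a way to see what the environment requests):

```python
import locale
import os

# Show which locale names the environment requests; these are the names
# the C++ runtime tries to construct when the extension calls std::locale("").
for var in ("LC_ALL", "LC_CTYPE", "LANG"):
    print(var, "=", os.environ.get(var))

try:
    # Mirrors std::locale(""): raises locale.Error when the requested
    # locale is not actually installed (common on minimal Ubuntu images).
    locale.setlocale(locale.LC_ALL, "")
    print("environment locale is valid")
except locale.Error as err:
    print("invalid environment locale:", err)
    # Installing the missing locale (e.g. via language packs) or exporting
    # a locale that exists on the system usually fixes the RuntimeError.
    locale.setlocale(locale.LC_ALL, "C")
```

If this reports an invalid locale, generating/installing that locale on the system (or exporting one that exists) should make the RuntimeError go away.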

can this code outputs labels? Such as "B","O","E". How can we output these labels?

I don't understand what "label" means, but if you want to run segmentation with the trained model, you can use viterbi.py to do that.

python3 train.py --train-filename /path/to/training/textfile --working-directory /path/to/result/directory
python3 viterbi.py --input-filename /path/to/test/textfile --working-directory /path/to/result/directory --output-directory /path/to/segmentation/result/directory
timeconscious commented 3 years ago

OK, we will try it. Thank you for your kind reply. Could you tell us why training the model takes so much time? Is there an introductory document for the algorithm? Could we contact you by email?

timeconscious commented 3 years ago

Can the training data and test data be the same? How much training data and how many iterations are needed to obtain a better model with higher accuracy on the test samples?

musyoku commented 3 years ago

Could you tell us why training the model takes so much time?

We estimate the model parameters (the optimal seating arrangement in the Chinese Restaurant Process) by 1) removing the customers assigned under the current segmentation, 2) sampling a new segmentation, and 3) adding the new customers back, each seated at a randomly selected table. This optimization process takes a long time to converge.
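The remove/sample/add loop above could be sketched with a toy unigram CRP like this (all names here are illustrative, not the repo's actual implementation; the real model is a hierarchical word/character n-gram with a much more careful segmentation sampler):

```python
import random
from collections import Counter

class CRP:
    """Toy Chinese Restaurant Process over word tokens (unigram only)."""
    def __init__(self, alpha=1.0, base=1e-4):
        self.alpha = alpha        # concentration parameter
        self.base = base          # crude stand-in for the base measure
        self.counts = Counter()   # customers currently seated per word
        self.total = 0

    def add_customer(self, word):
        self.counts[word] += 1
        self.total += 1

    def remove_customer(self, word):
        self.counts[word] -= 1
        self.total -= 1
        if self.counts[word] == 0:
            del self.counts[word]

    def prob(self, word):
        # Predictive probability: seated customers plus base-measure mass.
        return (self.counts[word] + self.alpha * self.base) / (self.total + self.alpha)

def sample_segmentation(sentence, crp, max_len=4):
    # Left-to-right random segmentation guided by unigram probabilities
    # (a stand-in for the model's proper segmentation sampling).
    i, words = 0, []
    while i < len(sentence):
        lengths = list(range(1, min(max_len, len(sentence) - i) + 1))
        weights = [crp.prob(sentence[i:i + k]) for k in lengths]
        k = random.choices(lengths, weights=weights)[0]
        words.append(sentence[i:i + k])
        i += k
    return words

random.seed(0)
data = ["abcabc", "abcab"]
crp = CRP()
segmentations = {s: list(s) for s in data}   # start fully split into characters
for s in data:
    for w in segmentations[s]:
        crp.add_customer(w)

for _ in range(20):                          # training sweeps
    for s in data:
        for w in segmentations[s]:           # 1) remove this sentence's customers
            crp.remove_customer(w)
        segmentations[s] = sample_segmentation(s, crp)   # 2) sample a new segmentation
        for w in segmentations[s]:           # 3) add the new customers back
            crp.add_customer(w)

print(segmentations)
```

Each sweep touches every sentence, and many sweeps are needed before the seating arrangement stabilizes, which is why training is slow on large corpora.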

Is there any introduction document of your proposed algorithm?

I didn't propose this method; I just reproduced it out of interest. The paper is written in Japanese, but maybe there is an English version somewhere...

Can the training data and test data be the same?

It depends on the purpose, but I think it's ok because it is unsupervised learning.

How much training data and how many iterations are needed to obtain a better model with higher accuracy on the test samples?

I conducted experiments with 884,158 sentences. I don't know much about strategies for maximizing performance on a test dataset...

timeconscious commented 3 years ago

Thank you for your reply. Can your method output labels for the test samples' words, such as begin ([/B]), middle ([/M]), end ([/E]), and single-character word ([/S])? That is, besides the segmentation results, can the test samples be converted into label sequences such as B, M, E, etc.?

timeconscious commented 3 years ago

Could you give us some suggestions on how to get labels such as Begin (B), Intermediate (I), End (E), and Other (O) for the segmented words, based on your unsupervised segmentation results?

The BIEO labels are described in the paper "Shallow parsing for Hindi: an extensive analysis of sequential learning algorithms using a large annotated corpus".

timeconscious commented 3 years ago

We have a new question: which is more important for the accuracy of the trained model, the number of training iterations or the number of samples?

musyoku commented 3 years ago

How can we get labels such as Begin (B), Intermediate (I), End (E), and Other (O) for the segmented words?

Our code doesn't support tagging, because NPYLM is not a method for sequence labeling. NPYLM is just a word-level n-gram language model; unsupervised word segmentation is done by splitting the character sequence according to the learned n-gram probabilities. The NPYLM paper proposed a method for training a word-level n-gram model on unsegmented character sequences.

parse() yields a list of words, which is one of the possible segmentations of the given character sequence. I think you can try labeling methods on that list of words.

sentence_str = "レシートをわたさない会社は100%脱税している"
word_list = model.parse(sentence_str)

# `word_list` will be:
# ["レシート", "を", "わ", "た", "さ", "ない", "会社", "は", "100%", "脱税", "している"]
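For example, once you have the word list, deriving B/M/E/S character labels is a small post-processing step. A sketch (the word list is hard-coded here in place of the model output):

```python
def words_to_labels(words):
    """Map a word segmentation to per-character labels:
    B = begin, M = middle, E = end, S = single-character word."""
    labels = []
    for w in words:
        if len(w) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return labels

# Hard-coded here in place of `model.parse(sentence_str)` output:
word_list = ["レシート", "を", "会社", "は", "脱税", "している"]
labels = words_to_labels(word_list)
print(list(zip("".join(word_list), labels)))
```

Swapping "M" for "I" and adding an "O" class for non-word characters would give the BIEO scheme from the paper you mentioned.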

Which is more important for the accuracy of the trained model, the number of training iterations or the number of samples?

I think increasing the number of training samples would do more to improve segmentation accuracy.

timeconscious commented 3 years ago

We are grateful for your reply; it helps us a lot.