timeconscious opened this issue 3 years ago
I tried make install on my current MacBook (Mojave):
brew upgrade python3
brew upgrade boost-python3
$ python3 --version
Python 3.8.5
I edited the makefile as follows:
BOOST = /usr/local/Cellar/boost/1.73.0
LDFLAGS = `python3-config --ldflags --embed` -lboost_serialization -lboost_python38 -L$(BOOST)/lib
I changed the path of Boost and added the --embed option to python3-config. After that, make install worked for me:
$ make install
g++ `python3-config --includes` -std=c++14 -I/usr/local/Cellar/boost/1.73.0/include -shared -fPIC -march=native src/python.cpp src/python/*.cpp src/npylm/*.cpp src/npylm/lm/*.cpp `python3-config --ldflags --embed` -lboost_serialization -lboost_python38 -L/usr/local/Cellar/boost/1.73.0/lib -o run/npylm.so -O3
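One caveat worth noting: python3-config only accepts the --embed flag from Python 3.8 onward, so this makefile edit assumes a sufficiently new interpreter. A quick sanity check (my own sketch, not part of the repository):

```python
import sys

# python3-config gained the --embed flag in Python 3.8; on older
# interpreters the LDFLAGS line above fails with a usage error.
assert sys.version_info >= (3, 8), "python3-config --embed needs Python >= 3.8"
print("OK: python3-config --embed is supported on this interpreter")
```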
Thank you very much for your timely reply. We still have several problems.

First, make install is OK and the code outputs npylm.so. However, when we run "python3 train.py -split 0.9 -l 8 -file OUR_TEXTFILE" on Ubuntu 16.04, a new error occurs:

#train 17
Traceback (most recent call last):
  File "train.py", line 151, in <module>
RuntimeError: locale::facet::_S_create_c_locale name not valid

We have revised the locale, but it does not work. Could you give us some suggestions?

Second, can this code output labels such as "B", "O", "E"? How can we output these labels? We need these labels as well as the segmentation results.
Compiling with the -O0 -g flags will help to find the error location. In the makefile, change:

install: ## generate npylm.so
	$(CC) $(INCLUDE) $(SOFLAGS) src/python.cpp $(SOURCES) $(LDFLAGS) -o run/npylm.so -O3

↓

install: ## generate npylm.so
	$(CC) $(INCLUDE) $(SOFLAGS) src/python.cpp $(SOURCES) $(LDFLAGS) -o run/npylm.so -O0 -g

then run make install again.
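As for the locale RuntimeError itself: the usual cause (an assumption on my part, based on the libstdc++ message) is that the environment names a locale that is not installed on the machine. A minimal illustration:

```python
import locale

# libstdc++ builds std::locale("") from the LC_ALL/LANG environment
# variables; if they name a locale that is not installed, it throws
# "locale::facet::_S_create_c_locale name not valid".
# The "C" locale always exists, so exporting a valid value before
# running train.py is the usual workaround:
#   $ export LC_ALL=C.UTF-8   # or en_US.UTF-8, if installed
#   $ python3 train.py -split 0.9 -l 8 -file YOUR_TEXTFILE
print(locale.setlocale(locale.LC_ALL, "C"))  # "C" is always available
```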
Can this code output labels? Such as "B", "O", "E". How can we output these labels?
I don't understand what "label" means, but if you want to run segmentation with the trained model, you can use viterbi.py to do that:
python3 train.py --train-filename /path/to/training/textfile --working-directory /path/to/result/directory
python3 viterbi.py --input-filename /path/to/test/textfile --working-directory /path/to/result/directory --output-directory /path/to/segmentation/result/directory
OK, we will try it. Thank you for your kind reply. Could you tell us why the process of training the model costs a lot of time? Is there any introductory document for the algorithm? Could we contact you by email?
Can the training data and test data be the same? How much training data and how many iterations should be chosen to obtain a better model and higher accuracy on the test samples?
Could you tell us why the process of training the model costs a lot of time?
We estimate the model parameters (the optimal seating arrangement in the Chinese Restaurant Process) by 1) removing the customers assigned to the current segmentation, 2) sampling a new segmentation, and 3) adding the customers back to randomly selected tables. This optimization process takes a long time to converge.
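The three steps above can be sketched as follows. This is a toy illustration with hypothetical names (ToyModel, sample_segmentation), not the repository's actual API; the real sampler chooses segmentations in proportion to the learned n-gram probabilities.

```python
import random

class ToyModel:
    """Stand-in for NPYLM: tracks word counts like CRP customers."""
    def __init__(self):
        self.counts = {}

    def remove_customers(self, segmentation):
        for word in segmentation:
            self.counts[word] -= 1

    def add_customers(self, segmentation):
        for word in segmentation:
            self.counts[word] = self.counts.get(word, 0) + 1

    def sample_segmentation(self, sentence):
        # Real NPYLM samples proportionally to n-gram probabilities;
        # here we just cut the string at one random position.
        if len(sentence) < 2:
            return [sentence]
        k = random.randint(1, len(sentence) - 1)
        return [sentence[:k], sentence[k:]]

def gibbs_epoch(model, sentences, segmentations):
    # Visit sentences in random order; for each one:
    for i in random.sample(range(len(sentences)), len(sentences)):
        model.remove_customers(segmentations[i])                    # 1) remove the old analysis
        segmentations[i] = model.sample_segmentation(sentences[i])  # 2) sample a new one
        model.add_customers(segmentations[i])                       # 3) add it back
    return segmentations
```

A typical run would initialize every sentence as a single word, add those to the model, then repeat gibbs_epoch until the segmentations stop changing much; that repeated resampling is why training takes long.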
Is there any introduction document of your proposed algorithm?
I didn't propose this method; I just reproduced it out of interest. The paper is written in Japanese, but maybe there is an English version somewhere...
Can the training data and test data be the same?
It depends on the purpose, but I think it's OK because this is unsupervised learning.
How much training data and how many iterations should be chosen to obtain a better model and higher accuracy on the test samples?
I conducted experiments with 884,158 sentences. I don't know much about strategies for maximizing performance on a test dataset...
Thank you for your reply. Can your method output labels such as begin ([B]), middle ([M]), end ([E]), and single ([S]) for the words of the test samples? That is, besides the segmentation results, can the test samples be processed into the style of labels such as B, M, E, etc.?
Could you give us some suggestions on how to get labels such as Begin (B), Intermediate (I), End (E), and Other (O) for segmented words based on your unsupervised segmentation results?
The BIEO labels are described in the paper "Shallow parsing for Hindi - an extensive analysis of sequential learning algorithms using a large annotated corpus".
We have a new question. Which is more important for the accuracy of the trained model: the number of iterations or the number of samples?
How can we get labels such as Begin (B), Intermediate (I), End (E), Other (O) for segmented words?
Our code doesn't support tagging, because NPYLM is not a method for sequential labeling. NPYLM is just a word-level n-gram language model, and unsupervised word segmentation is done by splitting characters based on the learned n-gram probabilities. The NPYLM paper proposed a method to train a word-level n-gram model on unsegmented character sequences.
parse() yields a list of words, which is one of the possible segmentations of a given sequence of characters. I think you can try labeling methods on this list of words:
sentence_str = "レシートをわたさない会社は100%脱税している"
word_list = model.parse(sentence_str)
# `word_list` will be:
# ["レシート", "を", "わ", "た", "さ", "ない", "会社", "は", "100%", "脱税", "している"]
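If positional labels are what you need, they can be derived deterministically from the word list that parse() returns. A minimal sketch using the B/M/E/S convention (the mapping below is my own, not part of NPYLM):

```python
def words_to_labels(word_list):
    """Map each character of each word to a positional label:
    S = single-character word, B = begin, M = middle, E = end."""
    labels = []
    for word in word_list:
        if len(word) == 1:
            labels.append("S")
        else:
            labels.append("B")
            labels.extend("M" * (len(word) - 2))  # zero or more middle chars
            labels.append("E")
    return labels

word_list = ["レシート", "を", "会社"]
print(words_to_labels(word_list))
# → ['B', 'M', 'M', 'E', 'S', 'B', 'E']
```

The BIEO scheme mentioned earlier would work the same way, with "O" reserved for characters you decide are outside any word of interest.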
Which is more important for the accuracy of the trained model: the number of iterations or the number of samples?
I think increasing the number of training samples would improve segmentation accuracy.
We are grateful for your reply, which helps us a lot.
Does this code have a newer version? We have tried the current version several times, but it does not run successfully. Are there any specific settings for make install? We have tried this code on Mac and Ubuntu, using python3, boost_python3, and gcc 5; however, compilation fails. The following are errors from the terminal. Could you help us figure out these problems?
First, on Mac we ran make install, and the terminal shows: "ld: library not found for -libboost_python38, clang: error: linker command failed with exit code 1 (use -v to see invocation), make: *** [install] Error 1". We have tried revising the environment variables and the paths of BOOST and libboost.
Second, on another Mac, we ran make install, and the terminal shows: "ld: symbol(s) not found for architecture x86_64, clang: error: linker command failed with exit code 1 (use -v to see invocation), make: *** [install] Error 1".
Third, on Ubuntu, we ran "python3 train.py -split 0.9 -l 8 -file YOUR_TEXTFILE", and the terminal shows "Failed initializing Mecab. Please see the Readme for possible solutions" and "/usr/bin/ld: cannot find -libboost_serialization, cannot find -libboost_python35, collect2: error: ld returned 1 exit status, makefile:10: recipe for target 'install' failed".
We are looking forward to your early reply. Thank you very much.
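A note on the ld errors quoted above: with GNU ld and clang, the -l option takes the library name without its lib prefix and without the extension, so the flag for libboost_python38 is -lboost_python38, not -libboost_python38 (ld would search for "libibboost_python38", which does not exist). A small illustrative helper showing the convention (my own sketch, not from the repository):

```python
from pathlib import Path

# GNU ld resolves "-lfoo" to "libfoo.so" / "libfoo.a", so the flag
# must omit both the "lib" prefix and the file extension.
def linker_flag(library_filename):
    stem = Path(library_filename).name
    for ext in (".so", ".a", ".dylib"):
        if stem.endswith(ext):
            stem = stem[: -len(ext)]
            break
    if stem.startswith("lib"):
        stem = stem[3:]
    return "-l" + stem

print(linker_flag("libboost_python38.so"))          # → -lboost_python38
print(linker_flag("libboost_serialization.dylib"))  # → -lboost_serialization
```

If the library really is missing rather than misnamed, installing it (brew install boost-python3 on Mac, or the Boost.Python development package on Ubuntu) and pointing -L at its directory should resolve the "cannot find" errors.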