Closed Aleyasen closed 8 years ago
Hi,
Have you installed pypy on your machine? If not, please delete lines 5-7 of the script if it returns such errors.
Best,
Jialu
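An alternative to deleting lines by hand is to pick the interpreter at run time. This is a minimal sketch, assuming the script only needs the `PYPY` variable to name some Python interpreter on the PATH; the `pick_interpreter` helper is hypothetical and not part of SegPhrase.

```shell
#!/bin/sh
# Hypothetical helper: print the first candidate command found on PATH.
# Returns non-zero if none of the candidates exists.
pick_interpreter() {
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

# Prefer pypy for speed; fall back to a plain Python interpreter.
PYPY=$(pick_interpreter pypy python python3) || PYPY=""
export PYPY
echo "Using interpreter: ${PYPY:-none found}"
```

With this in place the rest of the script can keep calling `${PYPY}` unchanged, whether or not pypy is installed.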
On Wed, Jun 22, 2016 at 6:39 AM, Amirhossein Aleyasen <notifications@github.com> wrote:
I installed the requirements and then ran ./train_dblp.sh, but I got the following error. Am I supposed to do anything before running ./train_dblp.sh?
./train_dblp.sh: line 5: type: pypy: not found
# Sentences = 9790215
# tokens = 103956084
./train_dblp.sh: line 43: 13907 Killed ${PYPY} ./src/frequent_phrase_mining/main.py -thres ${SUPPORT_THRESHOLD} -o ./results/patterns.csv -raw ${RAW_TEXT}
[Warning] failed to open results/patterns.csv under parameters = r
./train_dblp.sh: line 47: 13936 Segmentation fault (core dumped) ./bin/feature_extraction tmp/sentencesWithPunc.buf results/patterns.csv ${STOPWORD_LIST} results/wordIDF.txt results/feature_table_0.csv
===Auto Label Disable===
320 labels loaded
[Warning] failed to open results/feature_table_0.csv under parameters = r
./train_dblp.sh: line 58: 13938 Segmentation fault (core dumped) ./bin/predict_quality results/feature_table_0.csv ${DATA_LABEL} results/ranking.csv outsideSentence,log_occur_feature,constant,frequency 0 TRAIN results/random_forest_0.model
# Sentences = 10576779
# Unigrams = 472557
[Warning] failed to open results/ranking.csv under parameters = r
./train_dblp.sh: line 64: 13942 Segmentation fault (core dumped) ./bin/adjust_probability tmp/sentences.buf ${OMP_NUM_THREADS} results/ranking.csv results/patterns.csv ${DISCARD_RATIO} ${MAX_ITERATION} ./results/ ${DATA_LABEL} ./results/penalty.1
[Warning] failed to open ./results/penalty.1 under parameters = r
./train_dblp.sh: line 67: 13951 Segmentation fault (core dumped) ./bin/recompute_features results/iter${MAX_ITERATION_1}_discard${DISCARD_RATIO}/length results/feature_table_0.csv results/patterns.csv tmp/sentencesWithPunc.buf results/feature_table_1.csv ./results/penalty.1 1
320 labels loaded
[Warning] failed to open results/feature_table_1.csv under parameters = r
./train_dblp.sh: line 68: 13953 Segmentation fault (core dumped) ./bin/predict_quality results/feature_table_1.csv ${DATA_LABEL} results/ranking_1.csv outsideSentence,log_occur_feature,constant,frequency 0 TRAIN results/random_forest_1.model
[Warning] failed to open results/ranking_1.csv under parameters = r
./train_dblp.sh: line 69: 13955 Segmentation fault (core dumped) ./bin/adjust_probability tmp/sentences.buf ${OMP_NUM_THREADS} results/ranking_1.csv results/patterns.csv ${DISCARD_RATIO} ${MAX_ITERATION} ./results/1. ${DATA_LABEL} ./results/penalty.2
[Warning] failed to open ./results/penalty.2 under parameters = r
./train_dblp.sh: line 72: 13960 Segmentation fault (core dumped) ./bin/build_model results/1.iter${MAX_ITERATION_1}_discard${DISCARD_RATIO}/ 6 ./results/penalty.2 results/segmentation.model
===Unigram Disable===
[Warning] failed to open results/1.iter6_discard0.00//length1.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length2.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length3.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length4.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length5.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length6.csv under parameters = r
Traceback (most recent call last):
File "src/postprocessing/filter_by_support.py", line 37, in <module>
main(sys.argv[1:])
File "src/postprocessing/filter_by_support.py", line 17, in main
for line in open(segmented_corpus_filename):
IOError: [Errno 2] No such file or directory: 'results/1.iter5_discard0.00/segmented.txt'
Thanks for the reply, @remenberl. I removed those lines, but I still get an error:
# Sentences = 9790215
# tokens = 103956084
./train_dblp.sh: line 40: 15052 Killed ${PYPY} ./src/frequent_phrase_mining/main.py -thres ${SUPPORT_THRESHOLD} -o ./results/patterns.csv -raw ${RAW_TEXT}
[Warning] failed to open results/patterns.csv under parameters = r
./train_dblp.sh: line 44: 15102 Segmentation fault (core dumped) ./bin/feature_extraction tmp/sentencesWithPunc.buf results/patterns.csv ${STOPWORD_LIST} results/wordIDF.txt results/feature_table_0.csv
===Auto Label Disable===
320 labels loaded
[Warning] failed to open results/feature_table_0.csv under parameters = r
./train_dblp.sh: line 55: 15104 Segmentation fault (core dumped) ./bin/predict_quality results/feature_table_0.csv ${DATA_LABEL} results/ranking.csv outsideSentence,log_occur_feature,constant,frequency 0 TRAIN results/random_forest_0.model
# Sentences = 10576779
# Unigrams = 472557
[Warning] failed to open results/ranking.csv under parameters = r
./train_dblp.sh: line 61: 15108 Segmentation fault (core dumped) ./bin/adjust_probability tmp/sentences.buf ${OMP_NUM_THREADS} results/ranking.csv results/patterns.csv ${DISCARD_RATIO} ${MAX_ITERATION} ./results/ ${DATA_LABEL} ./results/penalty.1
[Warning] failed to open ./results/penalty.1 under parameters = r
./train_dblp.sh: line 64: 15114 Segmentation fault (core dumped) ./bin/recompute_features results/iter${MAX_ITERATION_1}_discard${DISCARD_RATIO}/length results/feature_table_0.csv results/patterns.csv tmp/sentencesWithPunc.buf results/feature_table_1.csv ./results/penalty.1 1
320 labels loaded
[Warning] failed to open results/feature_table_1.csv under parameters = r
./train_dblp.sh: line 65: 15116 Segmentation fault (core dumped) ./bin/predict_quality results/feature_table_1.csv ${DATA_LABEL} results/ranking_1.csv outsideSentence,log_occur_feature,constant,frequency 0 TRAIN results/random_forest_1.model
[Warning] failed to open results/ranking_1.csv under parameters = r
./train_dblp.sh: line 66: 15118 Segmentation fault (core dumped) ./bin/adjust_probability tmp/sentences.buf ${OMP_NUM_THREADS} results/ranking_1.csv results/patterns.csv ${DISCARD_RATIO} ${MAX_ITERATION} ./results/1. ${DATA_LABEL} ./results/penalty.2
[Warning] failed to open ./results/penalty.2 under parameters = r
./train_dblp.sh: line 69: 15123 Segmentation fault (core dumped) ./bin/build_model results/1.iter${MAX_ITERATION_1}_discard${DISCARD_RATIO}/ 6 ./results/penalty.2 results/segmentation.model
===Unigram Disable===
[Warning] failed to open results/1.iter6_discard0.00//length1.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length2.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length3.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length4.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length5.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length6.csv under parameters = r
Traceback (most recent call last):
File "src/postprocessing/filter_by_support.py", line 37, in <module>
main(sys.argv[1:])
File "src/postprocessing/filter_by_support.py", line 17, in main
for line in open(segmented_corpus_filename):
IOError: [Errno 2] No such file or directory: 'results/1.iter5_discard0.00/segmented.txt'
This is my train_dblp.sh now:
#!/bin/bash
export PYTHON=python
export PYPY=python
RAW_TEXT='data/DBLP.txt'
AUTO_LABEL=0
WORDNET_NOUN=0
DATA_LABEL='data/DBLP.label'
KNOWLEDGE_BASE='data/wiki_labels_quality.txt'
KNOWLEDGE_BASE_LARGE='data/wiki_labels_all.txt'
STOPWORD_LIST='data/stopwords.txt'
SUPPORT_THRESHOLD=10
OMP_NUM_THREADS=4
DISCARD_RATIO=0.00
MAX_ITERATION=5
NEED_UNIGRAM=0
ALPHA=0.85
# clearance
rm -rf tmp
rm -rf results
mkdir tmp
mkdir results
if [ ! -e data/DBLP.txt ]; then
    echo ===Downloading dataset===
    wget http://dmserv4.cs.illinois.edu/DBLP.txt.gz -O data/DBLP.txt.gz
    gzip -d data/DBLP.txt.gz -f
fi
# preprocessing
./bin/from_raw_to_binary_text ${RAW_TEXT} tmp/sentencesWithPunc.buf
# frequent phrase mining for phrase candidates
${PYPY} ./src/frequent_phrase_mining/main.py -thres ${SUPPORT_THRESHOLD} -o ./results/patterns.csv -raw ${RAW_TEXT}
${PYPY} ./src/preprocessing/compute_idf.py -raw ${RAW_TEXT} -o results/wordIDF.txt
# feature extraction
./bin/feature_extraction tmp/sentencesWithPunc.buf results/patterns.csv ${STOPWORD_LIST} results/wordIDF.txt results/feature_table_0.csv
if [ ${AUTO_LABEL} -eq 1 ];
then
    echo ===Auto Label Enable===
    ${PYTHON} src/classification/auto_label_generation.py ${KNOWLEDGE_BASE} ${KNOWLEDGE_BASE_LARGE} results/feature_table_0.csv results/patterns.csv ${DATA_LABEL}
else
    echo ===Auto Label Disable===
fi
# classifier training
./bin/predict_quality results/feature_table_0.csv ${DATA_LABEL} results/ranking.csv outsideSentence,log_occur_feature,constant,frequency 0 TRAIN results/random_forest_0.model
MAX_ITERATION_1=$(expr $MAX_ITERATION + 1)
# 1-st round
./bin/from_raw_to_binary ${RAW_TEXT} tmp/sentences.buf
./bin/adjust_probability tmp/sentences.buf ${OMP_NUM_THREADS} results/ranking.csv results/patterns.csv ${DISCARD_RATIO} ${MAX_ITERATION} ./results/ ${DATA_LABEL} ./results/penalty.1
# 2-nd round
./bin/recompute_features results/iter${MAX_ITERATION_1}_discard${DISCARD_RATIO}/length results/feature_table_0.csv results/patterns.csv tmp/sentencesWithPunc.buf results/feature_table_1.csv ./results/penalty.1 1
./bin/predict_quality results/feature_table_1.csv ${DATA_LABEL} results/ranking_1.csv outsideSentence,log_occur_feature,constant,frequency 0 TRAIN results/random_forest_1.model
./bin/adjust_probability tmp/sentences.buf ${OMP_NUM_THREADS} results/ranking_1.csv results/patterns.csv ${DISCARD_RATIO} ${MAX_ITERATION} ./results/1. ${DATA_LABEL} ./results/penalty.2
# phrase list & segmentation model
./bin/build_model results/1.iter${MAX_ITERATION_1}_discard${DISCARD_RATIO}/ 6 ./results/penalty.2 results/segmentation.model
if [ ${NEED_UNIGRAM} -eq 1 ];
then
    echo ===Unigram Enable===
    # unigrams
    normalize_text() {
        awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
        -e 's/"/ " /g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
        -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
        -e 's/«/ /g' | tr 0-9 " "
    }
    normalize_text < results/1.iter${MAX_ITERATION}_discard${DISCARD_RATIO}/segmented.txt > tmp/normalized.txt
    cd word2vec_tool
    make
    cd ..
    ./word2vec_tool/word2vec -train tmp/normalized.txt -output ./results/vectors.bin -cbow 2 -size 300 -window 6 -negative 25 -hs 0 -sample 1e-4 -threads ${OMP_NUM_THREADS} -binary 1 -iter 15
    time ./bin/generateNN results/vectors.bin results/1.iter${MAX_ITERATION_1}_discard${DISCARD_RATIO}/ 30 3 results/u2p_nn.txt results/w2w_nn.txt
    ./bin/qualify_unigrams results/vectors.bin results/1.iter${MAX_ITERATION_1}_discard${DISCARD_RATIO}/ results/u2p_nn.txt results/w2w_nn.txt ${ALPHA} results/unified.csv 100 ${STOPWORD_LIST}
else
    echo ===Unigram Disable===
    ./bin/combine_phrases results/1.iter${MAX_ITERATION_1}_discard${DISCARD_RATIO}/ results/unified.csv
fi
${PYPY} src/postprocessing/filter_by_support.py results/unified.csv results/1.iter${MAX_ITERATION}_discard${DISCARD_RATIO}/segmented.txt ${SUPPORT_THRESHOLD} results/salient.csv
if [ ${WORDNET_NOUN} -eq 1 ];
then
    ${PYPY} src/postprocessing/clean_list_with_wordnet.py -input results/salient.csv -output results/salient.csv
fi
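In the log above, every step after the killed phrase-mining step fails only because its input file was never written. Here is a minimal sketch of a fail-fast wrapper that stops the pipeline at the first failing step; `run_step` is a hypothetical helper and not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: run one pipeline step and report a failure on stderr.
# Returns the step's own exit status so the caller can stop the pipeline.
run_step() {
  "$@"
  status=$?
  if [ "$status" -ne 0 ]; then
    echo "Step failed (exit ${status}): $*" >&2
  fi
  return "$status"
}

# Chain steps with && so a later step never runs on missing input.
run_step true && run_step echo "next step runs only after success"
```

Wrapping each `./bin/...` and `${PYPY} ...` call this way would leave a single error for the step that actually failed, without the cascade of segmentation faults.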
Is it because your memory is too small? Try increasing SUPPORT_THRESHOLD if your memory is small.
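The `Killed` line in the log is what the shell prints when a process receives SIGKILL. On Linux that is often the out-of-memory killer at work, though the log itself does not confirm it. A rough sketch for telling the failure modes apart by exit status follows; `describe_exit` is a hypothetical helper, and the 128 + signal number convention is the usual shell behavior.

```shell
#!/bin/sh
# Hypothetical helper: translate a shell exit status into a short description.
# 137 = 128 + 9 (SIGKILL), 139 = 128 + 11 (SIGSEGV).
describe_exit() {
  case "$1" in
    0)   echo "ok" ;;
    137) echo "killed by SIGKILL (possibly out of memory)" ;;
    139) echo "segmentation fault (SIGSEGV)" ;;
    *)   echo "failed with exit status $1" ;;
  esac
}

# Demonstration: a child shell that sends SIGKILL to itself.
status=0
sh -c 'kill -KILL $$' || status=$?
describe_exit "$status"
```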
Thanks for the reply, @remenberl. How much memory is sufficient for the default value of SUPPORT_THRESHOLD?
First try 100? Another idea is to sample 10% of the lines in the input file.
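The sampling idea can be sketched with awk. This assumes one sentence or document per line in the input file; the `sample_lines` helper, the seed argument, and the demo file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: keep roughly the given fraction of the input lines.
# Arguments: input file, output file, fraction (default 0.1), seed (default 1).
sample_lines() {
  awk -v frac="${3:-0.1}" -v seed="${4:-1}" \
    'BEGIN { srand(seed) } rand() < frac' "$1" > "$2"
}

# Demonstration on a small generated file. For the real corpus one would
# pass data/DBLP.txt and point RAW_TEXT at the sampled output instead.
demo_dir=$(mktemp -d)
seq 1 1000 > "$demo_dir/input.txt"
sample_lines "$demo_dir/input.txt" "$demo_dir/sampled.txt" 0.1 42
wc -l < "$demo_dir/sampled.txt"
```

Each line is kept independently with probability 0.1, so the output size is only approximately 10% of the input, and the exact lines kept depend on the awk implementation and the seed.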
You mean 100G memory?!
16 GB.
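A quick way to compare a machine against that figure is to read `MemTotal` from `/proc/meminfo`. This sketch is Linux-only, and the helper names are hypothetical. A machine sold as 16 GB usually reports slightly less than 16 GiB, so the demo checks against 15.

```shell
#!/bin/sh
# Hypothetical helper: print total memory in kB from a meminfo-style file.
mem_total_kb() {
  awk '/^MemTotal:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Hypothetical helper: succeed if total memory is at least $1 GiB.
# $2 optionally names a meminfo-style file (default /proc/meminfo).
has_enough_memory() {
  required_kb=$(( $1 * 1024 * 1024 ))
  total_kb=$(mem_total_kb "${2:-/proc/meminfo}")
  [ -n "$total_kb" ] && [ "$total_kb" -ge "$required_kb" ]
}

if [ -r /proc/meminfo ]; then
  if has_enough_memory 15; then
    echo "Memory looks sufficient for the default settings"
  else
    echo "Consider raising SUPPORT_THRESHOLD or sampling the input"
  fi
fi
```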
Thanks, it works now.
I left just one line in my dblp5k.txt, but I still get the same error: failed to open results/patterns.csv under parameters = r