shangjingbo1226 / SegPhrase


Error in train_dblp.sh #6

Closed Aleyasen closed 8 years ago

Aleyasen commented 8 years ago

I installed the requirements and ran ./train_dblp.sh, but I got the following error. Am I supposed to do anything else before running ./train_dblp.sh?

./train_dblp.sh: line 5: type: pypy: not found
# Sentences = 9790215
# tokens =  103956084
./train_dblp.sh: line 43: 13907 Killed                  ${PYPY} ./src/frequent_phrase_mining/main.py -thres ${SUPPORT_THRESHOLD} -o ./results/patterns.csv -raw ${RAW_TEXT}
[Warning] failed to open results/patterns.csv under parameters = r
./train_dblp.sh: line 47: 13936 Segmentation fault      (core dumped) ./bin/feature_extraction tmp/sentencesWithPunc.buf results/patterns.csv ${STOPWORD_LIST} results/wordIDF.txt results/feature_table_0.csv
===Auto Label Disable===
320 labels loaded
[Warning] failed to open results/feature_table_0.csv under parameters = r
./train_dblp.sh: line 58: 13938 Segmentation fault      (core dumped) ./bin/predict_quality results/feature_table_0.csv ${DATA_LABEL} results/ranking.csv outsideSentence,log_occur_feature,constant,frequency 0 TRAIN results/random_forest_0.model
# Sentences = 10576779
# Unigrams = 472557
[Warning] failed to open results/ranking.csv under parameters = r
./train_dblp.sh: line 64: 13942 Segmentation fault      (core dumped) ./bin/adjust_probability tmp/sentences.buf ${OMP_NUM_THREADS} results/ranking.csv results/patterns.csv ${DISCARD_RATIO} ${MAX_ITERATION} ./results/ ${DATA_LABEL} ./results/penalty.1
[Warning] failed to open ./results/penalty.1 under parameters = r
./train_dblp.sh: line 67: 13951 Segmentation fault      (core dumped) ./bin/recompute_features results/iter${MAX_ITERATION_1}_discard${DISCARD_RATIO}/length results/feature_table_0.csv results/patterns.csv tmp/sentencesWithPunc.buf results/feature_table_1.csv ./results/penalty.1 1
320 labels loaded
[Warning] failed to open results/feature_table_1.csv under parameters = r
./train_dblp.sh: line 68: 13953 Segmentation fault      (core dumped) ./bin/predict_quality results/feature_table_1.csv ${DATA_LABEL} results/ranking_1.csv outsideSentence,log_occur_feature,constant,frequency 0 TRAIN results/random_forest_1.model
[Warning] failed to open results/ranking_1.csv under parameters = r
./train_dblp.sh: line 69: 13955 Segmentation fault      (core dumped) ./bin/adjust_probability tmp/sentences.buf ${OMP_NUM_THREADS} results/ranking_1.csv results/patterns.csv ${DISCARD_RATIO} ${MAX_ITERATION} ./results/1. ${DATA_LABEL} ./results/penalty.2
[Warning] failed to open ./results/penalty.2 under parameters = r
./train_dblp.sh: line 72: 13960 Segmentation fault      (core dumped) ./bin/build_model results/1.iter${MAX_ITERATION_1}_discard${DISCARD_RATIO}/ 6 ./results/penalty.2 results/segmentation.model
===Unigram Disable===
[Warning] failed to open results/1.iter6_discard0.00//length1.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length2.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length3.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length4.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length5.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length6.csv under parameters = r
Traceback (most recent call last):
  File "src/postprocessing/filter_by_support.py", line 37, in <module>
    main(sys.argv[1:])
  File "src/postprocessing/filter_by_support.py", line 17, in main
    for line in open(segmented_corpus_filename):
IOError: [Errno 2] No such file or directory: 'results/1.iter5_discard0.00/segmented.txt'
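
From the log, the root failure seems to be the very first step: the frequent phrase mining process is Killed, so results/patterns.csv is never written and every later binary fails on the missing file. Assuming the kernel's OOM killer is what terminated it (this is a Linux machine; the exact wording varies by kernel), one way to check:

# look for an OOM kill around the time the script died
dmesg | grep -iE 'killed process|out of memory'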
remenberl commented 8 years ago

Hi,

Have you installed pypy on your machine? If not, please delete lines 5-7, since they are what produces such errors.
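
For reference, here is a minimal sketch of what those lines presumably look like, judging from the "type: pypy: not found" message (a reconstruction, not the exact shipped script): they probe for pypy and pick the interpreter accordingly.

# hypothetical lines 5-7: prefer pypy when it is installed
if type pypy > /dev/null; then
    export PYPY=pypy
else
    export PYPY=python
fi

Deleting them and exporting PYPY=python yourself has the same effect as the fallback branch.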

Best,

Jialu


Aleyasen commented 8 years ago

Thanks for the reply @remenberl. I removed those lines, but I still get an error:

# Sentences = 9790215
# tokens =  103956084
./train_dblp.sh: line 40: 15052 Killed                  ${PYPY} ./src/frequent_phrase_mining/main.py -thres ${SUPPORT_THRESHOLD} -o ./results/patterns.csv -raw ${RAW_TEXT}
[Warning] failed to open results/patterns.csv under parameters = r
./train_dblp.sh: line 44: 15102 Segmentation fault      (core dumped) ./bin/feature_extraction tmp/sentencesWithPunc.buf results/patterns.csv ${STOPWORD_LIST} results/wordIDF.txt results/feature_table_0.csv
===Auto Label Disable===
320 labels loaded
[Warning] failed to open results/feature_table_0.csv under parameters = r
./train_dblp.sh: line 55: 15104 Segmentation fault      (core dumped) ./bin/predict_quality results/feature_table_0.csv ${DATA_LABEL} results/ranking.csv outsideSentence,log_occur_feature,constant,frequency 0 TRAIN results/random_forest_0.model
# Sentences = 10576779
# Unigrams = 472557
[Warning] failed to open results/ranking.csv under parameters = r
./train_dblp.sh: line 61: 15108 Segmentation fault      (core dumped) ./bin/adjust_probability tmp/sentences.buf ${OMP_NUM_THREADS} results/ranking.csv results/patterns.csv ${DISCARD_RATIO} ${MAX_ITERATION} ./results/ ${DATA_LABEL} ./results/penalty.1
[Warning] failed to open ./results/penalty.1 under parameters = r
./train_dblp.sh: line 64: 15114 Segmentation fault      (core dumped) ./bin/recompute_features results/iter${MAX_ITERATION_1}_discard${DISCARD_RATIO}/length results/feature_table_0.csv results/patterns.csv tmp/sentencesWithPunc.buf results/feature_table_1.csv ./results/penalty.1 1
320 labels loaded
[Warning] failed to open results/feature_table_1.csv under parameters = r
./train_dblp.sh: line 65: 15116 Segmentation fault      (core dumped) ./bin/predict_quality results/feature_table_1.csv ${DATA_LABEL} results/ranking_1.csv outsideSentence,log_occur_feature,constant,frequency 0 TRAIN results/random_forest_1.model
[Warning] failed to open results/ranking_1.csv under parameters = r
./train_dblp.sh: line 66: 15118 Segmentation fault      (core dumped) ./bin/adjust_probability tmp/sentences.buf ${OMP_NUM_THREADS} results/ranking_1.csv results/patterns.csv ${DISCARD_RATIO} ${MAX_ITERATION} ./results/1. ${DATA_LABEL} ./results/penalty.2
[Warning] failed to open ./results/penalty.2 under parameters = r
./train_dblp.sh: line 69: 15123 Segmentation fault      (core dumped) ./bin/build_model results/1.iter${MAX_ITERATION_1}_discard${DISCARD_RATIO}/ 6 ./results/penalty.2 results/segmentation.model
===Unigram Disable===
[Warning] failed to open results/1.iter6_discard0.00//length1.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length2.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length3.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length4.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length5.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length6.csv under parameters = r
Traceback (most recent call last):
  File "src/postprocessing/filter_by_support.py", line 37, in <module>
    main(sys.argv[1:])
  File "src/postprocessing/filter_by_support.py", line 17, in main
    for line in open(segmented_corpus_filename):
IOError: [Errno 2] No such file or directory: 'results/1.iter5_discard0.00/segmented.txt'

This is my train_dblp.sh now:

#!/bin/bash

export PYTHON=python
export PYPY=python

RAW_TEXT='data/DBLP.txt'
AUTO_LABEL=0
WORDNET_NOUN=0
DATA_LABEL='data/DBLP.label'
KNOWLEDGE_BASE='data/wiki_labels_quality.txt'
KNOWLEDGE_BASE_LARGE='data/wiki_labels_all.txt'

STOPWORD_LIST='data/stopwords.txt'
SUPPORT_THRESHOLD=10

OMP_NUM_THREADS=4
DISCARD_RATIO=0.00
MAX_ITERATION=5

NEED_UNIGRAM=0
ALPHA=0.85

# clearance
rm -rf tmp
rm -rf results

mkdir tmp
mkdir results

if [ ! -e data/DBLP.txt ]; then
    echo ===Downloading dataset=== 
    wget http://dmserv4.cs.illinois.edu/DBLP.txt.gz -O data/DBLP.txt.gz
    gzip -d data/DBLP.txt.gz -f
fi

# preprocessing
./bin/from_raw_to_binary_text ${RAW_TEXT} tmp/sentencesWithPunc.buf

# frequent phrase mining for phrase candidates
${PYPY} ./src/frequent_phrase_mining/main.py -thres ${SUPPORT_THRESHOLD} -o ./results/patterns.csv -raw ${RAW_TEXT}
${PYPY} ./src/preprocessing/compute_idf.py -raw ${RAW_TEXT} -o results/wordIDF.txt

# feature extraction
./bin/feature_extraction tmp/sentencesWithPunc.buf results/patterns.csv ${STOPWORD_LIST} results/wordIDF.txt results/feature_table_0.csv

if [ ${AUTO_LABEL} -eq 1 ];
then
    echo ===Auto Label Enable===
    ${PYTHON} src/classification/auto_label_generation.py ${KNOWLEDGE_BASE} ${KNOWLEDGE_BASE_LARGE} results/feature_table_0.csv results/patterns.csv ${DATA_LABEL}
else
    echo ===Auto Label Disable===
fi

# classifier training
./bin/predict_quality results/feature_table_0.csv ${DATA_LABEL} results/ranking.csv outsideSentence,log_occur_feature,constant,frequency 0 TRAIN results/random_forest_0.model

MAX_ITERATION_1=$(expr $MAX_ITERATION + 1)

# 1-st round
./bin/from_raw_to_binary ${RAW_TEXT} tmp/sentences.buf
./bin/adjust_probability tmp/sentences.buf ${OMP_NUM_THREADS} results/ranking.csv results/patterns.csv ${DISCARD_RATIO} ${MAX_ITERATION} ./results/ ${DATA_LABEL} ./results/penalty.1

# 2-nd round
./bin/recompute_features results/iter${MAX_ITERATION_1}_discard${DISCARD_RATIO}/length results/feature_table_0.csv results/patterns.csv tmp/sentencesWithPunc.buf results/feature_table_1.csv ./results/penalty.1 1
./bin/predict_quality results/feature_table_1.csv ${DATA_LABEL} results/ranking_1.csv outsideSentence,log_occur_feature,constant,frequency 0 TRAIN results/random_forest_1.model
./bin/adjust_probability tmp/sentences.buf ${OMP_NUM_THREADS} results/ranking_1.csv results/patterns.csv ${DISCARD_RATIO} ${MAX_ITERATION} ./results/1. ${DATA_LABEL} ./results/penalty.2

# phrase list & segmentation model
./bin/build_model results/1.iter${MAX_ITERATION_1}_discard${DISCARD_RATIO}/ 6 ./results/penalty.2 results/segmentation.model

if [ ${NEED_UNIGRAM} -eq 1 ];
then
    echo ===Unigram Enable===
    # unigrams
    normalize_text() {
      awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
      -e 's/"/ " /g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
      -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
      -e 's/«/ /g' | tr 0-9 " "
    }
    normalize_text < results/1.iter${MAX_ITERATION}_discard${DISCARD_RATIO}/segmented.txt > tmp/normalized.txt

    cd word2vec_tool
    make
    cd ..
    ./word2vec_tool/word2vec -train tmp/normalized.txt -output ./results/vectors.bin -cbow 2 -size 300 -window 6 -negative 25 -hs 0 -sample 1e-4 -threads ${OMP_NUM_THREADS} -binary 1 -iter 15
    time ./bin/generateNN results/vectors.bin results/1.iter${MAX_ITERATION_1}_discard${DISCARD_RATIO}/ 30 3 results/u2p_nn.txt results/w2w_nn.txt
    ./bin/qualify_unigrams results/vectors.bin results/1.iter${MAX_ITERATION_1}_discard${DISCARD_RATIO}/ results/u2p_nn.txt results/w2w_nn.txt ${ALPHA} results/unified.csv 100 ${STOPWORD_LIST}
else
    echo ===Unigram Disable===
    ./bin/combine_phrases results/1.iter${MAX_ITERATION_1}_discard${DISCARD_RATIO}/ results/unified.csv
fi

${PYPY} src/postprocessing/filter_by_support.py results/unified.csv results/1.iter${MAX_ITERATION}_discard${DISCARD_RATIO}/segmented.txt ${SUPPORT_THRESHOLD} results/salient.csv 

if [ ${WORDNET_NOUN} -eq 1 ];
then
    ${PYPY} src/postprocessing/clean_list_with_wordnet.py -input results/salient.csv -output results/salient.csv 
fi
remenberl commented 8 years ago

Is it because your machine has too little memory? If so, try increasing SUPPORT_THRESHOLD.

Aleyasen commented 8 years ago

Thanks for the reply @remenberl. How much memory is sufficient for the default value of SUPPORT_THRESHOLD?

remenberl commented 8 years ago

First, try 100. Another idea is to sample 10% of the lines in the input file.
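
A minimal sketch of both ideas, assuming the default data/DBLP.txt input (the sample filename is just an example):

# option 1: raise the frequency threshold in train_dblp.sh
SUPPORT_THRESHOLD=100

# option 2: keep every 10th line (roughly 10% of the corpus)
awk 'NR % 10 == 0' data/DBLP.txt > data/DBLP_sample.txt
# then set RAW_TEXT='data/DBLP_sample.txt' in train_dblp.sh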


Aleyasen commented 8 years ago

You mean 100 GB of memory?!


remenberl commented 8 years ago

16 GB.
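
A quick way to check what your machine actually has, assuming Linux:

# total/used/free memory, human-readable
free -h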


Aleyasen commented 8 years ago

Thanks, it works now.

hananeYS commented 3 years ago

I left just one line in my dblp5k.txt, but I still get the same error: [Warning] failed to open results/patterns.csv under parameters = r