Closed. palansuya closed this issue 8 years ago.
Thanks for the report. Could you please provide us with a link to your dataset? A small subset should be enough if you can reproduce the same error with it.
The DBLP.label file was generated manually. We iteratively ran train_dblp.sh and added the top-ranked prediction errors from unified.csv into the file. That said, the performance with the auto-label feature won't be much worse.
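For what it's worth, here is a minimal sketch of that manual loop. It assumes unified.csv lists candidate phrases ranked best-first with the phrase in the first column, and that DBLP.label keeps one labeled phrase per line; the paths and column layout below are placeholders and may differ from the actual files:

```python
# Hypothetical sketch of the manual labeling loop described above.
# Assumptions: unified.csv is ranked best-first with the phrase in the first
# column; DBLP.label stores one labeled phrase per line (tab-separated).
import csv

with open('data/DBLP.label') as f:                 # phrases labeled so far
    labeled = {line.split('\t')[0].strip() for line in f if line.strip()}

with open('results/unified.csv') as f:             # output of the last train_dblp.sh run
    reader = csv.reader(f)
    candidates = [row[0] for row in reader if row and row[0] not in labeled]

# Review the top of the ranking by eye; clearly wrong predictions get
# appended to DBLP.label by hand before the next train_dblp.sh run.
for phrase in candidates[:50]:
    print(phrase)
```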
On Thu, Feb 25, 2016 at 11:21 AM, palansuya notifications@github.com wrote:
I am trying to run SegPhrase on a dataset of academic papers on bioengineering. Each line of the data is one paper. I changed the train_toy.sh file's RAW_TEXT to point to my new dataset and kept the rest the same.
However, when I run the training, it fails with the following messages.
Sentences = 79178
tokens = 532593
# of distinct tokens = 12633
# of frequent pattern = 11862
feature extraction done.
===Auto Label Enable===
Traceback (most recent call last):
  File "src/classification/auto_label_generation.py", line 112, in <module>
    kmeans.fit(matrixOther)
  File "/home/apps/python/python-2.7.9/lib/python2.7/site-packages/sklearn/cluster/kmeans.py", line 1235, in fit
    X = check_array(X, accept_sparse="csr", order='C', dtype=np.float64)
  File "/home/apps/python/python-2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 398, in check_array
    _assert_all_finite(array)
  File "/home/apps/python/python-2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 54, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
[Warning] failed to open data/wiki.label.auto under parameters = r
./train_ChipSeq.sh: line 52: 46775 Segmentation fault ./bin/predict_quality results/feature_table_0.csv ${DATA_LABEL} results/ranking.csv outsideSentence,log_occur_feature,constant,frequency 0 TRAIN results/random_forest_0.model
Sentences = 88897
Unigrams = 15865
[Warning] failed to open results/ranking.csv under parameters = r
./train_ChipSeq.sh: line 58: 46778 Segmentation fault ./bin/adjust_probability tmp/sentences.buf ${OMP_NUM_THREADS} results/ranking.csv results/patterns.csv ${DISCARD_RATIO} ${MAX_ITERATION} ./results/ ${DATA_LABEL} ./results/penalty.1
[Warning] failed to open ./results/penalty.1 under parameters = r
./train_ChipSeq.sh: line 61: 46782 Segmentation fault ./bin/recompute_features results/iter${MAX_ITERATION_1}_discard${DISCARD_RATIO}/length results/feature_table_0.csv results/patterns.csv tmp/sentencesWithPunc.buf results/feature_table_1.csv ./results/penalty.1 1
[Warning] failed to open data/wiki.label.auto under parameters = r
./train_ChipSeq.sh: line 62: 46783 Segmentation fault ./bin/predict_quality results/feature_table_1.csv ${DATA_LABEL} results/ranking_1.csv outsideSentence,log_occur_feature,constant,frequency 0 TRAIN results/random_forest_1.model
[Warning] failed to open results/ranking_1.csv under parameters = r
./train_ChipSeq.sh: line 63: 46784 Segmentation fault ./bin/adjust_probability tmp/sentences.buf ${OMP_NUM_THREADS} results/ranking_1.csv results/patterns.csv ${DISCARD_RATIO} ${MAX_ITERATION} ./results/1. ${DATA_LABEL} ./results/penalty.2
[Warning] failed to open ./results/penalty.2 under parameters = r
./train_ChipSeq.sh: line 66: 46788 Segmentation fault ./bin/build_model results/1.iter${MAX_ITERATION_1}_discard${DISCARD_RATIO}/ 6 ./results/penalty.2 results/segmentation.model
===Unigram Disable===
[Warning] failed to open results/1.iter6_discard0.00//length1.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length2.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length3.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length4.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length5.csv under parameters = r
[Warning] failed to open results/1.iter6_discard0.00//length6.csv under parameters = r
Traceback (most recent call last):
  File "src/postprocessing/filter_by_support.py", line 37, in <module>
    main(sys.argv[1:])
  File "src/postprocessing/filter_by_support.py", line 17, in main
    for line in open(segmented_corpus_filename):
IOError: [Errno 2] No such file or directory: 'results/1.iter5_discard0.00/segmented.txt'
I speculate that maybe each data item, i.e. each paper, is too long to parse?
What do you suggest I do in order to use SegPhrase on this dataset?
Also, may I ask: for train_dblp.sh, how did you collect or obtain DBLP.label?
I figure it would produce higher-quality phrases when labels are provided rather than relying on the Auto Label feature.
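Regarding the speculation above that a whole paper per line might be too long to parse: a quick way to check is to look at the per-line token counts of the raw text. 'papers.txt' below is just a placeholder for whatever file RAW_TEXT points to:

```python
# Rough per-line (per-paper) length check for the raw corpus.
# 'papers.txt' is a placeholder for the file assigned to RAW_TEXT in train_toy.sh.
lengths = []
with open('papers.txt') as f:
    for line in f:
        lengths.append(len(line.split()))

lengths.sort()
print('papers:', len(lengths))
print('median tokens per paper:', lengths[len(lengths) // 2])
print('longest paper (tokens):', lengths[-1])
```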
https://app.box.com/s/6f8r9eift3ws3b7mqli291pc4nvjwq1d
It is not too large compared to the DBLP dataset.
The bug is fixed in the most recent update of SegPhrase. It happened because some phrase candidates appear in all documents, so their IDFs become 0, and one feature uses IDF in the denominator, which produces the non-finite values above.
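To make the mechanism concrete (this is an illustration, not the actual SegPhrase code): a candidate that occurs in every document has idf = log(N/df) = log(1) = 0, so any feature that divides by idf becomes infinite, which is exactly what sklearn's check_array rejects before KMeans can fit. A hedged sketch of the failure and one possible guard:

```python
import numpy as np
from sklearn.cluster import KMeans

N = 1000                       # number of documents (made-up numbers)
df = np.array([1000, 120, 3])  # document frequencies; the first candidate appears everywhere
idf = np.log(N / df)           # idf of the first candidate is log(1) = 0

some_score = np.array([0.7, 0.4, 0.9])
feature = some_score / idf     # division by zero -> inf for the first candidate

X = feature.reshape(-1, 1)
# KMeans(n_clusters=2).fit(X) would raise:
# ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

# One simple guard (illustrative, not necessarily the fix used in the repo):
# smooth the idf so it can never be exactly zero, or drop non-finite rows.
idf_smoothed = np.log((N + 1) / (df + 1)) + 1.0
X_safe = (some_score / idf_smoothed).reshape(-1, 1)
KMeans(n_clusters=2, n_init=10).fit(X_safe)
```

The smoothing above is just one way to keep the feature matrix finite; the actual fix in the repository presumably handles the zero-IDF feature directly.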