kaldi-asr/kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

segment_long_utterances.sh failing on decode_segmentation #1629

Open nizmagu opened 7 years ago

nizmagu commented 7 years ago

I was trying to use segment_long_utterances.sh on six 5-hour-long files. Upon reaching stage 4, I got the following message:

steps/cleanup/decode_segmentation.sh --beam 15.0 --lattice-beam 1.0 --nj 6 --cmd run.pl --mem 4G --skip-scoring true --allow-partial false exp/segment_train_long/graphs_uniform_seg exp/segment_train_long/train_long_uniform_seg exp/segment_train_long/lats
filter_scps.pl: warning: some input lines were output to multiple files [OK if splitting per utt] 
steps/cleanup/decode_segmentation.sh: feature type is lda
run.pl: 6 / 6 failed, log is in exp/segment_train_long/lats/log/decode.*.log

When I inspected the log files, I found that decode_segmentation gave this error:

ERROR (gmm-latgen-faster[5.1.92-a7e61]:FindKeyInternal():util/kaldi-table-inl.h:2122) You provided the "cs" option but are not calling with keys in sorted order: 2010_07_19_9050-1000000-1003000 < 2010_07_19_9050-585000-588000: rspecifier is ark,s,cs:apply-cmvn  --utt2spk=ark:exp/segment_train_long/train_long_uniform_seg/split6/2/utt2spk scp:exp/segment_train_long/train_long_uniform_seg/split6/2/cmvn.scp scp:exp/segment_train_long/train_long_uniform_seg/split6/2/feats.scp ark:- | splice-feats --left-context=3 --right-context=3 ark:- ark:- | transform-feats exp/segment_train_long/final.mat ark:- ark:- |

[ Stack-Trace: ]

kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::RandomAccessTableReaderDSortedArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::FindKeyInternal(std::string const&)
kaldi::RandomAccessTableReaderDSortedArchiveImpl<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::HasKey(std::string const&)
kaldi::RandomAccessTableReader<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::HasKey(std::string const&)
main
__libc_start_main
gmm-latgen-faster() [0x4639d9]

# Accounting: time=7958 threads=1
# Ended (code 255) at Thu May 18 15:16:07 IDT 2017, elapsed time 7958 seconds

I ran validate_data_dir.sh and it says the files are in sorted order. I suspected it might have something to do with the locale, so I set export LANG= and export LC_ALL=C and re-checked with sort -c, to no avail.
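For reference, the check I ran was roughly the following (the paths are from my setup):

```shell
# Kaldi tables compare keys as raw byte strings, so sortedness must be
# checked under the C locale; in another locale, sort -c can accept a
# file whose keys Kaldi still rejects as unsorted.
export LC_ALL=C
for f in exp/segment_train_long/train_long_uniform_seg/split6/*/feats.scp; do
    if sort -c -k1,1 "$f"; then
        echo "$f: keys in C-locale order"
    fi
done
```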

How can I fix this issue?

danpovey commented 7 years ago

This looks like an error in the script, not a user error; hopefully Vimal will fix it today.

vimalmanohar commented 7 years ago

@nizmagu Can you check if this solves the problem?

nizmagu commented 7 years ago

This solves the decode problem; however, a new problem has come up.

The script crashed at stage 9 with the following error: run.pl: 6 / 6 failed, log is in exp/segment_train_long/lats/log/retrieve_similar_docs.*.log

Here is a sample log file:

# steps/cleanup/internal/retrieve_similar_docs.py --query-tfidf=exp/segment_train_long/query_docs/split6/query_tf_idf.1.ark.txt --source-text2tfidf-file=exp/segment_train_long/docs/source2tf_idf.scp --source-text-id2doc-ids=exp/segment_train_long/docs/text2doc --query-id2source-text-id=exp/segment_train_long/new2orig_utt --num-neighbors-to-search=1 --neighbor-tfidf-threshold=0.5 --relevant-docs=exp/segment_train_long/query_docs/split6/relevant_docs.1.txt 
# Started at Mon May 22 14:07:50 IDT 2017
#
usage: retrieve_similar_docs.py [-h] [--verbose {0,1,2,3}]
                                [--num-neighbors-to-search NUM_NEIGHBORS_TO_SEARCH]
                                [--neighbor-tfidf-threshold NEIGHBOR_TFIDF_THRESHOLD]
                                [--partial-doc-fraction PARTIAL_DOC_FRACTION]
                                --source-text-id2doc-ids
                                SOURCE_TEXT_ID2DOC_IDS
                                --query-id2source-text-id
                                QUERY_ID2SOURCE_TEXT_ID --source-text-id2tfidf
                                SOURCE_TEXT_ID2TFIDF --query-tfidf QUERY_TFIDF
                                --relevant-docs RELEVANT_DOCS
retrieve_similar_docs.py: error: argument --source-text-id2tfidf is required
# Accounting: time=0 threads=1
# Ended (code 2) at Mon May 22 14:07:50 IDT 2017, elapsed time 0 seconds

I tried to change --source-text2tfidf-file to --source-text-id2tfidf and this was the result:

# steps/cleanup/internal/retrieve_similar_docs.py --query-tfidf=exp/segment_train_long/query_docs/split6/query_tf_idf.1.ark.txt --source-text-id2tfidf=exp/segment_train_long/docs/source2tf_idf.scp --source-text-id2doc-ids=exp/segment_train_long/docs/text2doc --query-id2source-text-id=exp/segment_train_long/new2orig_utt --num-neighbors-to-search=1 --neighbor-tfidf-threshold=0.5 --relevant-docs=exp/segment_train_long/query_docs/split6/relevant_docs.1.txt 
# Started at Mon May 22 14:11:06 IDT 2017
#
2017-05-22 14:11:06,790 [retrieve_similar_docs.py:336 - run - INFO ] Retrieved similar documents for 0 queries
Traceback (most recent call last):
  File "steps/cleanup/internal/retrieve_similar_docs.py", line 353, in <module>
    main()
  File "steps/cleanup/internal/retrieve_similar_docs.py", line 348, in main
    args.relevant_docs, args.query_tfidf, args.source_tfidf]:
AttributeError: 'Namespace' object has no attribute 'source_tfidf'
# Accounting: time=0 threads=1
# Ended (code 1) at Mon May 22 14:11:06 IDT 2017, elapsed time 0 seconds

args.source_tfidf appeared to be referenced only once (in the check near the end of main), so I changed it as well, to args.source_text_id2tfidf.
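Concretely, the stopgap edit I made was equivalent to this (a workaround only, not a proper fix):

```shell
# Stopgap: rename the stale attribute reference to match the renamed
# --source-text-id2tfidf option (argparse stores it with underscores).
sed -i 's/args\.source_tfidf/args.source_text_id2tfidf/' \
    steps/cleanup/internal/retrieve_similar_docs.py
```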

Then the script crashed again: run.pl: 6 / 6 failed, log is in exp/segment_train_long/lats/log/get_ctm_edits.*.log

Here is the log file:

# steps/cleanup/internal/stitch_documents.py --query2docs=exp/segment_train_long/query_docs/split6/relevant_docs.1.txt --input-documents=exp/segment_train_long/docs/split6/1/docs.txt --output-documents=- | steps/cleanup/internal/align_ctm_ref.py --eps-symbol="<eps>" --oov-word='<UNK>' --symbol-table=data/lang/words.txt --hyp-format=CTM --align-full-hyp=false --hyp=exp/segment_train_long/lats/score_10/train_long_uniform_seg.ctm.1 --ref=- --output=exp/segment_train_long/lats/score_10/train_long_uniform_seg.ctm_edits.1 
# Started at Mon May 22 14:15:01 IDT 2017
#
Traceback (most recent call last):
  File "steps/cleanup/internal/align_ctm_ref.py", line 615, in <module>
    main()
  File "steps/cleanup/internal/align_ctm_ref.py", line 598, in main
    args = get_args()
  File "steps/cleanup/internal/align_ctm_ref.py", line 103, in get_args
    "--reco2file-and-channel must be provided for "
RuntimeError: --reco2file-and-channel must be provided for hyp-format=CTM
usage: stitch_documents.py [-h] --query2docs QUERY2DOCS --input-documents
                           INPUT_DOCUMENTS --output-documents OUTPUT_DOCUMENTS
                           [--check-sorted-docs-per-query {true,false}]
stitch_documents.py: error: argument --input-documents: can't open 'exp/segment_train_long/docs/split6/1/docs.txt': [Errno 2] No such file or directory: 'exp/segment_train_long/docs/split6/1/docs.txt'
# Accounting: time=0 threads=1
# Ended (code 1) at Mon May 22 14:15:01 IDT 2017, elapsed time 0 seconds
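As a side note on the reco2file-and-channel error in the same log: for single-channel data, a default mapping can be generated along these lines (illustrative paths; I am not certain this is where the fix belongs):

```shell
# Illustrative default reco2file_and_channel for single-channel data:
# each line maps <recording-id> to <file-id> and channel "A".
awk '{print $1, $1, "A"}' \
    exp/segment_train_long/train_long_uniform_seg/wav.scp \
    > exp/segment_train_long/train_long_uniform_seg/reco2file_and_channel
```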

vimalmanohar commented 7 years ago

I'll create a pull request soon.

vimalmanohar commented 7 years ago

I fixed some issues in #1639

nizmagu commented 7 years ago

That fixed the issue, thanks a lot!

danpovey commented 4 years ago

This issue may still exist for the _nnet3 versions of these scripts. See https://groups.google.com/d/msgid/kaldi-help/a783ff67-7cb4-4f7e-b2db-5b67ca032478%40googlegroups.com.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.