glample / tagger

Named Entity Recognition Tool
Apache License 2.0

ValueError: max() arg is an empty sequence #58

Closed (victoriastuart closed this issue 7 years ago)

victoriastuart commented 7 years ago

Two issues:

  1. Others (e.g. issues #20, #41) asked what a 'tokenized sentence' is; that puzzled me too. Answer: any plain sentence on a single line will do, because tagger.py splits each input line on whitespace; e.g.

    Victoria was born in 1961 in Halifax, Nova Scotia, Canada.

  2. If your input file contains blank lines, e.g.

    Victoria was born in 1961 in Halifax, Nova Scotia, Canada.
    
    Victoria used to work at NIEHS in North Carolina.

then tagger.py (through utils.py) raises an error:

...
    max_length = max([len(word) for word in words])
ValueError: max() arg is an empty sequence
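
The failure is easy to reproduce outside the tagger. Here is a minimal sketch, assuming (as the traceback suggests) that the code takes the maximum word length over the words of each line. `longest_word` is a hypothetical helper for illustration, not a function from the repository:

```python
def longest_word(line):
    """Return the length of the longest whitespace-separated token in line."""
    words = line.rstrip().split()  # a blank line yields an empty list
    return max([len(word) for word in words])  # max() fails on an empty list


print(longest_word(u"Victoria was born in 1961 in Halifax, Nova Scotia, Canada.\n"))

try:
    longest_word(u"\n")  # a blank line, as read from the input file
except ValueError as err:
    print("ValueError: %s" % err)
```

The first call succeeds because the line contains tokens; the second raises `ValueError` because `split()` returns an empty list for a blank line.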

A simple fix is to change the following lines in tagger.py:

Original:

print 'Tagging...'
with codecs.open(opts.input, 'r', 'utf-8') as f_input:
    count = 0
    for line in f_input:
        words = line.rstrip().split()

Modified:

print 'Tagging...'
with codecs.open(opts.input, 'r', 'utf-8') as f_input:
    count = 0
    for line in f_input:
        if len(line) <= 1:
            line = ''
        words = line.rstrip().split()

Added lines:

        if len(line) <= 1:
            line = ''
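
Note that the length check only catches lines that hold nothing but a newline; a line of spaces or tabs would still split into an empty word list. Below is a minimal sketch of a stricter guard that tests the token list itself. It is written as a standalone generator, and `iter_sentences` is a hypothetical name, not part of tagger.py:

```python
import io


def iter_sentences(lines):
    """Yield (words, is_blank) for each input line.

    A line counts as blank when it contains no tokens at all, which
    covers empty, newline-only and whitespace-only lines.
    """
    for line in lines:
        words = line.rstrip().split()
        yield words, not words


# Example input with a newline-only line and a spaces-only line.
sample = io.StringIO(
    u"Victoria was born in 1961 in Halifax, Nova Scotia, Canada.\n"
    u"\n"
    u"   \n"
    u"Victoria used to work at NIEHS in North Carolina.\n"
)

for words, is_blank in iter_sentences(sample):
    if is_blank:
        print("")  # emit an empty line so output stays aligned with input
        continue
    print("%d tokens" % len(words))
```

In tagger.py the equivalent change would presumably be to test `words` rather than `len(line)` before tagging. How blank lines should appear in the output is a choice left to the caller.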

nkruglikov commented 7 years ago

@victoriastuart Thanks a lot, you just saved me a lot of time!

Rabia-Noureen commented 7 years ago

Hi @victoriastuart @nkruglikov, I am new to Python. Could you please help me train the model using the GoogleNews word embeddings? I am trying to train with this command:

python train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb --lr_method=adam --tag_scheme=iob --pre_emb=GoogleNews-vectors-negative300.bin --all_emb=300

I got this error: [image]

I have been stuck on this issue for about two months and have not been able to resolve it. Thanks in advance.