larsmans / seqlearn

Sequence learning toolkit for Python
http://larsmans.github.io/seqlearn/
MIT License

load_conll, unicode support #19

Open alexeyev opened 8 years ago

alexeyev commented 8 years ago

Hi, loading data in CoNLL format fails on my custom dataset, which contains non-ASCII characters. When I read the data with encoding 'utf-8' set, I get the following error:

  File "/usr/local/lib/python2.7/dist-packages/seqlearn/datasets.py", line 65, in <genexpr>
    lines = (str.split(line) for line in  f)
TypeError: descriptor 'split' requires a 'str' object but received a 'unicode'

The code in question:

def _conll_sequences(f, features, labels, lengths, split):
    # Divide input into blocks of empty and non-empty lines.
    lines = (str.strip(line) for line in  f)

Everything works perfectly when I change the last line to:

 lines = (line.strip() for line in  f)

Is there any reason this fix would be unwanted?
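
For illustration, here is a minimal sketch of why the two spellings behave differently. It is written for Python 3, where `bytes` stands in for the mismatched type (on Python 2.7 the mismatch is `unicode` versus `str`, as in the traceback above). The helper names `stripped_lines_unbound` and `stripped_lines_method` are hypothetical and not part of seqlearn; they only mirror the two forms of the generator expression.

```python
import io


def stripped_lines_unbound(f):
    # Mirrors the original pattern: str.strip is looked up on the str type,
    # so it only accepts objects that are instances of str.
    return [str.strip(line) for line in f]


def stripped_lines_method(f):
    # Mirrors the proposed fix: strip() is looked up on each line object,
    # so it works for any type that provides a strip() method.
    return [line.strip() for line in f]


# A text stream with non-ASCII content and an empty separator line.
text_file = io.StringIO(u"слово NOUN\n\nещё ADV\n")
text_result = stripped_lines_method(text_file)

# A byte stream: the method call still works and returns bytes.
byte_file = io.BytesIO(u"слово NOUN\n".encode("utf-8"))
bytes_result = stripped_lines_method(byte_file)

# The type-bound call rejects lines that are not str instances.
try:
    stripped_lines_unbound(io.BytesIO(b"word NOUN\n"))
    unbound_failed = False
except TypeError:
    unbound_failed = True
```

Under these assumptions, `line.strip()` is the more permissive form: it does not care which string type the file object yields, whereas `str.strip(line)` raises `TypeError` as soon as the line is of a different type.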

alexeyev commented 8 years ago

Hi, the project isn't supported anymore, is it?