senarvi / theanolm

TheanoLM is a recurrent neural network language modeling tool implemented using Theano
Apache License 2.0

Bug in handling of gzipped input files #19

Open Waino opened 7 years ago

Waino commented 7 years ago

The command line help indicates that gzipped input files are supported. However, if a gzipped training data file or validation data file is given, training fails with a UnicodeDecodeError:

  File "/l/sgronroo/scratch/theanopy3/bin/theanolm", line 12, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/l/sgronroo/scratch/theanopy3/theanolm/bin/theanolm", line 46, in <module>
    main()
  File "/l/sgronroo/scratch/theanopy3/theanolm/bin/theanolm", line 41, in main
    args.command_function(args)
  File "/l/sgronroo/scratch/theanopy3/theanolm/theanolm/commands/train.py", line 303, in train
    trainer.train()
  File "/l/sgronroo/scratch/theanopy3/theanolm/theanolm/trainers/basictrainer.py", line 132, in train
    self._validate(perplexity)
  File "/l/sgronroo/scratch/theanopy3/theanolm/theanolm/trainers/localstatisticstrainer.py", line 57, in _validate
    perplexity = self.scorer.compute_perplexity(self.validation_iter)
  File "/l/sgronroo/scratch/theanopy3/theanolm/theanolm/scoring/textscorer.py", line 130, in compute_perplexity
    for word_ids, _, mask in batch_iter:
  File "/l/sgronroo/scratch/theanopy3/theanolm/theanolm/iterators/batchiterator.py", line 93, in __next__
    sequence = self._read_sequence()
  File "/l/sgronroo/scratch/theanopy3/theanolm/theanolm/iterators/batchiterator.py", line 168, in _read_sequence
    for word in utterance_from_line(line)]
  File "/l/sgronroo/scratch/theanopy3/theanolm/theanolm/iterators/batchiterator.py", line 20, in utterance_from_line
    line = line.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

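It appears that the problem is caused by the mmap access to the file (SentencePointers in iterators/shufflingbatchiterator:66), which fails for gzipped files. mmap maps the raw compressed bytes of the file, so the transparent unzipping (implemented in TextFileType, filetypes.py:95) has no effect. The byte 0x8b at position 1 in the error message is consistent with this: it is the second byte of the gzip magic number (0x1f 0x8b).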
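To illustrate the diagnosis above outside of TheanoLM, here is a minimal standalone sketch using only the Python standard library. It does not use any TheanoLM code; the helper names (`first_line_via_mmap`, `is_gzipped`, `decompress_to_temp`, `demo`) and the file name `train.txt.gz` are made up for the illustration. It shows that reading a gzipped file through mmap yields the compressed bytes, which fail to decode as UTF-8, and that one possible workaround (an assumption, not necessarily what the project should do) is to decompress to a temporary file and mmap that instead.

```python
import gzip
import mmap
import os
import shutil
import tempfile

# First two bytes of every gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def first_line_via_mmap(path):
    """Return the raw bytes of the first line as seen through mmap."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m.readline()


def is_gzipped(path):
    """Hypothetical helper: detect gzip by magic number, not by extension."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def decompress_to_temp(path):
    """Hypothetical workaround: write a decompressed copy that can be mmapped.

    The caller is responsible for deleting the returned file.
    """
    tmp = tempfile.NamedTemporaryFile(suffix=".txt", delete=False)
    with gzip.open(path, "rb") as src, tmp:
        shutil.copyfileobj(src, tmp)
    return tmp.name


def demo():
    workdir = tempfile.mkdtemp()
    gz_path = os.path.join(workdir, "train.txt.gz")
    with gzip.open(gz_path, "wt", encoding="utf-8") as f:
        f.write("hello world\nsecond sentence\n")

    # mmap sees the compressed bytes, so decoding them as UTF-8 fails.
    raw = first_line_via_mmap(gz_path)
    try:
        raw.decode("utf-8")
        failed = False
    except UnicodeDecodeError:
        failed = True

    # After decompressing to a plain file, mmap access works as expected.
    plain_path = decompress_to_temp(gz_path) if is_gzipped(gz_path) else gz_path
    fixed = first_line_via_mmap(plain_path).decode("utf-8")

    os.remove(plain_path)
    shutil.rmtree(workdir)
    return raw[:2], failed, fixed


if __name__ == "__main__":
    print(demo())
```

The temporary copy costs disk space proportional to the uncompressed data, so for large corpora it may be preferable to require uncompressed input wherever mmap is used and to fail with a clear error message when the magic number is detected.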

senarvi commented 7 years ago

I'm thinking about this kind of solution: