lmaurits / BEASTling

A linguistics-focussed command line tool for generating BEAST XML files.
BSD 2-Clause "Simplified" License

Encoding in tutorial file #227

Open Snedronningen opened 5 years ago

Snedronningen commented 5 years ago

I'm working through the tutorial at the moment. Can somebody please advise which encoding to choose in the text editor? UTF-8 gives me a decoding error:

return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 66: character maps to <undefined>

UTF-16 gives me a missing section header error:

File "C:\Users\PPLS User\AppData\Local\Programs\Python\Python37\lib\site-packages\configparser-3.5.0-py3.7.egg\backports\configparser\__init__.py", line 1101, in _read
    raise MissingSectionHeaderError(fpname, lineno, line)
backports.configparser.MissingSectionHeaderError: File contains no section headers.

Win10, Python 3.7.1 64bit

Thank you very much in advance!

Anaphory commented 5 years ago

Hi,

I will try to find out how you might have got that error – did you get it for the first beastling run in the tutorial,

$ beastling ie_vocabulary.conf

or at a later step? Generally, UTF-8 should be the encoding of choice. That is in fact why the example output there contains the line

# -*- coding: utf-8 -*-

However, your error message UnicodeDecodeError: 'charmap' codec can't decode is the symptom of Python reading or writing characters that the active Windows code page cannot represent, either on the command prompt (CMD) or in a file opened without an explicit encoding. It might therefore not be a problem with your configuration file, but with error or warning messages BEASTling tries to output back to you.

The thing that confuses me most about this is that it concerns the byte 0x90. On its own that is a control character, and in UTF-8 it only occurs as a continuation byte, so I am surprised to see it here.

I'll have a look, thank you for bringing this to my attention!

xrotwang commented 5 years ago

Maybe this line in the tutorial

Create a file called ie_vocabulary.conf using your favourite text editor

is not specific enough? I don't know about Win10, but on earlier Windows versions, this would probably have meant Notepad saving as cp1252.

Anaphory commented 5 years ago

The configuration file usually does not contain anything outside cp1252/‘charmap’.

However, the IPA in our example data file https://raw.githubusercontent.com/lmaurits/BEASTling/release-1.2/docs/tutorial_data/ie_cognates.csv does contain some ‘strange’ (i.e. non-cp1252) characters, including ː, which is '\xcb\x90'.

I think we forgot to load this kind of CSV file with an explicit UTF-8 encoding. In that case, Python under Windows falls back to the Windows default encoding (here cp1252) when loading files.
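
To illustrate the point, here is a minimal Python 3 sketch (not BEASTling code) that shows the two bytes of ː in UTF-8 and what happens when those bytes are decoded as cp1252. The variable names are made up for this example.

```python
import locale

# The encoding that open() falls back to when none is given;
# the value depends on the platform and its settings.
print(locale.getpreferredencoding(False))

# ː (U+02D0, the IPA length mark) takes two bytes in UTF-8.
length_mark = "\u02d0"
utf8_bytes = length_mark.encode("utf-8")
print(utf8_bytes)

# cp1252 assigns no character to byte 0x90, so decoding fails.
try:
    utf8_bytes.decode("cp1252")
    error_message = None
except UnicodeDecodeError as err:
    error_message = str(err)
print(error_message)
```

The failure message names the 'charmap' codec and byte 0x90, matching the error reported above.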

Anaphory commented 5 years ago

Actually, it seems that ie_cognates.csv should be cleanly taken through UnicodeDictReader, which correctly guesses the encoding as utf-8-sig and passes that value through to the reader. I'll have a look what's actually going on under Windows right now.
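
As a side note on why utf-8-sig is a safe guess, here is a small Python 3 sketch using only the standard library (the sample text is invented): the utf-8-sig codec decodes UTF-8 with or without a leading byte-order mark, whereas plain utf-8 keeps the mark as a character.

```python
import codecs

# Invented sample row containing the IPA length mark
text = "Language_ID,Value\nlang_a,wa\u02d0ter\n"
plain = text.encode("utf-8")
with_bom = codecs.BOM_UTF8 + plain

# utf-8-sig accepts both variants and drops the BOM if present
decoded_plain = plain.decode("utf-8-sig")
decoded_bom = with_bom.decode("utf-8-sig")
print(decoded_plain == decoded_bom == text)

# plain utf-8 keeps the BOM as a leading U+FEFF character
kept = with_bom.decode("utf-8")
print(kept.startswith("\ufeff"))
```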

Anaphory commented 5 years ago

I can reproduce the error under Windows. @Snedronningen, sorry for the problem. I'll try to do both of the following things:

Anaphory commented 5 years ago

The issue is actually in our sniffer method:

> c:\users\kaipingga\beastling\beastling\__main__.py(2)<module>()
-> from __future__ import unicode_literals
(Pdb) b beastling/fileio/datareaders.py:64
Breakpoint 1 at c:\users\kaipingga\beastling\beastling\fileio\datareaders.py:64
(Pdb) c
Error encountered while parsing configuration file:
Traceback (most recent call last):
  File "c:\users\kaipingga\beastling\beastling\cli.py", line 119, in do_generate
    config.process()
  File "c:\users\kaipingga\beastling\beastling\configuration.py", line 461, in process
    self.instantiate_models()
  File "c:\users\kaipingga\beastling\beastling\configuration.py", line 869, in instantiate_models
    model = covarion.CovarionModel(config, self)
  File "c:\users\kaipingga\beastling\beastling\models\covarion.py", line 10, in __init__
    BinaryModel.__init__(self, model_config, global_config)
  File "c:\users\kaipingga\beastling\beastling\models\binary.py", line 10, in __init__
    BaseModel.__init__(self, model_config, global_config)
  File "c:\users\kaipingga\beastling\beastling\models\basemodel.py", line 51, in __init__
    self.data = load_data(self.data_filename, file_format=model_config.get("file_format",None), lang_column=model_config.get("language_column",None), value_column=model_config.get("value_column",None))
  File "c:\users\kaipingga\beastling\beastling\fileio\datareaders.py", line 62, in load_data
    dialect = sniff(filename)
  File "c:\users\kaipingga\beastling\beastling\fileio\datareaders.py", line 28, in sniff
    sample = fp.read(1024)
  File "P:\conda\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 68: character maps to <undefined>
The program exited via sys.exit(). Exit status: 2
> c:\users\kaipingga\beastling\beastling\__main__.py(2)<module>()
-> from __future__ import unicode_literals
(Pdb)

Anaphory commented 5 years ago

@Snedronningen I think the following should work, and I will test it when I have the chance: Add the specific file format description to the tutorial configuration file. It's the file_format = cldf-legacy below, which is necessary for all BEASTling configuration files using the tutorial data under Windows for the moment.

[model ie_vocabulary]
model = covarion
data = ie_cognates.csv
file_format = cldf-legacy
Snedronningen commented 5 years ago

Excellent! Thank you very much!

I normally avoid Windows, but since I had massive trouble getting Beauti to work in Beast2 on Linux, I used the Windows version.

If you have a moment, could you explain how the sniffer method caused it? Just out of curiosity.

Thank you very much!

Anaphory commented 5 years ago

Sure! The sniffer function is supposed to guess the CSV dialect in the data file. It opens the data file like

https://github.com/lmaurits/BEASTling/blob/b422d1644e118d87e78c50e7d50194bb21f69edb/beastling/fileio/datareaders.py#L25

and then tries to read some characters from it, hoping to find enough , or \ts to make up its mind about TSV vs. CSV, quoting of text cells etc. in

https://github.com/lmaurits/BEASTling/blob/b422d1644e118d87e78c50e7d50194bb21f69edb/beastling/fileio/datareaders.py#L28

Now, the .open() call in L25 uses the operating system's default encoding, which is usually UTF-8 under Linux (so ː is fine) but CP1252 under Windows, where the byte \x90 (the second byte of ː in UTF-8) is not assigned to any character. So when .read() encounters ː under Linux, everything is fine; under Windows, the error you reported occurs. When you specify cldf-legacy, the file format is not guessed but deduced from the file extension (tsv or csv), so the sniffer never runs and the error never occurs: the actual loading is done by UnicodeDictReader, which assumes UTF-8 unless told otherwise.
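
For illustration, a sniffer that does not depend on the platform default could look roughly like the sketch below. This is an assumption about one possible approach, not BEASTling's actual code or fix: sniff_utf8 is a hypothetical helper, the sample data is invented, and the sketch assumes Python 3 (where the built-in open() accepts an encoding argument).

```python
import csv
import os
import tempfile


def sniff_utf8(filename, sample_size=1024):
    """Guess the CSV dialect, always decoding the file as UTF-8.

    Hypothetical helper for illustration; not BEASTling's sniff().
    """
    # utf-8-sig also strips a leading byte-order mark if present
    with open(filename, encoding="utf-8-sig", newline="") as fp:
        sample = fp.read(sample_size)
    # Only consider comma and tab as candidate delimiters
    return csv.Sniffer().sniff(sample, delimiters=",\t")


# Usage: write a small UTF-8 file containing ː and sniff it
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")
    with open(path, "w", encoding="utf-8", newline="") as fp:
        fp.write("Language_ID,Feature_ID,Value\n")
        fp.write("lang_a,water,wa\u02d0ter\n")
        fp.write("lang_b,fire,fi\u02d0r\n")
    dialect = sniff_utf8(path)

print(repr(dialect.delimiter))
```

Because the encoding is passed explicitly, the read behaves the same under Linux and Windows.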

Anaphory commented 5 years ago

My commiserations on trying to use Beauti on any platform. Did you manage to get the tutorial to work with file_format = cldf-legacy? I haven't been in front of a Windows computer since, so I haven't managed to test it yet.