Reading in Praat TextGrids with non-ASCII characters

voicesauce / opensauce-python

Voice analysis software (Python port of VoiceSauce)

Apache License 2.0


Reading in Praat TextGrids with non-ASCII characters #8

Closed: krismyu closed this issue 7 years ago

krismyu commented 8 years ago

A colleague mentioned something we should keep an eye out for while working on the function that reads in Praat TextGrids: they were having trouble with non-ASCII characters. Apparently, if there were non-ASCII characters anywhere in the TextGrid (even in a part of the TextGrid not being analyzed, i.e. in another tier), there were problems.

bitdancer commented 8 years ago

Since it doesn't look like the Python version does anything with Praat files yet, which program was this a problem for? According to this page:

http://www.fon.hum.uva.nl/praat/manual/Unicode.html

Praat will produce UTF-16 by default for non-ASCII text, but can be set to produce UTF-8 (which would, IMO, be a much better default choice, but that is neither here nor there). Python can handle either, and its `utf-16` codec can work out the byte order of a properly formatted UTF-16 file automatically from the BOM. There will certainly need to be some code (and tests :) for this use case.
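To make the encoding point concrete, here is a minimal sketch of how a reader might pick the codec from the byte-order mark. The function name `read_textgrid_text` is hypothetical and not part of opensauce-python, and the sketch assumes the file is either UTF-16 with a BOM or UTF-8 (with or without a BOM); other encodings are not handled.

```python
import codecs


def read_textgrid_text(path):
    """Return the contents of a TextGrid file as a str.

    Hypothetical helper: assumes the file is UTF-16 with a BOM,
    or UTF-8 with or without a BOM.
    """
    with open(path, "rb") as f:
        raw = f.read()
    # A UTF-16 file written with a BOM starts with FF FE or FE FF.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec consumes the BOM and uses it to pick the byte order.
        return raw.decode("utf-16")
    # "utf-8-sig" strips a UTF-8 BOM if present and otherwise acts like "utf-8".
    return raw.decode("utf-8-sig")
```

Reading the bytes first and decoding afterwards keeps the decision in one place; a file in some other encoding would raise `UnicodeDecodeError` rather than being silently misread.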

krismyu commented 8 years ago

Yes, I can provide some files for test cases for this when the time comes. It looks like a parser for TextGrids in Python was in progress, here, based on the Octave/MATLAB code here.

bitdancer commented 8 years ago

I found a TextGrid parser in the NLTK project, which I've pulled in as part of pull request #9 (I didn't end up using it yet, though). So if you post your test file, I'll check it, and work with upstream to fix it if the test fails. We'll also want to use it in a unit test to make sure we can copy the non-ASCII labels to the output file correctly.
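A unit test along those lines could look roughly like the sketch below. It is only an illustration: `write_labels` is a hypothetical stand-in for whatever the real output step turns out to be, and the labels are made-up examples rather than data from an actual TextGrid.

```python
import io
import unittest


def write_labels(labels, stream):
    """Hypothetical stand-in for the output step: one label per line."""
    for label in labels:
        stream.write(label + "\n")


class NonAsciiLabelTest(unittest.TestCase):
    def test_labels_survive_utf8_round_trip(self):
        # Made-up labels containing non-ASCII characters.
        labels = ["ɑ", "ŋ", "tɕʰ", "é"]
        buf = io.BytesIO()
        # Encode explicitly as UTF-8 instead of relying on the platform default.
        text = io.TextIOWrapper(buf, encoding="utf-8", newline="")
        write_labels(labels, text)
        text.flush()
        decoded = buf.getvalue().decode("utf-8")
        self.assertEqual(decoded.splitlines(), labels)
```

The key point is that the output encoding is stated explicitly, so the test behaves the same regardless of the locale it runs under.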

scjs commented 7 years ago

I have a TextGrid parser package here that works with UTF-8 and UTF-16, if it would be useful for this project. There are tests included for UTF-8 with and without BOM, and UTF-16.

terriyu commented 7 years ago

@scjs Thanks! We'll take a look at this when we have a chance.

terriyu commented 7 years ago

The NLTK TextGrid parser we were using did not support UTF encodings or non-ASCII characters. We ended up using a parser by @kylebgorman et al., since it has an MIT license that is compatible with our current license. The fix is in https://github.com/voicesauce/opensauce-python/commit/ebf5ea8fed7b4778beee629b496f9aeba06ed8db and https://github.com/voicesauce/opensauce-python/commit/f96d41d7a469a5bc0947e054daca20ba414c2c97.