krismyu closed this issue 7 years ago
Since it doesn't look like the Python version does anything with Praat files yet, which program was this a problem for? According to this page:
http://www.fon.hum.uva.nl/praat/manual/Unicode.html
Praat will produce UTF-16 by default for non-ASCII text, but can be set to produce UTF-8. (UTF-8 would, IMO, be a much better default choice, but that is neither here nor there.) Python can handle either, and for a properly formatted UTF-16 file (one that starts with a BOM) its utf-16 codec detects the byte order automatically. There will certainly need to be some code (and tests :) for this use case.
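For what it's worth, here is a rough sketch of how the reading side could pick between the two encodings by looking at the BOM. The function name `read_textgrid_text` and the `default_encoding` parameter are made up for illustration and are not part of any existing parser; it also assumes that a file without a BOM is UTF-8 and does not try to handle UTF-32.

```python
import codecs


def read_textgrid_text(path, default_encoding="utf-8"):
    """Return the decoded text of a TextGrid file (hypothetical helper).

    Assumption: a file without a BOM is in `default_encoding`.
    """
    with open(path, "rb") as f:
        raw = f.read()

    # UTF-8 with BOM: "utf-8-sig" strips the BOM while decoding.
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")

    # UTF-16 with BOM: the "utf-16" codec reads the BOM to choose
    # little- or big-endian and removes it from the result.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")

    # No BOM: fall back to the assumed default.
    return raw.decode(default_encoding)
```

The returned string could then be handed to whichever TextGrid parser we end up with, so the parser itself never has to care about the on-disk encoding.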
I found a TextGrid parser in the nltk project, which I've pulled in as part of pull request #9 (I didn't end up using it yet, though). So if you post your test file, I'll check it, and work with upstream to fix it if the test fails. We'll also want to use it in a unit test to make sure we can copy the non-ascii labels to the output file correctly.
I have a TextGrid parser package here that works with UTF-8 and UTF-16, if it would be useful for this project. There are tests included for UTF-8 with and without BOM, and UTF-16.
@scjs Thanks! We'll take a look at this when we have a chance.
The NLTK TextGrid parser we were using did not support Unicode encodings (UTF-8, UTF-16) or non-ASCII characters. We ended up using a parser by @kylebgorman et al., since it has an MIT license that is compatible with our current license. The fix is in https://github.com/voicesauce/opensauce-python/commit/ebf5ea8fed7b4778beee629b496f9aeba06ed8db and https://github.com/voicesauce/opensauce-python/commit/f96d41d7a469a5bc0947e054daca20ba414c2c97.
A colleague mentioned something we should keep an eye out for while working on the function that reads in Praat TextGrids: they were having trouble with non-ASCII characters. Apparently, if there were non-ASCII characters anywhere in the TextGrid (even in a part of the TextGrid not being analyzed, e.g. in another tier), there were problems.