Open GoogleCodeExporter opened 9 years ago
TTC term suite "merely accepts UTF-8 text files as inputs", as written in the
user's guide.
In general, text manipulation does not require any knowledge of text
encodings.
Using the TTC suite, however, requires particular attention to this subject for
all languages other than English.
CORPORA MUST BE ENCODED IN UTF-8.
AS IT IS NOT THE DEFAULT ENCODING IN THE WINDOWS ENVIRONMENT, CONVERSION IS
MANDATORY BEFORE ANY TTC TERM SUITE PROCESSING.
The main text tools on Windows do not use UTF-8 as the default encoding. The
default encoding is generally "ANSI", more precisely either Windows-1252 or
ISO 8859-1 for Western European languages. See
- UTF8 : http://en.wikipedia.org/wiki/UTF-8
- Windows-1252 : http://en.wikipedia.org/wiki/Windows-1252
- ISO/IEC_8859-1 : http://en.wikipedia.org/wiki/ISO/IEC_8859-1
- character encoding : http://w3techs.com/technologies/overview/character_encoding/all
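To make the difference between these encodings concrete, here is a minimal Python sketch (Python is my choice for illustration, it is not part of the TTC tools). It only shows how the same character is stored under each encoding; the helper name `decodes_as_utf8` is hypothetical.

```python
# The same character is stored as different bytes depending on the encoding.
text = "é"
cp1252_bytes = text.encode("cp1252")   # one byte: 0xE9
utf8_bytes = text.encode("utf-8")      # two bytes: 0xC3 0xA9


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes form valid UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A Windows-1252 "é" is not valid UTF-8 on its own.
print(decodes_as_utf8(cp1252_bytes))   # False
print(decodes_as_utf8(utf8_bytes))     # True

# Reading UTF-8 bytes as Windows-1252 gives the classic garbled output.
print(utf8_bytes.decode("cp1252"))     # Ã©

# Windows-1252 and ISO 8859-1 differ in the 0x80-0x9F range, e.g. the euro sign.
print("€".encode("cp1252"))            # b'\x80'
```

This is why a corpus saved as "ANSI" has to be converted before it is handed to a tool that expects UTF-8.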
In practice, many free tools provide easy conversion to UTF-8.
Preliminary step:
SGML, XML and HTML files have to be converted to plain text first (removal of
the tags), and entities have to be resolved (&eacute; -> é, ...).
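As an illustration of this preliminary step, here is a rough Python sketch using only the standard library's `html.parser`. It is an assumption on my side, not something shipped with the TTC tools, and the names `TextExtractor` and `markup_to_text` are hypothetical. It handles HTML-style named entities; custom entities declared in an SGML or XML DTD would need a real parser for that format.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content only; tags are dropped, entities are resolved."""

    def __init__(self):
        # convert_charrefs=True turns &eacute; and &#233; into "é" for us.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def markup_to_text(markup: str) -> str:
    """Return the markup with tags removed and character references resolved."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(markup_to_text("<p>caf&eacute; cr&egrave;me</p>"))  # café crème
```

Note that this keeps the content of every element, including scripts or style sheets in an HTML page, so real web pages may need extra filtering.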
For small plain text corpora, each file can be converted individually.
Notepad++ converts individual files in 3 clicks:
(http://notepad-plus-plus.org/)
- check the encoding: in the Encoding menu (Encodage in French), if "Encode in
UTF-8" (Encoder en UTF-8) is checked, there is nothing to do.
- otherwise (ANSI encoding):
-- convert to UTF-8 (without BOM) -> "Encode in UTF-8" (Encoder en UTF-8) is now checked
-- save the file
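The same check-then-convert logic can be sketched in Python for anyone who prefers a script to Notepad++. This is only an illustration under my own assumptions: the function name `ensure_utf8` is hypothetical, and the fallback encoding is assumed to be Windows-1252, which you should change if your files use something else.

```python
import codecs


def ensure_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return the content as UTF-8 without BOM (hypothetical helper).

    If the bytes are already valid UTF-8 they are kept as they are;
    otherwise they are assumed to be in the fallback encoding.
    """
    # Drop a leading UTF-8 byte order mark, matching "UTF-8 without BOM".
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    try:
        raw.decode("utf-8")
        return raw  # already UTF-8, nothing to do
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")


ansi_content = "été".encode("cp1252")
print(ensure_utf8(ansi_content) == "été".encode("utf-8"))  # True
```

Pure ASCII files pass the first test unchanged, since ASCII is a subset of UTF-8.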
For larger plain text corpora, batch processing is mandatory.
Sisulizer's Kaboom is a convenient free solution on any version of Windows (use the Multi-converter tab):
http://www.sisulizer.com/kaboom/kaboom.shtml
http://www.sisulizer.com/support/downloads/download2.shtml?fn=kaboom
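As an alternative to a GUI tool, batch conversion can also be scripted. The following Python sketch is an assumption of mine, not a TTC or Kaboom feature: the function name `convert_tree`, the `*.txt` pattern and the Windows-1252 source encoding are all hypothetical defaults to adapt to your corpus. It writes converted copies to a separate folder and leaves the originals untouched.

```python
from pathlib import Path


def convert_tree(src_dir, dst_dir, source_encoding="cp1252", pattern="*.txt"):
    """Copy every matching file from src_dir to dst_dir, re-encoded as UTF-8."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    converted = []
    # sorted() lists the files up front, before any output is written.
    for src in sorted(src_dir.rglob(pattern)):
        raw = src.read_bytes()
        try:
            text = raw.decode("utf-8")           # file is already UTF-8
        except UnicodeDecodeError:
            text = raw.decode(source_encoding)   # assume the "ANSI" encoding
        dst = dst_dir / src.relative_to(src_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        dst.write_bytes(text.encode("utf-8"))    # UTF-8, no BOM added
        converted.append(dst)
    return converted
```

A call such as `convert_tree("corpus_ansi", "corpus_utf8")` (both folder names are placeholders) would mirror the folder structure. A file containing bytes that are undefined in the assumed source encoding raises `UnicodeDecodeError` instead of being silently damaged, so such files can be inspected by hand.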
Issue can be closed
Original comment by claude.m...@gmail.com
on 26 Jun 2012 at 8:16
Original issue reported on code.google.com by
claude.m...@gmail.com
on 11 Apr 2012 at 5:10