sanju2010 / ttc-project

Automatically exported from code.google.com/p/ttc-project

ttc-term-suite-1.2 on Windows: French characters not supported #11

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Launch ttc-term-suite-1.2 on Windows.
2. Select the French directory of the provided sample (wind energy).
3. Run.

What is the expected output? What do you see instead?
Expected: French words with accents (é, è, î, ...).
Result: French words with "?" in place of the accented characters
(g?n?ratrice, l??olienne, l??lectricit, ...);
see attached file.

What version of the product are you using? On what operating system?
ttc-term-suite-1.2 on Windows; reproduced on every version tested (XP SP3 and 7,
32-bit and 64-bit).

Please provide any additional information below.
Tested with the German sample: it works (ö, ü, ...).
The problem can be observed in the XMI file, whatever the encoding of the input txt
files (tested in ANSI and UTF-8, same result).
The sample is ANSI with LF line endings only. (CR LF in input files does not seem to be
properly supported.)

Unable to evaluate ttc-term-suite on French corpora.

Original issue reported on code.google.com by claude.m...@gmail.com on 11 Apr 2012 at 5:10


GoogleCodeExporter commented 9 years ago
TTC Term Suite "merely accepts UTF-8 text files as inputs", as written in the
user's guide.

In general, text manipulation does not require any knowledge of text encoding.
Using the TTC suite, however, requires particular attention to this subject for all
languages other than English.
CORPORA MUST BE ENCODED IN UTF-8.

AS UTF-8 IS NOT THE DEFAULT ENCODING IN THE WINDOWS ENVIRONMENT, CONVERSION IS
MANDATORY BEFORE ANY TTC TERM SUITE PROCESSING.

Most text tools on Windows do not use UTF-8 as the default encoding. The default
encoding is generally "ANSI", more precisely either Windows-1252 or ISO 8859-1
for Western European languages. See:
 - UTF-8: http://en.wikipedia.org/wiki/UTF-8
 - Windows-1252: http://en.wikipedia.org/wiki/Windows-1252
 - ISO/IEC 8859-1: http://en.wikipedia.org/wiki/ISO/IEC_8859-1
 - Character encoding usage: http://w3techs.com/technologies/overview/character_encoding/all
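
To make the difference between these encodings concrete, here is a small Python sketch (Python is my choice for illustration only; TTC Term Suite itself is not involved). It shows that the same French word is stored as different bytes in UTF-8 and in Windows-1252, and what happens when one is read as the other.

```python
# The same word, encoded three ways.
word = "génératrice"

utf8_bytes = word.encode("utf-8")      # each é becomes the two bytes C3 A9
cp1252_bytes = word.encode("cp1252")   # each é becomes the single byte E9
latin1_bytes = word.encode("latin-1")  # same bytes as cp1252 for this word

# Reading UTF-8 bytes as Windows-1252 does not fail, it silently garbles.
mojibake = utf8_bytes.decode("cp1252")

# Reading Windows-1252 bytes as UTF-8 fails, because a lone E9 byte
# followed by an ASCII letter is not a valid UTF-8 sequence.
try:
    cp1252_bytes.decode("utf-8")
    ansi_reads_as_utf8 = True
except UnicodeDecodeError:
    ansi_reads_as_utf8 = False

# Windows-1252 and ISO 8859-1 are not identical: "œ" exists only in the former.
try:
    "œ".encode("latin-1")
    oe_in_latin1 = True
except UnicodeEncodeError:
    oe_in_latin1 = False

print(utf8_bytes, cp1252_bytes, mojibake, ansi_reads_as_utf8, oe_in_latin1)
```

This only illustrates the byte-level difference; it does not claim to reproduce how the "?" characters in the report were produced.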

In practice, many free tools provide easy conversion to UTF-8.
Preliminary step:
SGML, XML and HTML files first have to be converted to plain text (removal of the tags),
and entities have to be resolved (&eacute; -> é, ...).
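
For those who prefer a script for this preliminary step, a minimal sketch using only the Python standard library is given here. The class and function names (TextExtractor, markup_to_text) are mine, not part of any tool mentioned in this thread, and the approach is tuned to HTML: it keeps all text between tags (including the contents of script or style elements, if any) and resolves HTML entities. Real SGML or XML corpora may need a dedicated parser.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found between tags; the tags themselves are dropped."""

    def __init__(self):
        # convert_charrefs=True resolves entities such as &eacute; into é
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def markup_to_text(markup: str) -> str:
    """Hypothetical helper: strip tags and resolve entities in one pass."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flushes any text still buffered by the parser
    return "".join(parser.parts)


print(markup_to_text("<p>La g&eacute;n&eacute;ratrice de l'&eacute;olienne</p>"))
```

The resulting string can then be written out as UTF-8 like any other plain text.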

For small plain-text corpora, each file can be converted individually.
Notepad++ converts individual files in 3 clicks:
(http://notepad-plus-plus.org/)
- check the encoding: open the Encoding menu (Encodage in French); if "Encode in UTF-8"
(Encoder en UTF8) is checked, there is nothing to do.
- else (ANSI encoding):
 -- convert to UTF-8 (without BOM) -> "Encode in UTF-8" (Encoder en UTF8) is now checked
 -- save the file
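
The same check can be done without opening each file in an editor. Below is a rough Python sketch; the function name utf8_status and its three return values are my own convention. It works on the raw bytes of a file and relies on the fact that accented text saved as ANSI is almost never valid UTF-8. Note that a pure-ASCII file is reported as "utf-8", which is correct since ASCII is a subset of UTF-8.

```python
import codecs


def utf8_status(data: bytes) -> str:
    """Classify raw file bytes: 'not utf-8', 'utf-8 with BOM' or 'utf-8'."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not utf-8"  # probably ANSI: conversion needed
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"  # valid, but starts with a byte order mark
    return "utf-8"  # nothing to do


print(utf8_status("éolienne".encode("cp1252")))
print(utf8_status("éolienne".encode("utf-8")))
```

For a real file, the bytes would come from something like open(path, "rb").read().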

 For larger plain-text corpora, batch processing is mandatory.
  Sisulizer's Kaboom is a convenient free solution on any version of Windows (use the Multi-converter tab):
    http://www.sisulizer.com/kaboom/kaboom.shtml
    http://www.sisulizer.com/support/downloads/download2.shtml?fn=kaboom
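
As an alternative to a GUI tool, batch conversion can also be scripted. The following Python sketch is only a suggestion under several assumptions of mine: files that are not valid UTF-8 are taken to be Windows-1252, corpus files match *.txt, and CR LF line endings are rewritten as LF (because the original report noted that CR LF did not seem to be properly supported). The names to_utf8_text and convert_tree are hypothetical. Output is written to a separate directory so the originals are left untouched.

```python
from pathlib import Path


def to_utf8_text(data: bytes, fallback: str = "cp1252") -> str:
    """Decode bytes as UTF-8 if possible, else with the assumed fallback."""
    try:
        text = data.decode("utf-8-sig")  # utf-8-sig also drops a leading BOM
    except UnicodeDecodeError:
        # Assumption: anything that is not UTF-8 is Windows-1252 ("ANSI").
        text = data.decode(fallback, errors="replace")
    return text.replace("\r\n", "\n")  # CR LF -> LF


def convert_tree(src_dir: str, dst_dir: str, pattern: str = "*.txt") -> int:
    """Convert every matching file under src_dir into dst_dir; return the count."""
    src, dst = Path(src_dir), Path(dst_dir)
    count = 0
    for path in sorted(src.rglob(pattern)):
        target = dst / path.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        # Write UTF-8 without BOM, as bytes so newlines are not translated.
        target.write_bytes(to_utf8_text(path.read_bytes()).encode("utf-8"))
        count += 1
    return count
```

A call such as convert_tree("corpus_ansi", "corpus_utf8") (both directory names are placeholders) would then produce a UTF-8 copy of the corpus to feed to TTC Term Suite.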

This issue can be closed.

Original comment by claude.m...@gmail.com on 26 Jun 2012 at 8:16