malcolmgreaves / language-detection

Automatically exported from code.google.com/p/language-detection. Some after-the-fact modifications to get this working within sbt.
Apache License 2.0

Non Wikipedia corpus for profile generation #23

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
Hello,

I would like to ask whether it is possible to use something other than the 
Wikipedia abstracts mentioned in the wiki to generate the language profiles. 
More precisely, I would like to know whether it is possible to use the Europarl 
parallel corpus (dedicated to the EU languages but continuously improved and 
realigned).

Is it useful to generate or regenerate profiles from such a corpus, or are the 
Wikipedia extracts sufficient?

This corpus is available at http://www.statmt.org/europarl/

To save you from downloading the 1.3 GB of data, here are the two kinds of 
files it contains:
1) One huge file per language (for example, 65 MB for LV) containing lines of 
text in that language. This is perhaps the easiest corpus to work with, but 
what about Java memory exceptions?
2) Many small files containing 'bad' XML (opening tags with no corresponding 
closing tags).

Regards,
Emmanuel

P.S.: Sorry to spam you today with my 3 issues, but your API is really useful 
and fast enough to fit our constraints ;-)

Original issue reported on code.google.com by zygolech...@gmail.com on 29 Aug 2011 at 3:01
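
The second kind of file described above cannot be fed to a strict XML parser, because its tags are unbalanced. One way around this is to drop anything that looks like a tag and keep the remaining text, one line at a time. The following is a minimal sketch in Python, not part of langdetect; the function name `strip_tags` and the sample lines are hypothetical, and it assumes that no tag spans more than one line.

```python
import re
from typing import Iterable, Iterator

# Matches anything that looks like a tag, whether or not it is ever closed.
# Assumption: a tag never spans more than one line.
TAG = re.compile(r"<[^>]*>")


def strip_tags(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line with tag-like markup removed, skipping empty results."""
    for line in lines:
        text = TAG.sub(" ", line)
        # Collapse the whitespace left behind by removed tags.
        text = " ".join(text.split())
        if text:
            yield text


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    sample = ["<CHAPTER ID=1>", "Resumption of the session", "<P>"]
    for cleaned in strip_tags(sample):
        print(cleaned)
```

Because `strip_tags` is a generator, it can be applied directly to an open file and never holds more than one line in memory.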

GoogleCodeExporter commented 9 years ago
Hello Emmanuel,
Thank you very much for trying the langdetect library.

I'm planning to provide some language profiles based on other corpora (news 
and so on), so I'll also try the Europarl parallel corpus.
I reckon there will be no memory problems, because langdetect processes the 
text one line at a time! :D

Thanks!

Original comment by nakatani.shuyo on 30 Aug 2011 at 2:26
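
To illustrate the point in the reply above: when a corpus is read one line at a time, memory use depends on the number of distinct n-grams seen, not on the size of the file. The following is a minimal sketch in Python of that idea, not langdetect's actual code; the function names and the choice of character n-grams up to length 3 are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable


def count_ngrams(lines: Iterable[str], max_n: int = 3) -> Counter:
    """Count character n-grams (lengths 1..max_n) over an iterable of lines."""
    counts: Counter = Counter()
    for line in lines:
        text = line.strip()
        for n in range(1, max_n + 1):
            # Slide a window of width n across the current line only.
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts


def count_ngrams_in_file(path: str, encoding: str = "utf-8") -> Counter:
    """Stream a file line by line so the whole corpus is never held in memory."""
    with open(path, encoding=encoding) as handle:
        return count_ngrams(handle)
```

Iterating over an open file yields one line at a time, so a 65 MB single-language file is handled the same way as a small one; only the counter grows.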

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r101.

Original comment by nakatani.shuyo on 8 Sep 2011 at 10:30

GoogleCodeExporter commented 9 years ago
This issue was closed by revision ab6124b1f9ae.

Original comment by nakatani.shuyo on 12 Jan 2012 at 9:47

GoogleCodeExporter commented 9 years ago
Hello,
I am new to langdetect. I have been trying to generate language profiles for new 
languages, but although the command runs without errors, no profile file is 
created. I also tried adding the option -d'profile directory', but still no 
file is created. Kindly help.

Thanks.

N. Akosu

Original comment by nickak...@gmail.com on 25 Aug 2013 at 2:52