Open GoogleCodeExporter opened 9 years ago
Thanks for your proposal.
langdetect keeps the profiles separate so that users can select only the
language profiles they need.
So to adopt your proposal, I'm afraid langdetect would need to provide both
jars, with and without profiles...
But a library that includes the profiles is easier to use, as you say...
Original comment by nakatani.shuyo
on 18 Feb 2011 at 3:42
Dear Nakatani,
You can still provide langdetect in a single jar, as you do now.
However, I package language profiles inside a jar in my application. And using
the code I provide, I can access them without unzipping the jar.
So as you say, it will be easier to use your library.
Thanks!
Original comment by kaspar.f...@gtempaccount.com
on 18 Feb 2011 at 10:50
I see. Then I'll try to include your proposal. Thanks!
Original comment by nakatani.shuyo
on 21 Feb 2011 at 3:16
I've added loadProfile(File) to DetectorFactory and committed a new
langdetect.jar; I'd like to keep it to a single call to the File constructor.
http://code.google.com/p/language-detection/source/browse/trunk/lib/langdetect.jar
I think you can do what you described in your report as follows:
DetectorFactory.loadProfile(new File(MyClass.class.getResource("profiles").toURI()));
Would that work for you?
Original comment by nakatani.shuyo
on 24 Feb 2011 at 7:56
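A note on the resource name in the snippet above: Class.getResource resolves a name without a leading "/" relative to the package of the class it is called on, while a name with a leading "/" is resolved from the classpath root. So getResource("profiles") and getResource("/profiles") can point at different places. The sketch below illustrates the difference with a class file that ships with the JDK; ResourceNameDemo and found() are hypothetical names used only for illustration, not part of langdetect.

```java
import java.net.URL;

// Illustrative only: ResourceNameDemo is a hypothetical class, not part of langdetect.
class ResourceNameDemo {

    /** Returns true if the named resource is visible through the given class. */
    static boolean found(Class<?> anchor, String name) {
        URL url = anchor.getResource(name);
        return url != null;
    }

    public static void main(String[] args) {
        // Relative name: resolved against the package of String (java/lang/).
        System.out.println(found(String.class, "String.class"));
        // Absolute name: resolved from the root, where no String.class exists.
        System.out.println(found(String.class, "/String.class"));
        // Absolute name that spells out the full package path.
        System.out.println(found(String.class, "/java/lang/String.class"));
    }
}
```

With the relative form, the profiles directory has to sit next to MyClass in its package directory; with the absolute form, it has to sit at the top of the classpath.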
I'm having trouble loading the profile.
In NetBeans, where am I supposed to put the profile folder?
Thanks.
Original comment by jkoe...@gmail.com
on 4 May 2011 at 8:49
I don't use NetBeans...
Could you try passing the absolute path of the profile directory to
DetectorFactory.loadProfile?
Original comment by nakatani.shuyo
on 6 May 2011 at 3:47
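One way to follow the suggestion above is to check what a path resolves to before handing it to loadProfile. A relative path such as "profiles" is resolved against the JVM's working directory, which in an IDE is typically (but not always) the project root. The sketch below is a minimal check under that assumption; ProfileDirCheck and its methods are hypothetical helper names, not part of langdetect.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: ProfileDirCheck is a hypothetical helper, not part of langdetect.
class ProfileDirCheck {

    /** Resolves a possibly relative path against the JVM's working directory. */
    static Path toAbsolute(String dir) {
        return Paths.get(dir).toAbsolutePath().normalize();
    }

    /** Returns true if the path points at an existing, readable directory. */
    static boolean isUsableDirectory(String dir) {
        Path path = toAbsolute(dir);
        return Files.isDirectory(path) && Files.isReadable(path);
    }

    public static void main(String[] args) {
        String dir = args.length > 0 ? args[0] : "profiles";
        System.out.println("working directory: " + System.getProperty("user.dir"));
        System.out.println("resolved path:     " + toAbsolute(dir));
        System.out.println("usable directory:  " + isUsableDirectory(dir));
        // If the check passes, the resolved path is what one would pass on, e.g.
        // DetectorFactory.loadProfile(toAbsolute(dir).toFile());
    }
}
```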
Currently, if DetectorFactory.loadProfile() has not been called in any form,
then detector.detect() throws an exception.
I would suggest having detect() fall back to the default profiles directory
packaged with the library. It gives excellent results out of the box, suitable
for most cases. (Right now I have to copy the profiles directory to my
project's resources to figure out the path, and make sure that it stays in place
if the final thing is packaged differently.)
The attached Java file contains the proposed modification to
DetectorFactory.createDetector():
static private Detector createDetector() throws LangDetectException {
    if (instance_.langlist.size()==0) {
        try {
            // Fall back to the default profiles
            loadProfile(new File(Detector.class.getResource("/profiles").toURI()));
        } catch (URISyntaxException e) {
            // The next clause will throw the exception
        }
    }
    if (instance_.langlist.size()==0)
        throw new LangDetectException(ErrorCode.NeedLoadProfileError, "need to load profiles");
    Detector detector = new Detector(instance_);
    return detector;
}
Also I added 2 create() factory methods:
* static public Detector create(String text)
* static public Detector create(Reader reader)
These modifications would allow detecting the language with a single call:
    String language = DetectorFactory.create(text).detect();
Original comment by vasilievsi@gmail.com
on 7 Jul 2011 at 9:46
Attachments:
An alternative way is to leave createDetector() as it is, and add a static
initialization (attached):
...
static private DetectorFactory instance_ = new DetectorFactory();
static {
    try {
        // Load default profiles
        loadProfile(new File(Detector.class.getResource("/profiles").toURI()));
    } catch (URISyntaxException e) {
        // If default profiles failed to load, other profiles can be loaded later
    } catch (LangDetectException e) {
    }
}
Original comment by vasilievsi@gmail.com
on 7 Jul 2011 at 10:16
Attachments:
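Both fallback variants above call getResource("/profiles").toURI() directly. getResource returns null when the resource is missing, and new File(URI) rejects any URI whose scheme is not "file" (for example a jar: URL), so neither failure would be caught by the URISyntaxException handler. A more defensive lookup could be sketched as follows; DefaultProfilesLocator and locate() are hypothetical names, and how the result is wired into DetectorFactory is left open.

```java
import java.io.File;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.Optional;

// Illustrative only: DefaultProfilesLocator is a hypothetical helper, not part of langdetect.
class DefaultProfilesLocator {

    /**
     * Looks up a classpath resource and returns it as a File only when it
     * actually lives on the file system. Returns empty when the resource is
     * missing or is packaged inside a jar (or any other non-file location).
     */
    static Optional<File> locate(Class<?> anchor, String resourceName) {
        URL url = anchor.getResource(resourceName);
        if (url == null) {
            return Optional.empty();            // resource not on the classpath
        }
        if (!"file".equals(url.getProtocol())) {
            return Optional.empty();            // e.g. jar:, which File cannot represent
        }
        try {
            return Optional.of(new File(url.toURI()));
        } catch (URISyntaxException | IllegalArgumentException e) {
            return Optional.empty();            // malformed or unsupported file URI
        }
    }

    public static void main(String[] args) {
        Optional<File> dir = locate(DefaultProfilesLocator.class, "/profiles");
        System.out.println(dir.isPresent()
                ? "default profiles at " + dir.get()
                : "no default profiles on the file system; load profiles explicitly");
    }
}
```

Returning an empty Optional instead of throwing keeps a static initializer from failing when the defaults are unavailable, so profiles can still be loaded later.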
langdetect has some reasons for its current interface.
First, I was not quite satisfied with other libraries that bundle their
profiles in the jar file.
So even the default language profiles are kept outside the jar file.
I also thought it should provide a Java-like interface, so creating an
instance and detecting languages are separate steps.
But I understand what you want to do, since I also like some functional
languages, Ruby and so on. :D
Original comment by nakatani.shuyo
on 11 Jul 2011 at 11:09
Right, good to have an interface implemented (though I see none yet). But I'm
not trying to change the interface, just adding a couple of create() methods,
like you have overloaded loadProfile(), nothing more, just one more way of
doing things. It's all about usability.
As for the second part, Java libraries are used outside Java too.
Cheers, and good luck with your project,
Sergei
Original comment by vasilievsi@gmail.com
on 13 Jul 2011 at 10:05
Hi Nakatani-san,
Thank you for the nice software.
I needed to package up langdetect class files, profiles and additional class
files into one jar in order to run a hadoop job. In that kind of a scenario,
loadProfile(File) isn't enough; a File object cannot refer to a file inside a
jar.
(And I understood your policy of not providing both jars with/without a
profile. That's perfectly fine.)
Given that, I'd like to share a workaround I used for my task:
1. Copy the profiles dir to any directory under classpath.
Since I'm a maven user, I put it here: src/main/resources/profiles
2. Add the following two methods to DetectorFactory.
private static List<String> getProfileNames(String resourceName) throws IOException {
    List<String> profileNames = new ArrayList<String>();
    InputStream is = DetectorFactory.class.getResourceAsStream("/" + resourceName);
    BufferedReader br = new BufferedReader(new InputStreamReader(is));
    String line = null;
    while ((line = br.readLine()) != null) {
        if (!line.startsWith(".")) {
            profileNames.add(line);
        }
    }
    br.close();
    is.close();
    return profileNames;
}

//Mostly same as the original loadProfile method.
public static void loadProfileFromClasspath(String resource) throws LangDetectException {
    try {
        List<String> profileNames = getProfileNames(resource);
        int langsize = profileNames.size(), index = 0;
        for (String profileName : profileNames) {
            InputStream is = null;
            try {
                // Leading "/" added so the lookup starts at the classpath root,
                // matching getProfileNames() above.
                is = DetectorFactory.class.getResourceAsStream("/" + resource + "/" + profileName);
                LangProfile profile = JSON.decode(is, LangProfile.class);
                addProfile(profile, index, langsize);
                ++index;
            } catch (JSONException e) {
                throw new LangDetectException(ErrorCode.FormatError, "profile format error in '"
                        + profileName + "'");
            } catch (IOException e) {
                throw new LangDetectException(ErrorCode.FileLoadError, "can't open '" + profileName + "'");
            } finally {
                try {
                    if (is != null)
                        is.close();
                } catch (IOException e) {
                    // Nothing useful to do if close() fails
                }
            }
        }
    } catch (Exception e) {
        throw new LangDetectException(ErrorCode.NeedLoadProfileError,
                "Not found profile in classpath: " + resource);
    }
}
3. Build and package.
For instance, by running Maven with "package assembly:single".
4. Use the new method to load the profiles stored inside the jar (under
classpath).
    DetectorFactory.loadProfileFromClasspath( "profiles" );
    Detector detector = DetectorFactory.create();
    detector.append( "Hello world, this is a test." );
    System.out.println( detector.detect() );
I haven't tried the loadProfile(URI) method proposed above, because I wasn't
aware of this discussion page when I wrote the workaround. Anyway, the goal
seems to be the same, and I hope this also helps someone!
Best,
-Hideki Shima
Original comment by hideki.shima
on 17 Nov 2011 at 7:02
Thanks for your sample.
I've been asked about the same problem with Hadoop several times, so I'm
beginning to wonder whether I should support a profile-bundled jar... :D
language-detection now supports loadProfile(List<String>) in the trunk of the
repository.
You might be able to implement this more easily using that method.
http://code.google.com/p/language-detection/issues/detail?id=24
Original comment by nakatani.shuyo
on 22 Nov 2011 at 7:46
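To make the list-based suggestion above concrete, here is a rough sketch of reading a known set of profile files from the classpath into JSON strings. It assumes the profile names are listed explicitly (a directory inside a jar cannot be listed reliably through getResourceAsStream) and that the profiles are UTF-8 text. ClasspathProfileReader and its methods are hypothetical names, and the final hand-off to the list-based loader is shown only as a comment.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustrative only: ClasspathProfileReader is a hypothetical helper, not part of langdetect.
class ClasspathProfileReader {

    /** Reads dir/name for every name, using the given function to open each resource. */
    static List<String> readAll(String dir, List<String> names,
                                Function<String, InputStream> opener) {
        List<String> contents = new ArrayList<>();
        for (String name : names) {
            String path = dir + "/" + name;
            try (InputStream in = opener.apply(path)) {
                if (in == null) {
                    throw new IllegalArgumentException("resource not found: " + path);
                }
                contents.add(readFully(in));
            } catch (IOException e) {
                throw new UncheckedIOException("cannot read " + path, e);
            }
        }
        return contents;
    }

    /** Drains a stream into a String, assuming UTF-8 content. */
    static String readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Convenience variant that opens resources through this class's class loader. */
    static List<String> readAllFromClasspath(String dir, List<String> names) {
        ClassLoader loader = ClasspathProfileReader.class.getClassLoader();
        // ClassLoader resource names are always absolute and take no leading "/".
        return readAll(dir, names, loader::getResourceAsStream);
    }

    // Possible use, assuming the list-based loader mentioned above:
    //   List<String> json = readAllFromClasspath("profiles", Arrays.asList("en", "ja"));
    //   DetectorFactory.loadProfile(json);
}
```

Because the streams come from getResourceAsStream rather than File, the same code works whether the profiles sit in a directory or inside a jar.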
OK, it worked, but I needed to include the library jsonic-1.2.0.jar, and I put
the profiles folder in lib, so I only call NAMECLASS.init("lib/profiles"). I
think it would be better to put the profiles in the library and then remove the
function .loadProfile(String or File).
Best, leho
Original comment by lehotoms...@gmail.com
on 28 Mar 2012 at 5:32
Attachments:
If anybody is still wondering what a working example from the most recent Maven
build would look like, please refer to the attached file here.
Original comment by anto...@cloudangels.com
on 17 Jan 2013 at 11:11
Attachments:
[deleted comment]
Here's a solution in Clojure FWIW; though it doesn't autodetect the language
profiles that are available, I actually like having an easy reference of all of
the languages that might be detected in the codebase:
(->> #{"af" "ar" "bg" "bn" "cs" "da" "de" "el" "en" "es" "et" "fa" "fi" "fr" "gu"
       "he" "hi" "hr" "hu" "id" "it" "ja" "kn" "ko" "lt" "lv" "mk" "ml" "mr" "ne"
       "nl" "no" "pa" "pl" "pt" "ro" "ru" "sk" "sl" "so" "sq" "sv" "sw" "ta" "te"
       "th" "tl" "tr" "uk" "ur" "vi" "zh-cn" "zh-tw"}
     (map (partial str "profiles/"))
     (map (comp slurp clojure.java.io/resource))
     com.cybozu.labs.langdetect.DetectorFactory/loadProfile)
Original comment by c...@cemerick.com
on 25 Apr 2013 at 3:04
The file suggestion from Anto (#14) worked like a charm for profiles inside a
jar.
Original comment by h...@perrohunter.com
on 3 Dec 2014 at 8:54
Original issue reported on code.google.com by
kaspar.f...@gtempaccount.com
on 17 Feb 2011 at 7:16