thienbui / language-detection

Automatically exported from code.google.com/p/language-detection

Extend DetectorFactory.loadProfile() so that it works with profiles in JAR files #9

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I would like to propose overloading loadProfile() as follows, so that the 
profiles/ directory can be stored in a JAR file and its content loaded via 
something like this:

      DetectorFactory.loadProfile(MyClass.class.getResource("profiles").toURI());

Here is my proposal:

  /**
   * Load profiles from specified directory. This method (or its overloaded companion) must be called once before language
   * detection.
   * 
   * @param profileDirectory
   *          profile directory path
   * @throws LangDetectException
   *           Can't open profiles(error code = {@link ErrorCode#FileLoadError}) or profile's format
   *           is wrong (error code = {@link ErrorCode#FormatError})
   */
  public static void loadProfile(String profileDirectory) throws LangDetectException
  {
    loadProfile(new File(profileDirectory).toURI());
  }

  /**
   * Load profiles from specified directory. This method (or its overloaded companion) must be called once before language
   * detection.
   * 
   * @param profileDirectory
   *          profile directory path as a URI
   * @throws LangDetectException
   *           Can't open profiles(error code = {@link ErrorCode#FileLoadError}) or profile's format
   *           is wrong (error code = {@link ErrorCode#FormatError})
   */
  public static void loadProfile(URI profileDirectory) throws LangDetectException
  {
    File dir = new File(profileDirectory);
    File[] listFiles = dir.listFiles();
    if (listFiles == null)
      throw new LangDetectException(ErrorCode.NeedLoadProfileError, "Not found profile directory: " + profileDirectory);

    int langsize = listFiles.length, index = 0;
    for (File file : listFiles)
    {
      if (file.getName().startsWith(".") || !file.isFile()) continue;
      FileInputStream is = null;
      try
      {
        is = new FileInputStream(file);
        LangProfile profile = JSON.decode(is, LangProfile.class);
        addProfile(profile, index, langsize);
        ++index;
      }
      catch (JSONException e)
      {
        throw new LangDetectException(ErrorCode.FormatError, "profile format error in '" + file.getName() + "'");
      }
      catch (IOException e)
      {
        throw new LangDetectException(ErrorCode.FileLoadError, "can't open '" + file.getName() + "'");
      }
      finally
      {
        try
        {
          if (is != null) is.close();
        }
        catch (IOException e)
        {
        }
      }
    }
  }

Original issue reported on code.google.com by kaspar.f...@gtempaccount.com on 17 Feb 2011 at 7:16
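
One caveat about the proposal above: java.io.File only accepts file: URIs, so a profiles directory that really sits inside a JAR has to be enumerated through the archive itself. The following is a minimal sketch of that idea using java.util.zip; the class name JarProfileLister and the throwaway sample archive are assumptions made for illustration and are not part of langdetect.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of langdetect.
final class JarProfileLister {

    /** Returns the sorted names of the files directly under dir + "/" in the archive. */
    static List<String> listProfiles(Path jar, String dir) {
        String prefix = dir + "/";
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                if (entry.isDirectory() || !name.startsWith(prefix)) continue;
                String simple = name.substring(prefix.length());
                // Skip hidden files and nested directories, as the original loop does.
                if (simple.isEmpty() || simple.startsWith(".") || simple.contains("/")) continue;
                names.add(simple);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(names);
        return names;
    }

    /** Builds a small sample archive so the sketch runs without langdetect.jar. */
    static Path buildSampleJar() {
        try {
            Path jar = Files.createTempFile("profiles-demo", ".jar");
            try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(jar))) {
                String[] entries = {"profiles/en", "profiles/fr", "profiles/.hidden", "other/readme"};
                for (String name : entries) {
                    out.putNextEntry(new ZipEntry(name));
                    out.write("{}".getBytes(StandardCharsets.UTF_8));
                    out.closeEntry();
                }
            }
            return jar;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(listProfiles(buildSampleJar(), "profiles")); // prints [en, fr]
    }
}
```

Each name returned this way could then be opened as a stream from the same archive and decoded like the files in the loop above.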

GoogleCodeExporter commented 9 years ago
Thanks for your proposal.

langdetect ships the profiles separately so that users can select only the 
language profiles they need.
So to adopt your proposal, I'm afraid langdetect would need to provide both 
jars, with and without profiles...

But as you say, a library that includes the profiles is easier to use...

Original comment by nakatani.shuyo on 18 Feb 2011 at 3:42

GoogleCodeExporter commented 9 years ago
Dear Nakatani,

You can still provide langdetect in a single jar, as you do now.

However, I package language profiles inside a jar in my application. And using 
the code I provide, I can access them without unzipping the jar.

So as you say, it will be easier to use your library.

Thanks!

Original comment by kaspar.f...@gtempaccount.com on 18 Feb 2011 at 10:50

GoogleCodeExporter commented 9 years ago
I see. Then I'll try to include your proposal. Thanks!

Original comment by nakatani.shuyo on 21 Feb 2011 at 3:16

GoogleCodeExporter commented 9 years ago
I've added loadProfile(File) to DetectorFactory and committed a new 
langdetect.jar, since I'd like to keep it to a single File constructor call.

    http://code.google.com/p/language-detection/source/browse/trunk/lib/langdetect.jar

I think you can do what your report describes as follows.

    DetectorFactory.loadProfile(new File(MyClass.class.getResource("profiles").toURI()));

Would that work for you?

Original comment by nakatani.shuyo on 24 Feb 2011 at 7:56
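
A note on the call above: new File(URI) works while the resource resolves to a file: URI (for example when running from a classes directory), but the constructor throws IllegalArgumentException for the jar: URIs that getResource() returns for entries inside a JAR. A small check, with the class name FileUriCheck and both paths invented for illustration:

```java
import java.io.File;
import java.net.URI;

// Hypothetical helper for illustration; the paths are made up and need not exist.
final class FileUriCheck {

    /** Returns true if java.io.File accepts the URI, false if its constructor rejects it. */
    static boolean fileAccepts(URI uri) {
        try {
            new File(uri);
            return true;
        } catch (IllegalArgumentException e) {
            // Thrown when the URI is not an absolute, hierarchical file: URI.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(fileAccepts(URI.create("file:/tmp/profiles")));              // prints true
        System.out.println(fileAccepts(URI.create("jar:file:/tmp/app.jar!/profiles"))); // prints false
    }
}
```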

GoogleCodeExporter commented 9 years ago
I'm having trouble loading the profile. 
In NetBeans, where am I supposed to put the profile folder? 

Thanks. 

Original comment by jkoe...@gmail.com on 4 May 2011 at 8:49

GoogleCodeExporter commented 9 years ago
I don't use netbeans...
Could you specify the absolute path of the profile directory for 
DetectorFactory.loadProfile?

Original comment by nakatani.shuyo on 6 May 2011 at 3:47

GoogleCodeExporter commented 9 years ago
Currently if DetectorFactory.loadProfile() of any form has not been called, 
then detector.detect() throws an exception. 

I would suggest having detect() fall back to the default profiles directory 
packaged with the library. It gives excellent results out of the box, suitable 
for most cases. (Right now I have to copy the profiles directory into my 
project's resources, figure out the path, and make sure it stays in place 
if the final artifact is packaged differently.)

The attached Java file contains the proposed modification to 
DetectorFactory.createDetector():

static private Detector createDetector() throws LangDetectException {
    if (instance_.langlist.size()==0) {
        try {
            // Fall back to the default profiles
            loadProfile(new File(Detector.class.getResource("/profiles").toURI()));
        } catch (URISyntaxException e) {
            // The next clause will throw the exception
        }
    }
    if (instance_.langlist.size()==0)
        throw new LangDetectException(ErrorCode.NeedLoadProfileError, "need to load profiles");
    Detector detector = new Detector(instance_);
    return detector;
}

Also I added 2 create() factory methods:

* static public Detector create(String text)
* static public Detector create(Reader reader)

These modifications would allow detecting the language with a single call:

String language = DetectorFactory.create(text).detect();

Original comment by vasilievsi@gmail.com on 7 Jul 2011 at 9:46

GoogleCodeExporter commented 9 years ago
An alternative way is to leave createDetector() as it is, and add a static 
initialization (attached):

...
    static private DetectorFactory instance_ = new DetectorFactory();
    static {
        try {
            // Load default profiles
            loadProfile(new File(Detector.class.getResource("/profiles").toURI()));
        } catch (URISyntaxException e) {
            // If default profiles failed to load, other profiles can be loaded later 
        } catch (LangDetectException e) {
        }
    }

Original comment by vasilievsi@gmail.com on 7 Jul 2011 at 10:16
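
The two variants in the comments above differ in when the defaults are loaded: the createDetector() fallback loads them only if nothing else has been loaded by the time a detector is requested, while the static initializer loads them up front. A rough sketch of the lazy variant follows; LazyProfileRegistry and its language codes are hypothetical stand-ins and do not reflect DetectorFactory's real internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the factory's profile list; not part of langdetect.
final class LazyProfileRegistry {
    // Assumed default set, for illustration only.
    private static final List<String> DEFAULT_LANGS = Arrays.asList("en", "fr", "ja");

    private final List<String> langs = new ArrayList<>();

    /** Explicit loading, analogous to calling loadProfile() yourself. */
    synchronized void load(List<String> profiles) {
        langs.addAll(profiles);
    }

    /** Falls back to the defaults only if nothing was loaded explicitly. */
    synchronized List<String> languagesForDetection() {
        if (langs.isEmpty()) {
            langs.addAll(DEFAULT_LANGS);
        }
        return Collections.unmodifiableList(new ArrayList<>(langs));
    }

    public static void main(String[] args) {
        LazyProfileRegistry untouched = new LazyProfileRegistry();
        System.out.println(untouched.languagesForDetection()); // prints [en, fr, ja]

        LazyProfileRegistry custom = new LazyProfileRegistry();
        custom.load(Arrays.asList("de"));
        System.out.println(custom.languagesForDetection());    // prints [de]
    }
}
```

With the lazy form, a caller who loads a custom profile set first never pays for the defaults.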

GoogleCodeExporter commented 9 years ago
langdetect has reasons for its current interface.

First, I was not quite satisfied with other libraries that bundle profiles in 
the jar file, so even the default language profiles live outside langdetect's 
jar file.

I also thought it should provide a Java-like interface, so creating an 
instance and detecting languages are separate steps.
But I understand what you want to do, since I also like some functional 
languages, Ruby, and so on. :D

Original comment by nakatani.shuyo on 11 Jul 2011 at 11:09

GoogleCodeExporter commented 9 years ago
Right, good to have an interface implemented (though I see none yet). But I'm 
not trying to change the interface, just adding a couple of create() methods, 
like you have overloaded loadProfile(), nothing more, just one more way of 
doing things. It's all about usability.

To the second part, Java libraries are used outside Java too.

Cheers, and good luck to your project,
Sergei

Original comment by vasilievsi@gmail.com on 13 Jul 2011 at 10:05

GoogleCodeExporter commented 9 years ago
Hi Nakatani-san,

Thank you for the nice software.

I needed to package up the langdetect class files, profiles and additional 
class files into one jar in order to run a Hadoop job. In that kind of 
scenario, loadProfile(File) isn't enough; a File object cannot refer to a file 
inside a jar.

(And I understood your policy of not providing both jars with/without a 
profile. That's perfectly fine.)

Given that, I'd like to share a workaround I used for my task:

1. Copy the profiles dir to any directory on the classpath.

Since I'm a maven user, I put it here: src/main/resources/profiles

2. Add the following two methods to DetectorFactory.

  private static List<String> getProfileNames(String resourceName) throws IOException {
    List<String> profileNames = new ArrayList<String>();
    InputStream is = DetectorFactory.class.getResourceAsStream("/" + resourceName);
    BufferedReader br = new BufferedReader(new InputStreamReader(is));
    String line = null;
    while ((line = br.readLine()) != null) {
      if (!line.startsWith(".")) {
        profileNames.add(line);
      }
    }
    br.close();
    is.close();
    return profileNames;
  }

  //Mostly same as the original loadProfile method.
  public static void loadProfileFromClasspath(String resource) throws LangDetectException {
    try {
      List<String> profileNames = getProfileNames(resource);

      int langsize = profileNames.size(), index = 0;
      for (String profileName : profileNames) {
        InputStream is = null;
        try {
          // Leading "/" makes the lookup classpath-absolute, matching getProfileNames()
          is = DetectorFactory.class.getResourceAsStream("/" + resource + "/" + profileName);
          LangProfile profile = JSON.decode(is, LangProfile.class);
          addProfile(profile, index, langsize);
          ++index;
        } catch (JSONException e) {
          throw new LangDetectException(ErrorCode.FormatError, "profile format error in '"
                  + profileName + "'");
        } catch (IOException e) {
          throw new LangDetectException(ErrorCode.FileLoadError, "can't open '" + profileName + "'");
        } finally {
          try {
            if (is != null)
              is.close();
          } catch (IOException e) {
          }
        }
      }
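    } catch (LangDetectException e) {
      // Rethrow as-is so the FormatError / FileLoadError codes set above are not overwritten
      throw e;
    } catch (Exception e) {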
      throw new LangDetectException(ErrorCode.NeedLoadProfileError,
              "Not found profile in classpath: " + resource);
    }
  }

3. Build and package.

For instance by using a maven target "package assembly:single".

4. Use the new method to load the profiles stored inside the jar (under 
classpath).

  DetectorFactory.loadProfileFromClasspath( "profiles" );
  Detector detector = DetectorFactory.create();
  detector.append( "Hello world, this is a test." );
  System.out.println( detector.detect() );

I haven't tried the loadProfile(URI) method proposed above, because I wasn't 
aware of this discussion page when I wrote the workaround. Anyway, the goal 
seems to be the same, and I hope this also helps someone!

Best,

-Hideki Shima

Original comment by hideki.shima on 17 Nov 2011 at 7:02

GoogleCodeExporter commented 9 years ago
Thanks for your sample.
I've been asked about the same Hadoop problem several times, so I'm beginning 
to wonder whether I should support a profile-bundled jar... :D

language-detection now supports loadProfile(List<String>) in the trunk of the 
repository.
You might find it easier to implement this using that method.

http://code.google.com/p/language-detection/issues/detail?id=24

Original comment by nakatani.shuyo on 22 Nov 2011 at 7:46
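
As a rough sketch of how the List<String> overload mentioned above could be fed from a JAR: read each profile through a class loader and collect the contents as strings. The class ClasspathProfileReader, the temporary jar, and the stand-in JSON are assumptions for illustration; the final call into DetectorFactory is left as a comment because it needs langdetect on the classpath, and the list of profile names still has to be supplied by the caller.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of langdetect.
final class ClasspathProfileReader {

    /** Reads dir/name for every name via the class loader and returns the contents as strings. */
    static List<String> readProfiles(ClassLoader loader, String dir, List<String> names) {
        List<String> json = new ArrayList<>();
        for (String name : names) {
            String resource = dir + "/" + name; // ClassLoader lookups take no leading "/"
            try (InputStream in = loader.getResourceAsStream(resource)) {
                if (in == null) {
                    throw new IllegalArgumentException("Missing resource: " + resource);
                }
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                for (int n; (n = in.read(chunk)) != -1; ) {
                    buf.write(chunk, 0, n);
                }
                json.add(new String(buf.toByteArray(), StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return json;
    }

    /** Packs two stand-in profiles into a temporary jar and reads them back through a class loader. */
    static List<String> demo() {
        try {
            Path jar = Files.createTempFile("profiles-demo", ".jar");
            try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(jar))) {
                for (String lang : new String[] {"en", "fr"}) {
                    out.putNextEntry(new ZipEntry("profiles/" + lang));
                    out.write(("{\"name\":\"" + lang + "\"}").getBytes(StandardCharsets.UTF_8));
                    out.closeEntry();
                }
            }
            try (URLClassLoader loader = new URLClassLoader(new URL[] {jar.toUri().toURL()}, null)) {
                return readProfiles(loader, "profiles", Arrays.asList("en", "fr"));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> profiles = demo();
        System.out.println(profiles.size()); // prints 2
        // With langdetect available, the strings would be handed over roughly like this:
        // DetectorFactory.loadProfile(profiles);
    }
}
```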

GoogleCodeExporter commented 9 years ago
OK, it worked, but I needed to include the library jsonic-1.2.0.jar, and I put 
the profiles folder in lib, so the only call I need is 
NAMECLASS.init("lib/profiles"). I think it would be better to put the profiles 
in the library and then delete the function loadProfile(String or File).
Best, leho

Original comment by lehotoms...@gmail.com on 28 Mar 2012 at 5:32

GoogleCodeExporter commented 9 years ago
If anybody is still wondering what a working example from the most recent maven 
build would look like, please refer to the attached file here.

Original comment by anto...@cloudangels.com on 17 Jan 2013 at 11:11

GoogleCodeExporter commented 9 years ago
Here's a solution in Clojure FWIW; though it doesn't autodetect the language 
profiles that are available, I actually like having an easy reference of all of 
the languages that might be detected in the codebase:

(->> #{"af" "ar" "bg" "bn" "cs" "da" "de" "el" "en" "es" "et" "fa" "fi" "fr" "gu"
       "he" "hi" "hr" "hu" "id" "it" "ja" "kn" "ko" "lt" "lv" "mk" "ml" "mr" "ne"
       "nl" "no" "pa" "pl" "pt" "ro" "ru" "sk" "sl" "so" "sq" "sv" "sw" "ta" "te"
       "th" "tl" "tr" "uk" "ur" "vi" "zh-cn" "zh-tw"}
     (map (partial str "profiles/"))
     (map (comp slurp clojure.java.io/resource))
     com.cybozu.labs.langdetect.DetectorFactory/loadProfile)

Original comment by c...@cemerick.com on 25 Apr 2013 at 3:04

GoogleCodeExporter commented 9 years ago
The file suggestion from Anto (#14) worked like a charm for profiles inside a 
jar

Original comment by h...@perrohunter.com on 3 Dec 2014 at 8:54