Resources not found when referencing external .jar files

NightOwl888 commented 5 months ago

The documentation for Stanford CoreNLP states that the POM configuration should be:

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>4.4.0</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>4.4.0</version>
    <classifier>models</classifier>
</dependency>

So, my project file looks like this:

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net6.0</TargetFramework>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="IKVM" Version="8.7.4"/>
    <PackageReference Include="IKVM.Maven.Sdk" Version="1.6.7" />
  </ItemGroup>

  <ItemGroup>
    <MavenReference Include="edu.stanford.nlp:stanford-corenlp" Version="4.5.5" />
    <MavenReference Include="edu.stanford.nlp:stanford-corenlp" Version="4.5.5" Classifier="models" />
  </ItemGroup>

</Project>

When the MavenReference items are specified this way, the build successfully downloads the .jar files, including the one with the models classifier, into the local Maven cache.

However, there is no build output for the stanford-corenlp-4.5.5-models.jar.

Then when I try to run a simple example, it cannot find the models.

using edu.stanford.nlp.pipeline;
using java.util;

namespace IkvmMavenMissingResourcesError
{
    internal class Program
    {
        static void Main(string[] args)
        {
            // Initialize Stanford CoreNLP for sentiment analysis
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, sentiment");
            var pipeline = new StanfordCoreNLP(props);
        }
    }
}

Error

edu.stanford.nlp.io.RuntimeIOException
  HResult=0x80131500
  Message=Error while loading a tagger model (probably missing model file)
  Source=stanford.corenlp
  StackTrace:
   at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(Properties config, String modelFileOrUrl, Boolean printLoading)
   at edu.stanford.nlp.tagger.maxent.MaxentTagger..ctor(String modelFile, Properties config, Boolean printLoading)
   at edu.stanford.nlp.tagger.maxent.MaxentTagger..ctor(String modelFile)
   at edu.stanford.nlp.pipeline.POSTaggerAnnotator.loadModel(String loc, Boolean verbose)
   at edu.stanford.nlp.pipeline.POSTaggerAnnotator..ctor(String annotatorName, Properties props)
   at edu.stanford.nlp.pipeline.AnnotatorImplementations.posTagger(Properties properties)
   at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$getNamedAnnotators$6(Properties props, AnnotatorImplementations impl)
   at edu.stanford.nlp.pipeline.StanfordCoreNLP.__<>Anon7.apply(Object , Object )
   at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$null$33(Entry entry, Properties inputProps, AnnotatorImplementations annotatorImplementation)
   at edu.stanford.nlp.pipeline.StanfordCoreNLP.__<>Anon41.get()
   at edu.stanford.nlp.util.Lazy.3.compute()
   at edu.stanford.nlp.util.Lazy.get()
   at edu.stanford.nlp.pipeline.AnnotatorPool.get(String name)
   at edu.stanford.nlp.pipeline.StanfordCoreNLP..ctor(Properties props, Boolean enforceRequirements, AnnotatorPool annotatorPool)
   at edu.stanford.nlp.pipeline.StanfordCoreNLP..ctor(Properties props, Boolean enforceRequirements)
   at edu.stanford.nlp.pipeline.StanfordCoreNLP..ctor(Properties props)
   at IkvmMavenMissingResourcesError.Program.Main(String[] args) in F:\Users\shad\source\repos\IkvmMavenMissingResourcesError\IkvmMavenMissingResourcesError\Program.cs:line 14

  This exception was originally thrown at this call stack:
    [External Code]

Inner Exception 1:
IOException: Unable to open "edu/stanford/nlp/models/pos-tagger/english-left3words-distsim.tagger" as class path, filename or URL

It looks like this PR has some workarounds for loading the resource files, which I will pursue: https://github.com/sergey-tihon/Stanford.NLP.NET/pull/130/files#diff-9b0fd7e079a9dfbbaa7589009e76812239c92547bf40b6668d365b647207ed59R42-R57. Still, it would be nice if this could be fixed so that adding a MavenReference for the resources is enough for them to be discovered automatically.

wasabii commented 5 months ago

So, what happened here: everything worked, except that it determined the assembly name to be "stanford.corenlp.dll", the same as for the non-models classifier. That produced two identically named DLLs, so one replaced the other.

And we aren't incorporating the classifier name into the DLL name anywhere, so that's expected. I wonder what we should do.

wasabii commented 5 months ago

I'm not sure we can do anything here.

The -models.jar file does not contain an Automatic-Module-Name entry, so the module name has to be inferred from the JAR file name. According to the OpenJDK specification, the inferred name is stanford.corenlp@4.5.5-models, which collides with the inferred automatic module name of the main JAR, stanford.corenlp@4.5.5. Same name, different version.

And of course we cannot allow a non-deterministic result for this, as it would mess up any future dependency hierarchy.

I think a bug needs to be opened upstream asking them to follow the recommendations and define a module name explicitly.
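For readers following along, the inference described above can be sketched in a few lines of Java. This is a simplified re-implementation of the file-name rules documented for the JDK's ModuleFinder.of (strip ".jar", cut at the first hyphen followed by digits, map non-alphanumerics to dots). It is not IKVM's or the JDK's actual code, and the names AutoModuleName and derive are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AutoModuleName {
    // Start of a version suffix: a hyphen, then digits, then a dot or end of name.
    private static final Pattern VERSION_START = Pattern.compile("-(\\d+(\\.|$))");

    /** Derives a module name from a JAR file name, ignoring any manifest entry. */
    static String derive(String jarFileName) {
        // 1. Drop the ".jar" suffix.
        String name = jarFileName.endsWith(".jar")
                ? jarFileName.substring(0, jarFileName.length() - 4)
                : jarFileName;
        // 2. Everything from the first "-<digits>." onward counts as the version.
        Matcher m = VERSION_START.matcher(name);
        if (m.find()) {
            name = name.substring(0, m.start());
        }
        // 3. Non-alphanumerics become dots; collapse runs; trim dots at both ends.
        return name.replaceAll("[^A-Za-z0-9]", ".")
                   .replaceAll("\\.{2,}", ".")
                   .replaceAll("^\\.|\\.$", "");
    }

    public static void main(String[] args) {
        System.out.println(derive("stanford-corenlp-4.5.5.jar"));        // stanford.corenlp
        System.out.println(derive("stanford-corenlp-4.5.5-models.jar")); // stanford.corenlp
    }
}
```

Because the classifier comes after the version in Maven's file-name convention, it falls into the part that is treated as the version string, so both files end up with the same module name.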

NightOwl888 commented 5 months ago

The classifier is just a string. In this case, there are other resource files that can also be added on, such as models-chinese or models-arabic, etc. I have seen other projects use a variety of other values for classifier. Some have special meanings (like sources or pom).

So, I imagine there needs to be special handling for the ones with special meanings, then the rest should be compiled into assemblies. IMO it makes more sense for them to be assemblies separate from the main jar, but I'm not sure how well our classpath loader works across assemblies like that.

As for the naming, why not simply tack the classifier on the end (cleaned of special characters, of course)? It is guaranteed to be unique because it is within the unique name of the jar.

NightOwl888 commented 5 months ago

https://www.baeldung.com/maven-artifact-classifiers

NightOwl888 commented 5 months ago

As for following the spec - this doesn't seem any different than how we compile satellite assemblies in .NET. Sometimes it makes sense to separate resources physically from code, especially for localization.

wasabii commented 5 months ago

These are not satellite assemblies. From IKVM's point of view, they are simply JARs. JARs become assemblies. Java has no separate concept of satellite anything: it's all on the class/module path.

It is not really our choice as to what special handling can be added or not, as long as we stick to the JDK9+ specification. The algorithm is described on this page:

https://docs.oracle.com/javase/9/docs/api/java/lang/module/ModuleFinder.html

Our decision of using the JDK9+ module specification in the first place to determine assembly names, however, was our choice. But, some choice had to be made. And I'm not sure there was any other choice available that fulfilled the project goals.

wasabii commented 5 months ago

With guidance here:

https://dev.java/learn/modules/automatic-module/

It might look as if the critical mistake is to require a plain JAR by a module name that is based on its file name. But that's not generally the case - using this approach is perfectly fine for applications and in other scenarios where the developer has full control over the module descriptors requiring such automatic modules. No, the mistake is to publish modules with such dependencies to a public repository. Only then can users come into a situation where a module implicitly depends on details that they have no control over and that can lead to additional work or even unresolvable divergences.

So you should never publish (to an openly accessible repository) modules that require a plain JAR without an Automatic-Module-Name entry in its manifest. Only with that entry are automatic module names sufficiently stable to rely on. Yes, that might mean that you can not yet publish a modularized version of your library or framework and must wait for your dependencies to add that entry. That's unfortunate, but doing it anyway would be a great disservice to your users.

wasabii commented 5 months ago

I should note, we had a similar situation with Apache Tika. They had some JAR file published which had an incorrect Automatic-Module-Name entry. It was a typo. They fixed it in 24 hours.

But at least they bothered to include entries!

NightOwl888 commented 5 months ago

So you should never publish (to an openly accessible repository) modules that require a plain JAR without an Automatic-Module-Name entry in its manifest. Only with that entry are automatic module names sufficiently stable to rely on.

So, you are saying this is the bug, and if they fix it, it works on our end?

wasabii commented 5 months ago

I would call it a bug. But it's not as clear cut as just being a defect. With JDK9, every JAR file that is publicly published SHOULD have an automatic module name. But, IKVM doesn't implement much if anything from JDK9. But, we do pick this one bit, because we needed SOMETHING to deterministically predict a unique identifier for a JAR file within the Java ecosystem that we could piggyback on for assembly names. And this was available from JDK9 onward. Projects have since added it to their JAR files. Even for JDK8. So their JAR files can be considered modules when running on JDK9.

They should have an interest in fixing it. I would expect them to. It has ramifications outside IKVM.

The Tika people, for example, were very motivated to fix it.

wasabii commented 5 months ago

See, for example, this query for bugs related to AMNs in Apache projects:

https://issues.apache.org/jira/browse/CURATOR-550?jql=text%20~%20%22automatic%20module%22

Everybody has motivation to add them and get them right.

wasabii commented 5 months ago

In a sense though, there is a natural conflict here. Maven conventions are for a specific file name pattern for classifiers. This pattern conflicts with the Automatic-Module-Name discovery mechanism in JDK9, making it impossible to infer a unique module name. Thing is, it's not OUR conflict. It's not like any JVM could infer a unique module name either.

The real world issue is just a matter of the extent to which some tool requires a unique module name or not. We do. But, so do others.
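The point that a JVM cannot tell the two files apart either can be tried out with the JDK's own ModuleFinder. The sketch below is only an illustration under assumptions: it writes two empty stand-in JARs using the file names from this thread (not the real CoreNLP artifacts), and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.lang.module.FindException;
import java.lang.module.ModuleFinder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.Attributes;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

final class ModuleNameCollision {

    /** Writes a JAR that holds only a bare manifest (no Automatic-Module-Name). */
    static void writeEmptyJar(Path jar) {
        Manifest manifest = new Manifest();
        manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        try (OutputStream out = Files.newOutputStream(jar);
             JarOutputStream jos = new JarOutputStream(out, manifest)) {
            // The manifest is the only entry.
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a temp directory with stand-ins for the two 4.5.5 JAR file names. */
    static Path demoDirectory() {
        try {
            Path dir = Files.createTempDirectory("amn-demo");
            writeEmptyJar(dir.resolve("stanford-corenlp-4.5.5.jar"));
            writeEmptyJar(dir.resolve("stanford-corenlp-4.5.5-models.jar"));
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Module name the JDK assigns to a single JAR looked at in isolation. */
    static String inferredName(Path jar) {
        return ModuleFinder.of(jar).findAll().iterator().next().descriptor().name();
    }

    /** True if scanning the directory as a module path is rejected. */
    static boolean directoryScanFails(Path dir) {
        try {
            ModuleFinder.of(dir).findAll();
            return false;
        } catch (FindException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Path dir = demoDirectory();
        System.out.println(inferredName(dir.resolve("stanford-corenlp-4.5.5.jar")));
        System.out.println(inferredName(dir.resolve("stanford-corenlp-4.5.5-models.jar")));
        System.out.println("directory scan fails: " + directoryScanFails(dir));
    }
}
```

On a JDK 9 or later runtime this should report the same inferred name for both files. Per the ModuleFinder.of documentation, a directory containing more than one module with the same name is an error, so the directory scan is expected to be rejected.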

NightOwl888 commented 5 months ago

Automatic-Module-Name is missing from the "plain jar" as well as the other jars. So, this will need to be put into all of them.

In a sense though, there is a natural conflict here. Maven conventions are for a specific file name pattern for classifiers. This pattern conflicts with the Automatic-Module-Name discovery mechanism in JDK9, making it impossible to infer a unique module name. Thing is, it's not OUR conflict. It's not like any JVM could infer a unique module name either.

The real world issue is just a matter of the extent to which some tool requires a unique module name or not. We do. But, so do others.

So, do we specifically need to spell out that we need each Automatic-Module-Name entry to be unique, or if they follow the spec will it work out that way anyway? I just want to be sure I understand what the recommended fix is before reporting it to them.
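Whether a given JAR carries the entry can be checked mechanically before filing the report. The following is a minimal sketch using only java.util.jar; the class name AmnCheck, the helper writeDemoJar, and the value com.example.demo are hypothetical and exist only so the example can run without the real artifacts.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

final class AmnCheck {

    /** Returns the Automatic-Module-Name main attribute, or null when it is absent. */
    static String automaticModuleName(Path jar) {
        try (JarFile jf = new JarFile(jar.toFile())) {
            Manifest manifest = jf.getManifest();
            return manifest == null
                    ? null
                    : manifest.getMainAttributes().getValue("Automatic-Module-Name");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: writes a temp JAR whose manifest has the entry only if amn != null. */
    static Path writeDemoJar(String amn) {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        if (amn != null) {
            main.put(new Attributes.Name("Automatic-Module-Name"), amn);
        }
        try {
            Path jar = Files.createTempFile("amn-check", ".jar");
            try (OutputStream out = Files.newOutputStream(jar);
                 JarOutputStream jos = new JarOutputStream(out, manifest)) {
                // Manifest only; no other entries are needed for the check.
            }
            return jar;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass paths to real JARs, for example files from the local Maven cache.
        for (String arg : args) {
            System.out.println(arg + " -> " + automaticModuleName(Paths.get(arg)));
        }
    }
}
```

A null result means the name will be inferred from the file name. The entries do not become unique automatically; each published JAR has to declare its own distinct value.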

wasabii commented 5 months ago

I usually write something like:

The Stanford CoreNLP JAR files from Maven that I have examined thus far lack a JDK9 Automatic-Module-Name entry in their MANIFEST.MF files and are thus undiscoverable as unique modules by tooling that expects module names complying with this specification.

As mentioned at https://dev.java/learn/modules/automatic-module/ and in many other places, it is ideal that an Automatic-Module-Name entry be included in JAR files that are published publicly, so that tooling that requires it can operate properly.

In this particular case, for example, tooling is unable to locate the Automatic-Module-Name entry and thus falls back to the 'inference' specification described at https://docs.oracle.com/javase/9/docs/api/java/lang/module/ModuleFinder.html. For the JARs published to Maven under the 'models' classifier, the file names (stanford-corenlp-4.5.5-models.jar, etc.) would be inferred to possess the module name "stanford.corenlp", which would overlap with the module name of the core library (stanford-corenlp-4.5.5.jar), thus causing a duplicate. This could be resolved by including an explicit entry.

Something like that.

wasabii commented 5 months ago

For a more indepth analysis of the overall issue, see: https://blog.joda.org/2017/05/java-se-9-jpms-automatic-modules.html

NightOwl888 commented 5 months ago

Stanford CoreNLP 4.5.6 now uses Automatic-Module-Name in their packages. So, I updated the project:

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net6.0</TargetFramework>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="IKVM" Version="8.7.5"/>
    <PackageReference Include="IKVM.Maven.Sdk" Version="1.6.7" />
  </ItemGroup>

  <ItemGroup>
    <MavenReference Include="edu.stanford.nlp:stanford-corenlp" Version="4.5.6" />
    <MavenReference Include="edu.stanford.nlp:stanford-corenlp" Version="4.5.6" Classifier="models" />
  </ItemGroup>

</Project>

And now both DLLs get generated.

But IKVM still doesn't see the resources from the models when calling into the main DLL.

edu.stanford.nlp.io.RuntimeIOException
  HResult=0x80131500
  Message=Error while loading a tagger model (probably missing model file)
  Source=edu.stanford.nlp.corenlp
  StackTrace:
   at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(Properties config, String modelFileOrUrl, Boolean printLoading)
   at edu.stanford.nlp.tagger.maxent.MaxentTagger..ctor(String modelFile, Properties config, Boolean printLoading)
   at edu.stanford.nlp.tagger.maxent.MaxentTagger..ctor(String modelFile)
   at edu.stanford.nlp.pipeline.POSTaggerAnnotator.loadModel(String loc, Boolean verbose)
   at edu.stanford.nlp.pipeline.POSTaggerAnnotator..ctor(String annotatorName, Properties props)
   at edu.stanford.nlp.pipeline.AnnotatorImplementations.posTagger(Properties properties)
   at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$getNamedAnnotators$6(Properties props, AnnotatorImplementations impl)
   at edu.stanford.nlp.pipeline.StanfordCoreNLP.__<>Anon7.apply(Object , Object )
   at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$null$33(Entry entry, Properties inputProps, AnnotatorImplementations annotatorImplementation)
   at edu.stanford.nlp.pipeline.StanfordCoreNLP.__<>Anon41.get()
   at edu.stanford.nlp.util.Lazy.3.compute()
   at edu.stanford.nlp.util.Lazy.get()
   at edu.stanford.nlp.pipeline.AnnotatorPool.get(String name)
   at edu.stanford.nlp.pipeline.StanfordCoreNLP..ctor(Properties props, Boolean enforceRequirements, AnnotatorPool annotatorPool)
   at edu.stanford.nlp.pipeline.StanfordCoreNLP..ctor(Properties props, Boolean enforceRequirements)
   at edu.stanford.nlp.pipeline.StanfordCoreNLP..ctor(Properties props)
   at IkvmMavenMissingResourcesError.Program.Main(String[] args) in F:\Users\shad\source\repos\IkvmMavenMissingResourcesError\IkvmMavenMissingResourcesError\Program.cs:line 36

  This exception was originally thrown at this call stack:
    [External Code]

Inner Exception 1:
IOException: Unable to open "edu/stanford/nlp/models/pos-tagger/english-left3words-distsim.tagger" as class path, filename or URL

I am still using the original 3 lines of code that caused the error.

using edu.stanford.nlp.pipeline;
using java.util;

namespace IkvmMavenMissingResourcesError
{
    internal class Program
    {
        static void Main(string[] args)
        {
            // Initialize Stanford CoreNLP for sentiment analysis
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, sentiment");
            var pipeline = new StanfordCoreNLP(props);
        }
    }
}

So, it looks like the default class loader still doesn't emulate what Java does.
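For comparison, what plain Java does here can be sketched as follows: a resource resolves only through a class loader that actually has the JAR containing it on its search path. This is an illustration, not CoreNLP's loading code. The resource path is copied from the IOException above, the JAR is a one-entry stand-in written by the demo, and the names ResourceProbe, isVisible, writeJarWith and loaderFor are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

final class ResourceProbe {
    // Resource path copied from the inner IOException in this thread.
    static final String TAGGER =
            "edu/stanford/nlp/models/pos-tagger/english-left3words-distsim.tagger";

    /** True if the given loader can see the named resource. */
    static boolean isVisible(ClassLoader loader, String resource) {
        return loader.getResource(resource) != null;
    }

    /** Demo helper: writes a temp JAR holding a single one-byte placeholder resource. */
    static Path writeJarWith(String resource) {
        try {
            Path jar = Files.createTempFile("models-demo", ".jar");
            try (JarOutputStream jos = new JarOutputStream(Files.newOutputStream(jar))) {
                jos.putNextEntry(new JarEntry(resource));
                jos.write(new byte[] {0});
                jos.closeEntry();
            }
            return jar;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Loader whose search path is exactly the given JARs, on top of the platform loader. */
    static URLClassLoader loaderFor(Path... jars) {
        URL[] urls = new URL[jars.length];
        try {
            for (int i = 0; i < jars.length; i++) {
                urls[i] = jars[i].toUri().toURL();
            }
        } catch (MalformedURLException e) {
            throw new UncheckedIOException(e);
        }
        return new URLClassLoader(urls, ClassLoader.getPlatformClassLoader());
    }

    public static void main(String[] args) {
        Path jar = writeJarWith(TAGGER);
        System.out.println("without models JAR: " + isVisible(loaderFor(), TAGGER));
        System.out.println("with models JAR:    " + isVisible(loaderFor(jar), TAGGER));
    }
}
```

In Java the models JAR becomes visible simply by being listed on the class path. The sketch makes no claim about how IKVM maps assemblies to class loaders.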

wasabii commented 5 months ago

Did you preload the assembly?

NightOwl888 commented 5 months ago

Preload? No. All of the code is posted above. Are you saying this is a requirement?

wasabii commented 5 months ago

It always has been, if there is no direct type reference.

GeorgeS2019 commented 4 months ago

@NightOwl888

Could you get Stanford CoreNLP 4.5.6 to work the Maven way by preloading the assembly?

NightOwl888 commented 4 months ago

@GeorgeS2019

No, it still doesn't work. However, I recall getting a different error message previously than I am getting now.

I didn't report it previously because I wanted to test it in Java to make sure the issue is because of IKVM, but I am fairly certain it is or someone would have reported it to the Stanford CoreNLP issue tracker by now.

My project file and Program.cs now look like this:

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net6.0</TargetFramework>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="IKVM" Version="8.7.5"/>
    <PackageReference Include="IKVM.Maven.Sdk" Version="1.6.7" />
  </ItemGroup>

  <ItemGroup>
    <MavenReference Include="edu.stanford.nlp:stanford-corenlp" Version="4.5.6" />
    <MavenReference Include="edu.stanford.nlp:stanford-corenlp" Version="4.5.6" Classifier="models" />
  </ItemGroup>

</Project>

using edu.stanford.nlp.pipeline;
using java.util;
using System;
using System.IO;
using System.Reflection;

namespace IkvmMavenMissingResourcesError
{
    internal class Program
    {
        static void Main(string[] args)
        {
            Assembly.LoadFile(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "jollyday.dll"));

            // Load the resource assemblies
            string assemblyName = "edu.stanford.nlp.corenlp_english_models, Version=4.5.0.0, Culture=neutral, PublicKeyToken=13235d27fcbfff58";
            Assembly.Load(new AssemblyName(assemblyName));

            // Initialize Stanford CoreNLP for sentiment analysis
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, sentiment");
            var pipeline = new StanfordCoreNLP(props);
        }
    }
}

Error creating edu.stanford.nlp.time.TimeExpressionExtractorImpl
   at edu.stanford.nlp.util.ReflectionLoading.loadByReflection(String className, Object[] arguments)
   at edu.stanford.nlp.time.TimeExpressionExtractorFactory.create(String className, String name, Properties props)
   at edu.stanford.nlp.time.TimeExpressionExtractorFactory.createExtractor(String name, Properties props)
   at edu.stanford.nlp.ie.regexp.NumberSequenceClassifier..ctor(Properties props, Boolean useSUTime, Properties sutimeProps)
   at edu.stanford.nlp.ie.NERClassifierCombiner..ctor(Boolean applyNumericClassifiers, Language nerLanguage, Boolean useSUTime, Properties nscProps, String[] loadPaths)
   at edu.stanford.nlp.pipeline.NERCombinerAnnotator..ctor(Properties properties)
   at edu.stanford.nlp.pipeline.AnnotatorImplementations.ner(Properties properties)
   at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$getNamedAnnotators$8(Properties props, AnnotatorImplementations impl)
   at edu.stanford.nlp.pipeline.StanfordCoreNLP.__<>Anon9.apply(Object , Object )
   at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$null$33(Entry entry, Properties inputProps, AnnotatorImplementations annotatorImplementation)
   at edu.stanford.nlp.pipeline.StanfordCoreNLP.__<>Anon41.get()
   at edu.stanford.nlp.util.Lazy.3.compute()
   at edu.stanford.nlp.util.Lazy.get()
   at edu.stanford.nlp.pipeline.AnnotatorPool.get(String name)
   at edu.stanford.nlp.pipeline.StanfordCoreNLP..ctor(Properties props, Boolean enforceRequirements, AnnotatorPool annotatorPool)
   at edu.stanford.nlp.pipeline.StanfordCoreNLP..ctor(Properties props, Boolean enforceRequirements)
   at edu.stanford.nlp.pipeline.StanfordCoreNLP..ctor(Properties props)
   at IkvmMavenMissingResourcesError.Program.Main(String[] args) in F:\Users\shad\source\repos\IkvmMavenMissingResourcesError\IkvmMavenMissingResourcesError\Program.cs:line 48

Previously, I was getting an error that originated in jollyday.dll, but even after explicitly loading it the error persisted. But now it seems to have made it a bit further before it bombed. It still is having trouble loading a class that accesses resources.

I have a demo showing loading all of the resource files explicitly that works. We shouldn't have to do that, though. The whole point of putting the resources in the .jar file is to have a default set of resources that load automatically just by including the package.

GeorgeS2019 commented 4 months ago

@NightOwl888

Thanks for your initial code.

Version 4.5.6 is working now:

https://github.com/NightOwl888/lucenenet-opennlp-mavenreference-demo/issues/1#issuecomment-2114600761

ikvmnet / ikvm-maven

Resources not found when referencing external .jar files #51
