google-code-export / dkpro-tc

Automatically exported from code.google.com/p/dkpro-tc

CRFSuite: Creating standalone runnable jars with self-trained models #234

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Hi, I looked into the tasks necessary to build a standalone runnable jar from 
a trained model.

Here are the main TODOs I see:

TODO:
Place created files/resources within the project folder so that they are 
available when the uber-jar is executed
The main effort is getting all resources into a location in the project folder 
(model-related, but also feature-related resources)
At the moment it is unknown which parameters are "resource paths" (i.e. refer 
to resources) and which are just constants

 + Introducing a convention for marking such resource-parameters would be helpful to automatically collect the resources 

 + Adding a dedicated "uber-jar preparation task" that does exactly this copying/preparing before the jar is built by "mvn install" would be convenient
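
One way such a marking convention could look, as a rough sketch only: a hypothetical @ResourcePath annotation on parameter fields plus a reflective collector that a preparation task could call to find the files it has to copy. Neither the annotation nor the classes below exist in DKPro TC or uimaFIT; all names and the example path are made up for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

final class ResourcePathDemo {

    // Hypothetical marker: "this parameter holds a path to a resource, not a constant".
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface ResourcePath {
    }

    // Hypothetical feature extractor with one resource parameter and one constant.
    static class ExampleFeatureExtractor {
        @ResourcePath
        String dictionaryLocation = "/home/user/dicts/slang.txt";

        int ngramSize = 3; // plain constant, must not be collected
    }

    // Returns the values of all String fields marked with @ResourcePath.
    static List<String> collect(Object component) throws IllegalAccessException {
        List<String> paths = new ArrayList<>();
        for (Field field : component.getClass().getDeclaredFields()) {
            if (field.isAnnotationPresent(ResourcePath.class)
                    && field.getType() == String.class) {
                field.setAccessible(true);
                Object value = field.get(component);
                if (value != null) {
                    paths.add((String) value);
                }
            }
        }
        return paths;
    }

    public static void main(String[] args) throws IllegalAccessException {
        // Prints the paths a preparation step would have to copy into the project.
        System.out.println(collect(new ExampleFeatureExtractor()));
    }
}
```

With something like this in place, the preparation task would not need to guess which entries are paths; unmarked parameters would simply be treated as constants.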

 Saving: 
 + SaveModelCRFSuiteBatchTask allows a free choice of where the serialized model files are stored
 How about enforcing (!) that these files are stored within the project, which makes loading in the uber-jar scenario easier (i.e. everything has to be in the project (src/main/resources?))

 + parameters.txt
 This file contains the parameters of the features, including paths to needed resources
 It contains only absolute paths, which are scattered over the user's HDD
 The "lucenceDir"-parameter for instance points to a folder relative to DKPRO_HOME
 Other resources might lie wherever they are placed by the user
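
To illustrate how such absolute paths could be rewritten relative to a known base folder before packaging, here is a minimal sketch using java.nio.file. The base folder stands in for DKPRO_HOME or the project folder; the class and method names are hypothetical, and nothing is assumed here about the actual format of parameters.txt.

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

final class PathRelativizer {

    // Returns the path relative to baseDir, or empty if the path lies outside baseDir.
    static Optional<String> relativize(String absolutePath, String baseDir) {
        Path base = Paths.get(baseDir).toAbsolutePath().normalize();
        Path target = Paths.get(absolutePath).toAbsolutePath().normalize();
        if (!target.startsWith(base)) {
            return Optional.empty();
        }
        // Use forward slashes so the result can also serve as a classpath resource name.
        String relative = base.relativize(target).toString();
        return Optional.of(relative.replace(File.separatorChar, '/'));
    }

    public static void main(String[] args) {
        // Example values only; "/opt/dkpro" stands in for DKPRO_HOME.
        System.out.println(relativize("/opt/dkpro/lucene/index", "/opt/dkpro"));
        System.out.println(relativize("/tmp/elsewhere/dict.txt", "/opt/dkpro"));
    }
}
```

Paths that come back empty are the ones "scattered over the HDD": those would have to be copied into the project first.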

 Loading:
 + The TcAnnotatorSequence class requires a parameter PARAM_TC_MODEL_LOCATION
 This parameter should point to a folder in the project that contains all resources

 + All resources have to be copied to temporary files before they can be used (e.g. via ClassLoader.getResourceAsStream or something similar)
 + The loading routines have to be extended accordingly to support the copy operation
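
The copy step could look roughly like this: a minimal sketch that reads a resource through ClassLoader.getResourceAsStream and writes it to a temporary file, so that code expecting a file-system path keeps working when the resource is inside a jar. The class name, the temp-file prefix and the resource name mentioned in the comment are assumptions, not existing TC code.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class ClasspathResourceCopier {

    // Copies a classpath resource to a temporary file and returns that file.
    static Path copyToTempFile(ClassLoader loader, String resourceName) throws IOException {
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new FileNotFoundException("Not on classpath: " + resourceName);
            }
            Path tmp = Files.createTempFile("tc-model-", ".tmp");
            tmp.toFile().deleteOnExit(); // clean up when the JVM exits
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass a resource name such as "my/project/model/classifier.ser" (made-up name).
        if (args.length == 1) {
            ClassLoader loader = ClasspathResourceCopier.class.getClassLoader();
            System.out.println(copyToTempFile(loader, args[0]));
        }
    }
}
```

Resource names passed to a ClassLoader have no leading slash and always use "/" as separator, regardless of the operating system.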

 Any thoughts or comments on this? I am especially interested in the "marking" of resource paths and/or conventions for where to store files so that they can be loaded from the final jar.

Original issue reported on code.google.com by Tobias.H...@gmail.com on 7 Mar 2015 at 1:24

GoogleCodeExporter commented 9 years ago
Why do we need an uber-jar?

Original comment by richard.eckart on 7 Mar 2015 at 5:24

GoogleCodeExporter commented 9 years ago
It would be nice to turn some things into an actual "product" and provide a 
runnable tool (aka uber-jar) that just works if you give it some input.

We would like to release our own PoS tagger for social media in the future - 
without expecting the user to be a programmer who knows Eclipse, TC and all 
that stuff.

It is a bit of work, but if you can release your research as a standalone 
runnable product with all dependencies and stuff being taken care of - it would 
be quite an advantage for TC. 

Original comment by Tobias.H...@gmail.com on 7 Mar 2015 at 5:30

GoogleCodeExporter commented 9 years ago
As a developer of DKPro Core, I'm interested in tools that can be integrated 
easily into UIMA pipelines. Uber-JARs are notoriously problematic because they 
are highly likely to conflict with other classes on the classpath. Thus, I'd be 
more interested in something a step short of an uber-jar: a model and (if 
necessary) a "light" JAR that I can wrap and integrate as a UIMA component - 
or that already is a DKPro-compatible UIMA component and can be added to a 
pipeline directly.

I suppose based on that, I could always use the Maven shade plugin to create an 
uber-jar if I wanted. We did that already with DKPro Core pipelines. 
Alternatively, I could build a Groovy script that downloads the stuff from 
repositories.

Original comment by richard.eckart on 7 Mar 2015 at 5:35

GoogleCodeExporter commented 9 years ago
Yes, we already have a model-loading pipeline feature for created model files. 
If you create an uber-jar from such a pipeline that loads a model, it rains 
"file-path not found" exceptions because all paths are absolute file-system 
paths, which is the key problem when you try to use them from within the jar file.

So far I have only used uber-jars; I thought it was the only way to make 
something "runnable" outside of Eclipse?
Including the resources by downloading them is a possibility, but this sounds 
like more effort to me. I was thinking of something where you provide all the 
stuff once, prepare things for wrapping into the uber-jar, and are then done. 
I am not sure how severe the dependency clash problems for uber-jars are. 

Original comment by Tobias.H...@gmail.com on 7 Mar 2015 at 6:01

GoogleCodeExporter commented 9 years ago
Uber-jar clashes are hell. E.g. we could not make use of the TWSI library, which 
was provided as an uber-jar, in any pipeline together with the Stanford tools, 
because TWSI included a copy of the Stanford classes.

Uber-jars are nice for stand-alone applications - but they are completely 
unusable within larger pipelines.

Of course models should not use absolute paths, at least not absolute paths 
with respect to the file system. You can use absolute paths within the 
classpath - that is what DKPro Core is doing all the time. It just needs to be 
made sure that a proper package structure is used - i.e. that models are not 
simply stored e.g. as "models/en.bin" in a JAR because that will also cause 
clashes.
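
As a small illustration of the package-structure point, the sketch below builds a package-qualified resource name and lists every copy of a resource that a class loader can see; more than one hit for the same name means a clash. The package name "org.example.tagger" and the class itself are made up, and this is not how DKPro Core resolves models internally.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

final class ModelLocation {

    // Builds a package-qualified resource name, e.g. "org/example/tagger/en.bin".
    static String qualified(String packageName, String fileName) {
        return packageName.replace('.', '/') + "/" + fileName;
    }

    // Lists every copy of a resource visible to the class loader.
    // More than one entry means two JARs ship a file under the same name.
    static List<URL> copiesOf(ClassLoader loader, String resourceName) throws IOException {
        return Collections.list(loader.getResources(resourceName));
    }

    public static void main(String[] args) throws IOException {
        String name = qualified("org.example.tagger", "en.bin"); // made-up package
        ClassLoader loader = ModelLocation.class.getClassLoader();
        System.out.println(name + " -> " + copiesOf(loader, name).size() + " copy/copies");
    }
}
```

A generic name like "models/en.bin" is likely to be used by several JARs at once, while a name rooted in the providing project's own package is not.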

The "export as JAR" in Eclipse never worked well for me because of the way that 
uimaFIT handles type detection.
For this reason, I am building runnable JARs using the maven-shade-plugin:

http://uima.apache.org/d/uimafit-current/tools.uimafit.book.html#ugr.tools.uimafit.packaging

Also nice are the Groovy scripts that we have in DKPro Core. They use Maven 
dependencies for components and the DKPro Core model auto-loading mechanism for 
loading models (although models could equally be added as Maven dependencies to 
the scripts - we just do not do it because it saves some lines of code):

https://code.google.com/p/dkpro-core-asl/wiki/DraftGroovyIntro

Such scripts are not as fully standalone as an uber-jar because they still 
require access to a Maven repository, but on the other hand they are 
really nice and short and serve as good examples.

I think it would be good to solve the file loading problems before taking the 
next step of creating an uber-jar - and for creating an uber-jar I would 
strongly recommend the approach described in the uimaFIT documentation 
mentioned above.

What do you think about setting up a wiki page to write up a specification 
mentioning the requirements and envisioned solutions? While discussing here, we 
might easily lose track of what we actually want, why, and how we imagine 
solving it.

Original comment by richard.eckart on 8 Mar 2015 at 8:32

GoogleCodeExporter commented 9 years ago
I think I meant the maven-shade way when I was talking about the uber-jar; 
apparently that is not the same?

The other Wiki pages are all of the kind "how to use TC". I see no discussion 
pages for work in progress?

Original comment by Tobias.H...@gmail.com on 8 Mar 2015 at 12:18

GoogleCodeExporter commented 9 years ago
There are various ways of building uber-jars. The maven-shade-plugin is special 
in the sense that it can be configured to properly handle (merge) certain 
configuration files that reside in well-known places in the classpath. Other 
uber-jar builders tend to fail to handle such files properly.
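
To make the "merge" point concrete: when several JARs each ship a file under the same classpath name, a naive uber-jar keeps only one copy, whereas merging keeps the lines of all of them. The sketch below only shows the idea at runtime by concatenating every visible copy of a resource; it is not the shade plugin's implementation, and the class and resource names are made up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ResourceMerger {

    // Concatenates the lines of every classpath resource with the given name.
    static List<String> mergedLines(ClassLoader loader, String resourceName) throws IOException {
        List<String> merged = new ArrayList<>();
        for (URL url : Collections.list(loader.getResources(resourceName))) {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    merged.add(line);
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) throws IOException {
        // "META-INF/example/types.txt" is a made-up name for a per-JAR config file.
        ClassLoader loader = ResourceMerger.class.getClassLoader();
        System.out.println(mergedLines(loader, "META-INF/example/types.txt"));
    }
}
```

An uber-jar builder that simply lets the last file win would lose every line except those from one JAR, which is the failure mode described above.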

Just create a new wiki page ;) Back in the olden days, when I started 
implementing the resource resolving mechanism in DKPro Core, I set up such a 
page in the DKPro Core wiki [1]. This page eventually turned into the seed for 
documentation on resource packaging in DKPro Core, but the requirements 
collected are still clearly visible.

[1] https://code.google.com/p/dkpro-core-asl/wiki/ResourceProviderAPI

Original comment by richard.eckart on 8 Mar 2015 at 8:32

GoogleCodeExporter commented 9 years ago
I wrote things up here: 
https://code.google.com/p/dkpro-tc/wiki/ReusableTCModels

The page is not linked yet; I didn't know where to place it...

Original comment by Tobias.H...@gmail.com on 9 Mar 2015 at 7:32