google-code-export / dkpro-tc

Automatically exported from code.google.com/p/dkpro-tc

Improve Mallet wrapper #134

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Currently, the Mallet wrapper has several issues:
- POS tagging demo seems to run forever
- only CRF is added, but HMM would also be nice
- CRF wrapper hard-codes a lot of parameter settings; we should allow setting at 
least the most important ones via parameters (like in Weka)

Original issue reported on code.google.com by torsten....@gmail.com on 27 May 2014 at 4:39

GoogleCodeExporter commented 9 years ago

Original comment by daxenber...@gmail.com on 4 Jun 2014 at 11:51

GoogleCodeExporter commented 9 years ago

Original comment by daxenber...@gmail.com on 4 Jun 2014 at 4:09

GoogleCodeExporter commented 9 years ago
Comments:

- currently almost all wrappers of Mallet methods are in the package 
mallet.util - this is not where I would have expected them (why not call the 
package mallet.wrappers)
- the package mallet.task contains a single call in TestTask to an original 
Mallet method, the TransducerEvaluator; I guess that should in principle also 
be wrapped so that all wrappers are included in a single package
- hardcoding of parameters: currently, the main Mallet training methods are 
wrapped in the class TaskUtils in the util package (again, I would not expect 
them to be there).
- the following parameters of CRF in Mallet are currently hard-coded:
gaussianPriorVariance = 10.0
iterations = 1000
defaultLabel = "O";
orders = new int[]{0, 1, 2, 3, 4};
denseFeatureValues = true;

- the method trainCRF is called with these parameters:

    public static CRF trainCRF(InstanceList training, CRF crf, double gaussianPriorVariance, int iterations, String defaultLabel,
            boolean fullyConnected, int[] orders) {

trainCRF itself calls the Mallet function addOrderNStates:

                    crf.addOrderNStates(training, orders, null,
                            defaultLabel, null, null,
                            fullyConnected);

Compared with the signature of the original method, this looks like a bug: 
defaultLabel should be the 3rd param, I think.

    public String addOrderNStates(InstanceList trainingSet, int[] orders,
            boolean[] defaults, String start,
            Pattern forbidden, Pattern allowed,
            boolean fullyConnected)

Other questions:
Can anybody add links to Mallet documentation here?
I hardly found any. Is the code the only documentation available?

Original comment by eckle.kohler on 17 Jul 2014 at 11:23

GoogleCodeExporter commented 9 years ago
The bug hypothesis turned out to be false: defaultLabel is called "start" in 
Mallet.

Summarizing the inspection so far, the following parameters are currently used 
for the configuration of the CRF:

gaussianPriorVariance = 10.0
iterations = 1000

defaultLabel = "O"; --- this is called start in Mallet:
     * @param start The label that represents the context of the start of
     * a sequence. It may be also used for sequence labels.  If no label of
     * this name exists, one will be added. Connection wills be added between
     * the start label and all other labels, even if <tt>fullyConnected</tt> is
     * <tt>false</tt>.  This argument may be null, in which case no special
     * start state is added.

orders = new int[]{0, 1, 2, 3, 4};
     * @param orders an array of increasing non-negative numbers giving
     * the orders of the features for this CRF. The largest number
     * <em>n</em> is the Markov order of the CRF. States are
     * <em>n</em>-tuples of output labels. Each of the other numbers
     * <em>k</em> in <code>orders</code> represents a weight set shared
     * by all destination states whose last (most recent) <em>k</em>
     * labels agree. If <code>orders</code> is <code>null</code>, an
     * order-0 CRF is built.

denseFeatureValues = true; --- unclear what that means

fullyConnected = false
     * @param fullyConnected Whether to include all allowed transitions,
     * even those not occurring in <code>trainingSet</code>,
fullyConnected is hard-coded as false in the call of runTrainTest in the TestTask, see:

 TransducerEvaluator eval = TaskUtils.runTrainTest(fileTrain, fileTest, fileModel, gaussianPriorVariance, iterations, defaultLabel,
                false, orders, tagger, denseFeatureValues);
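
One way the settings listed above could be made configurable is a small parameter object whose defaults match the currently hard-coded values. This is only a sketch: the class name CrfTrainingConfig and its builder are hypothetical and exist in neither DKPro TC nor Mallet, and the validation of orders assumes that "increasing" in the Mallet Javadoc means strictly increasing.

```java
/**
 * Hypothetical holder for the CRF settings discussed in this thread.
 * Not part of DKPro TC or Mallet; defaults mirror the hard-coded values.
 */
final class CrfTrainingConfig {
    final double gaussianPriorVariance;
    final int iterations;
    final String defaultLabel;        // called "start" in Mallet
    final boolean fullyConnected;
    final boolean denseFeatureValues;
    private final int[] orders;

    private CrfTrainingConfig(Builder b) {
        this.gaussianPriorVariance = b.gaussianPriorVariance;
        this.iterations = b.iterations;
        this.defaultLabel = b.defaultLabel;
        this.fullyConnected = b.fullyConnected;
        this.denseFeatureValues = b.denseFeatureValues;
        this.orders = b.orders.clone();
    }

    /** Returns a copy so callers cannot modify the stored array. */
    int[] orders() {
        return orders.clone();
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        // Defaults taken from the values listed in this thread.
        private double gaussianPriorVariance = 10.0;
        private int iterations = 1000;
        private String defaultLabel = "O";
        private boolean fullyConnected = false;
        private boolean denseFeatureValues = true;
        private int[] orders = {0, 1, 2, 3, 4};

        Builder gaussianPriorVariance(double value) {
            if (!(value > 0.0)) {
                throw new IllegalArgumentException("variance must be positive");
            }
            this.gaussianPriorVariance = value;
            return this;
        }

        Builder iterations(int value) {
            if (value < 1) {
                throw new IllegalArgumentException("iterations must be >= 1");
            }
            this.iterations = value;
            return this;
        }

        Builder defaultLabel(String value) {
            this.defaultLabel = value;
            return this;
        }

        Builder fullyConnected(boolean value) {
            this.fullyConnected = value;
            return this;
        }

        Builder denseFeatureValues(boolean value) {
            this.denseFeatureValues = value;
            return this;
        }

        /** Assumption: orders must be non-negative and strictly increasing. */
        Builder orders(int... values) {
            int previous = -1;
            for (int v : values) {
                if (v <= previous) {
                    throw new IllegalArgumentException(
                            "orders must be non-negative and increasing");
                }
                previous = v;
            }
            this.orders = values.clone();
            return this;
        }

        CrfTrainingConfig build() {
            return new CrfTrainingConfig(this);
        }
    }
}
```

With something like this, CrfTrainingConfig.builder().build() would reproduce the current behaviour, while CrfTrainingConfig.builder().iterations(200).orders(0, 1).build() would override only the two named settings.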

Original comment by eckle.kohler on 17 Jul 2014 at 1:48

GoogleCodeExporter commented 9 years ago

Original comment by daxenber...@gmail.com on 29 Aug 2014 at 10:37

GoogleCodeExporter commented 9 years ago

Original comment by daxenber...@gmail.com on 29 Aug 2014 at 10:59

GoogleCodeExporter commented 9 years ago
I tend to close this, as we have a CRFSuite wrapper now.
Does anyone need Mallet instead of CRFSuite? 

Original comment by torsten....@gmail.com on 7 Nov 2014 at 9:20

GoogleCodeExporter commented 9 years ago
No objections; however, we should at least deprecate the MALLET module, and 
maybe also drop it from the next release.

Original comment by daxenber...@gmail.com on 8 Nov 2014 at 2:57

GoogleCodeExporter commented 9 years ago
How do you deprecate a module?
I have never done that before.

Original comment by torsten....@gmail.com on 8 Nov 2014 at 7:41

GoogleCodeExporter commented 9 years ago
We could deprecate everything in this module on package level in the respective 
package-info.java (only works for the current namespace), or manually deprecate 
all classes/public methods.
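
For illustration, the two options might look roughly like the sketch below. The package name org.example.mallet and the class LegacyMalletTask are hypothetical stand-ins, not the real module contents, and the package-level variant is shown as a comment so that the sketch stays a single compilable file. The small reflection helper is only there to show that the class-level annotation is visible at runtime.

```java
// Option 1: package-info.java in each package of the module
// (shown as a comment; the package name is a hypothetical placeholder).
//
//   /**
//    * @deprecated Use the CRFSuite wrapper instead.
//    */
//   @Deprecated
//   package org.example.mallet;

// Option 2: deprecate classes and public methods individually.

/**
 * Hypothetical stand-in for a class in the Mallet module.
 *
 * @deprecated Use the CRFSuite wrapper instead.
 */
@Deprecated
class LegacyMalletTask {

    /**
     * @deprecated Use the CRFSuite wrapper instead.
     */
    @Deprecated
    static String describe() {
        return "deprecated Mallet task";
    }
}

/** Helper that checks for the annotation via reflection. */
class DeprecationCheck {
    static boolean isDeprecated(Class<?> type) {
        // @Deprecated has runtime retention, so it is visible here.
        return type.isAnnotationPresent(Deprecated.class);
    }
}
```

Annotating each class has the side effect that the compiler can flag every use of it, which may be preferable if users should be steered towards CRFSuite before the module is dropped.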

Original comment by daxenber...@gmail.com on 10 Nov 2014 at 8:16

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r1301.

Original comment by daxenber...@gmail.com on 11 Dec 2014 at 3:45