Add layout features to GROBID model

de-code commented 5 years ago

Hi @kermitt2

Something you are already well aware of but I thought it's good to have an issue to record the discussion around it. I am not sure whether you already experimented with adding layout features.

I've started doing it and implemented something here: https://github.com/elifesciences/sciencebeam-trainer-delft/pull/16

Maybe you'll find some of it useful. (I don't want to flood you with too many PRs)

kermitt2 commented 5 years ago

Thank you! Yes, it's something we definitely want to add. Layout features should bring some improvements and make the models competitive with the current CRF models that use them (and maybe even better).

Just had a quick look at your implementation, great work I think! We can ignore the lexical features like prefix/suffix (the first 8), but also some beyond 9, like shadow number and so on (which have many values), because we already have a character input channel in every architecture. Just one-hot encoding the layout features as you are doing is probably enough I think (note: there is a dense_to_one_hot() function in preprocess.py already used for the case features of one model).
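
As a rough illustration of the one-hot idea, here is a minimal sketch. It is not the code from the PR and not DeLFT's dense_to_one_hot(); the function names and the example feature values are made up for this illustration.

```python
from typing import Dict, List


def build_value_index(column_values: List[str]) -> Dict[str, int]:
    """Map each distinct value of one feature column to a position (first-seen order)."""
    index: Dict[str, int] = {}
    for value in column_values:
        if value not in index:
            index[value] = len(index)
    return index


def one_hot(value: str, index: Dict[str, int]) -> List[int]:
    """Return a one-hot vector; values not seen when building the index map to all zeros."""
    vector = [0] * len(index)
    position = index.get(value)
    if position is not None:
        vector[position] = 1
    return vector


# Illustrative values of a single layout feature for four tokens (assumed, not from the thread).
line_status = ["LINESTART", "LINEIN", "LINEIN", "LINEEND"]
index = build_value_index(line_status)
encoded = [one_hot(v, index) for v in line_status]
```

With several selected features, the per-feature vectors would simply be concatenated per token before being fed to the model.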

The "gazetteer" features could help, but they typically don't help NER models, so that might also be the case here.

We could imagine a first pass in the reader to count the number of values for each feature and use that to select which ones will be used for one-hot encoding and concatenation.
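
A minimal sketch of such a first pass, assuming the features are available as rows of string values (one row per token). The function names, the toy rows and the threshold are assumptions for illustration only.

```python
from typing import List, Sequence, Set


def count_feature_values(rows: Sequence[Sequence[str]]) -> List[int]:
    """First pass: number of distinct values seen in each feature column."""
    seen: List[Set[str]] = []
    for row in rows:
        # Grow the list of sets if a row has more columns than seen so far.
        while len(seen) < len(row):
            seen.append(set())
        for column, value in enumerate(row):
            seen[column].add(value)
    return [len(values) for values in seen]


def select_feature_columns(rows: Sequence[Sequence[str]], max_cardinality: int) -> List[int]:
    """Keep the columns whose number of distinct values is at most max_cardinality."""
    counts = count_feature_values(rows)
    return [i for i, n in enumerate(counts) if n <= max_cardinality]


# Toy rows: column 0 is token-like (many values), columns 1 and 2 are layout-like.
rows = [
    ["Deep", "LINESTART", "1"],
    ["learning", "LINEIN", "1"],
    ["for", "LINEIN", "0"],
    ["headers", "LINEEND", "0"],
]
counts = count_feature_values(rows)
selected = select_feature_columns(rows, max_cardinality=3)
```

Here the token-like column is dropped because it has as many distinct values as there are rows, while the two low-cardinality columns are kept.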

Thanks a lot for the great contribution! (and sorry for not being very responsive currently)

de-code commented 5 years ago

Finally got around to doing some end-to-end evaluation.

Not yet using GROBID's evaluation (the first attempt failed and it doesn't quite fit into my workflow yet).

I haven't done it on the full PMC sample 1943 dataset but rather a random sample of 390.

On our author submitted dataset (not trained on yet) this looks about:

(Since 0.5.5 it's failing to convert a number of those manuscripts, which I will need to investigate. Those are just ignored rather than counted negatively.)

lfoppiano commented 4 years ago

Implementation question: do we want to add a command-line parameter like --use_features or --ignore-features?

I noticed that @de-code implemented a long list of command-line parameters in https://github.com/elifesciences/sciencebeam-trainer-delft, which might be hard to navigate, but is useful for quick scripting. What's your opinion?

The reason I'm asking is that I would like to run a quick test to see whether the features have an impact. A command-line parameter would allow me to run the command twice without touching anything.

kermitt2 commented 4 years ago

We probably want that, yes :) Likely --ignore-features I guess, because when features are available I think they will likely improve the results for many GROBID models (because they capture layout information), so using them should be the default.
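
A minimal argparse sketch of that default-on behaviour. The flag name comes from the discussion above; the parser itself is hypothetical and is not the actual grobidTagger.py interface.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Hypothetical training CLI sketch")
    # Features are used by default; the flag switches them off.
    parser.add_argument(
        "--ignore-features",
        action="store_true",
        help="train without the additional (layout) features",
    )
    return parser


# Explicit argument lists are passed so the sketch does not depend on sys.argv.
default_args = build_parser().parse_args([])
ablation_args = build_parser().parse_args(["--ignore-features"])
use_features_default = not default_args.ignore_features
use_features_ablation = not ablation_args.ignore_features
```

Running the same command once with and once without the flag would then give the two runs needed for a quick comparison.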

lfoppiano commented 4 years ago

Thanks! I'm also wondering whether we should pass the features to the CRF layer in some way, explicitly?

kermitt2 commented 4 years ago

No, the CRF layer just acts as an activation function before the output, i.e. it computes the probability distributions of the possible labels from the last neuron layer. It's the role of the previous layers to "digest" these additional input features.

lfoppiano commented 4 years ago

I see.

Now in the implementation, I select a feature only when the cardinality of its values appearing in the training data is below the feature max length (12).

@de-code implemented an additional parameter that allows the user to select explicitly which features to include. I think that would make the approach more resilient, for example by avoiding the inclusion of low-variability (potentially useless) features.

kermitt2 commented 4 years ago

I think all the features with cardinality more than 12 are useless because they are all character-based patterns (prefix, suffix, word shape, ...) and the DL architectures have a character input channel already specifically dedicated to this. So these features would very likely be redundant and could actually rather degrade the training (via usual overfitting problems because they are very specific).

Of course features with cardinality less than 12 can also be useless, typically casing and gazetteer are not helping, so a feature selection mechanism certainly makes sense too, though these features might just be ignored during training - something interesting to benchmark!

lfoppiano commented 4 years ago

OK, indeed.

Since @de-code already implemented something working, I would just integrate it as a list (including ranges). If not specified, the system will try automatic selection as I've implemented now.
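
For illustration, a list with ranges could be parsed roughly like this. The "9-12,15" syntax and the function name are assumptions for this sketch, not necessarily what either implementation accepts.

```python
from typing import List


def parse_feature_indices(spec: str) -> List[int]:
    """Parse a spec such as "9-12,15" into [9, 10, 11, 12, 15] (ranges are inclusive)."""
    indices: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if end < start:
                raise ValueError(f"invalid range: {part!r}")
            indices.extend(range(start, end + 1))
        else:
            indices.append(int(part))
    return indices


explicit = parse_feature_indices("9-12,15")
```

An empty result would be the natural signal to fall back to the automatic, cardinality-based selection.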

de-code commented 4 years ago

Just some random thoughts on the feature indices:

* I wasn't sure how fixed the features are across the models, including GROBID submodules. Could they potentially provide different features?
* In general it probably doesn't make much sense to provide the first 8 features or so to DeLFT as it should be the responsibility of the model to create those on the fly. But it's probably just easier to keep them the same as what is currently used for Wapiti.
* Keeping the feature indices internally has the advantage that they can be stored in the model config for visibility, and that an existing model can still be loaded with those indices even after changing the default.
* I personally like being able to do as much hyperparameter "tuning" (hacking) via the command line as possible.

lfoppiano commented 4 years ago

Just some random thoughts on the feature indices:

* I wasn't sure how fixed the features are across the models, including GROBID submodules. Could they potentially provide different features?

Yes, it's up to the model designer / design

* In general it probably doesn't make much sense to provide the first 8 features or so to DeLFT as it should be the responsibility of the model to create those on the fly. But it's probably just easier to keep them the same as what is currently used for Wapiti.

Yes, indeed.

* Keeping the feature indices internally has the advantage that they can be stored in the model config for visibility and being able to load an existing model with those indices even after changing the default

Very good point.

* I personally like being able to do as much hyper parameter "tuning" (hacking) via the command line as possible

I see, for this I'm not trying to change the current approach at the moment

kermitt2 commented 4 years ago

I personally like being able to do as much hyper parameter "tuning" (hacking) via the command line as possible

There are too many different hyperparameters for each model, often several per layer, plus plenty of possible training parameters. I think it's not manageable from the command line. What is often done in libraries supporting several architectures is to have dedicated config files, one for each architecture with the different hyperparameters (a bit like the current config file associated with each produced model).

kermitt2 commented 4 years ago

The question is maybe how much these models make sense outside Grobid. In my original intent, DeLFT was not supposed to provide its own controls over the Grobid models: Only Grobid, with delft interfaced via JEP, trains, evals and runs models because only Grobid can generate the training data with features and the data to be labeled with features (because features and tokens are usually derived from the PDF). So ideally the grobidTagger.py file was not supposed to stay or just as a way to debug models.

However, I had a problem training in DeLFT from Grobid, because the Python training process and its “stdout” output often get stuck when interfaced with JEP and never end. I didn't find enough time to solve the problem, so I added a training method that just calls grobidTagger.py train with an external process. It works of course, but it's more of a hack; it should normally also use JEP.

Honestly, I don't feel very good about building too much stuff to manage Grobid training data and models in DeLFT, because it is natural to drive that from Grobid, and it would be redundant and painful to maintain. At least having exactly the same input files for both Wapiti and DeLFT is a must, I think, to keep things minimally simple and transparent.

I don't know if I am very clear with my original design idea, but of course if grobid models as such, independently from Grobid (maybe the date or person parsing models), are useful, this effort could be justified.

de-code commented 4 years ago

Training via GROBID has the advantage that it is familiar to someone who has trained Wapiti before.

But for me personally, it doesn't work very well.

The main reason is that it requires a full GROBID setup, whereas I am trying to run the training on a separate machine on demand. I do not own a GPU but borrow one from the cloud for a short period. There are tools to do that from a Python code base, but having to also have the GROBID setup (which isn't just a library or CLI call) would be a significant roadblock. And by running the training from DeLFT directly, I can run training in parallel.

There is also no reason why a machine learning expert shouldn't be able to just improve the model via the Python code / DeLFT.

As for CLI parameters vs config file:

Google for example offers hyperparameter tuning (which I haven't used), but as I understand it, it would also pass parameters to the CLI.

A config file is certainly better than having to make code changes. I would think of config files as something more persistent. For example we generate a config file as part of the model or the GROBID configuration. A config file could describe the default arguments while command-line arguments could allow overriding the defaults. The command-line parameters could be scoped and generated based on the shared "tuning parameters" available via a config file.
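
A minimal sketch of that idea, with defaults coming from a config and one generated command-line option per config key. The config keys and values, the option naming and the function name are all assumptions for illustration; boolean options would need separate handling.

```python
import argparse
import json
from typing import Any, Dict, List

# Hypothetical defaults, as they might be read from a per-architecture config file.
DEFAULT_CONFIG_JSON = '{"batch_size": 10, "dropout": 0.5, "word_lstm_units": 100}'


def resolve_config(default_config: Dict[str, Any], argv: List[str]) -> Dict[str, Any]:
    """Generate one CLI option per config key; values given on the command line win."""
    parser = argparse.ArgumentParser()
    for key, default in default_config.items():
        # default=None lets us tell "not given" apart from "given with the default value".
        parser.add_argument(f"--{key.replace('_', '-')}", type=type(default), default=None)
    args = vars(parser.parse_args(argv))
    overrides = {k: v for k, v in args.items() if v is not None}
    return {**default_config, **overrides}


defaults = json.loads(DEFAULT_CONFIG_JSON)
config = resolve_config(defaults, ["--dropout", "0.3"])
```

The resolved dictionary could then be written out next to the trained model, so the persistent config always reflects what was actually used.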

The models themselves could probably be considered more generic. But maybe it makes sense for the CLI to be GROBID-specific until we have other use cases?

kermitt2 commented 4 years ago

There is also no reason why a machine learning expert shouldn't be able to just improve the model via the Python code / DeLFT.

Ok indeed, good to have the possibility to train in DeLFT/python, I agree!

So the only remaining problem is that DeLFT alone cannot generate the training/eval files with the features (the .train and .test files generated by Grobid). We could think about a mechanism to generate those files in Grobid and place them automatically in the data/ directory of DeLFT, so that it is then easy to switch to Python for training/tuning/evaluating/etc.

lfoppiano commented 4 years ago

We could add a dropwizard command in the grobid service to handle the integration with delft, such as generate the data, and, if needed other stuff

de-code commented 4 years ago

I would love it if we could easily generate the training data. Even more so if we could parallelise it (e.g. via a cluster). A service would work well for that.

de-code commented 4 years ago

Here are the evaluation results from my implementation with different parameters...

All of them use the training and test data generated from GROBID 0.5.6.

Apart from the mentioned parameters it is using common parameters, such as:

name	value
embeddings	glove.840B.300d
word_lstm_units	100
action	train_eval
shuffle-input	True
random-seed	42

By feature embedding below, I mean a Dense layer after the feature input.
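
To make the term concrete, here is a small pure-Python sketch of what a Dense layer does to concatenated one-hot features. It assumes a linear layer with a bias term; the toy sizes and weights are made up, and the real layer is a Keras Dense layer whose exact settings are not shown here.

```python
from typing import List


def dense(inputs: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Linear Dense layer: output[j] = sum_i inputs[i] * weights[i][j] + bias[j]."""
    return [
        sum(x * row[j] for x, row in zip(inputs, weights)) + bias[j]
        for j in range(len(bias))
    ]


# Toy sizes: 5 one-hot positions projected to 2 dimensions (the runs below use 77 -> 50).
weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
bias = [0.0, 0.0]
# Two concatenated one-hot features: positions 1 and 3 are active.
features = [0, 1, 0, 1, 0]
embedded = dense(features, weights, bias)

# Parameter count of a 77 -> 50 Dense layer, assuming a bias term is used.
feature_embedding_params = 77 * 50 + 50
```

For one-hot inputs the output is just the sum of the weight rows of the active values (plus the bias), which is why such a layer behaves like a learned embedding of the feature values.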

I am currently running the evaluation on the last epoch trained, although in some cases the f1 score for that epoch was going down, so I should probably use the epoch with the highest score.

No features (epoch 53, eval f1 83.00)

model config

{
    "recurrent_dropout": 0.5,
    "max_sequence_length": 500,
    "embeddings_name": "glove.840B.300d",
    "batch_size": 10,
    "num_char_lstm_units": 25,
    "case_embedding_size": 5,
    "case_vocab_size": 8,
    "num_word_lstm_units": 100,
    "max_char_length": 30,
    "use_features": false,
    "model_name": "header",
    "char_vocab_size": 305,
    "feature_indices": [],
    "use_ELMo": false,
    "fold_number": 1,
    "feature_embedding_size": 0,
    "dropout": 0.5,
    "max_feature_size": 123581,
    "model_type": "CustomBidLSTM_CRF",
    "char_embedding_size": 25,
    "use_char_feature": true,
    "use_crf": true,
    "word_embedding_size": 300,
    "use_BERT": false
}

keras model summary

INFO    2020-01-20 12:46:11 +0000   master-replica-0        2055 train sequences
INFO    2020-01-20 12:46:11 +0000   master-replica-0        229 validation sequences
INFO    2020-01-20 12:46:11 +0000   master-replica-0        254 evaluation sequences
INFO    2020-01-20 12:46:11 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:11 +0000   master-replica-0        Layer (type)                    Output Shape         Param #     Connected to                     
INFO    2020-01-20 12:46:11 +0000   master-replica-0        ==================================================================================================
INFO    2020-01-20 12:46:11 +0000   master-replica-0        char_input (InputLayer)         (None, None, 30)     0                                            
INFO    2020-01-20 12:46:11 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:11 +0000   master-replica-0        char_embeddings (TimeDistribute (None, None, 30, 25) 7625        char_input[0][0]                 
INFO    2020-01-20 12:46:11 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:11 +0000   master-replica-0        word_input (InputLayer)         (None, None, 300)    0                                            
INFO    2020-01-20 12:46:11 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:11 +0000   master-replica-0        char_lstm (TimeDistributed)     (None, None, 50)     10200       char_embeddings[0][0]            
INFO    2020-01-20 12:46:11 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:11 +0000   master-replica-0        concatenate_1 (Concatenate)     (None, None, 350)    0           word_input[0][0]                 
INFO    2020-01-20 12:46:11 +0000   master-replica-0                                                                         char_lstm[0][0]                  
INFO    2020-01-20 12:46:11 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:11 +0000   master-replica-0        dropout_1 (Dropout)             (None, None, 350)    0           concatenate_1[0][0]              
INFO    2020-01-20 12:46:11 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:11 +0000   master-replica-0        bidirectional_2 (Bidirectional) (None, None, 200)    360800      dropout_1[0][0]                  
INFO    2020-01-20 12:46:11 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:11 +0000   master-replica-0        dropout_2 (Dropout)             (None, None, 200)    0           bidirectional_2[0][0]            
INFO    2020-01-20 12:46:11 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:11 +0000   master-replica-0        dense_1 (Dense)                 (None, None, 100)    20100       dropout_2[0][0]                  
INFO    2020-01-20 12:46:11 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:11 +0000   master-replica-0        dense_ntags (Dense)             (None, None, 43)     4343        dense_1[0][0]                    
INFO    2020-01-20 12:46:11 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:11 +0000   master-replica-0        chain_crf_1 (ChainCRF)          (None, None, 43)     1935        dense_ntags[0][0]                
INFO    2020-01-20 12:46:11 +0000   master-replica-0        ==================================================================================================
INFO    2020-01-20 12:46:11 +0000   master-replica-0        Total params: 405,003
INFO    2020-01-20 12:46:11 +0000   master-replica-0        Trainable params: 405,003
INFO    2020-01-20 12:46:11 +0000   master-replica-0        Non-trainable params: 0
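
As a sanity check, the parameter counts in the summary above can be reproduced with the standard formulas. This is a sketch; the ChainCRF breakdown (a transition matrix plus two boundary vectors) is an assumption that happens to match the reported number.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Standard LSTM: 4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim: int, units: int) -> int:
    """Dense layer with bias."""
    return input_dim * units + units


char_embeddings = 305 * 25                   # char_vocab_size x char_embedding_size
char_lstm = 2 * lstm_params(25, 25)          # bidirectional, 25 units per direction
word_lstm = 2 * lstm_params(300 + 50, 100)   # input: word embedding + char_lstm output
dense_1 = dense_params(200, 100)
dense_ntags = dense_params(100, 43)
# Assumption: a 43x43 transition matrix plus two boundary vectors of size 43.
chain_crf = 43 * 43 + 2 * 43

total = char_embeddings + char_lstm + word_lstm + dense_1 + dense_ntags + chain_crf
```

Each value matches the corresponding "Param #" entry, and the sum matches the reported total of 405,003.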

training summary

INFO    2020-01-20 20:35:05 +0000   master-replica-0        training runtime: 28134.038 seconds 
INFO    2020-01-20 20:35:05 +0000   master-replica-0        Evaluation:
INFO    2020-01-20 20:35:05 +0000   master-replica-0            f1 (micro): 67.72
INFO    2020-01-20 20:35:05 +0000   master-replica-0                          precision    recall  f1-score   support
INFO    2020-01-20 20:35:05 +0000   master-replica-0                  <date>     0.7581    0.7015    0.7287        67
INFO    2020-01-20 20:35:05 +0000   master-replica-0                 <phone>     0.0000    0.0000    0.0000         3
INFO    2020-01-20 20:35:05 +0000   master-replica-0                 <email>     0.8020    0.8020    0.8020       101
INFO    2020-01-20 20:35:05 +0000   master-replica-0              <abstract>     0.8224    0.8013    0.8117       156
INFO    2020-01-20 20:35:05 +0000   master-replica-0                <pubnum>     0.4490    0.4583    0.4536        48
INFO    2020-01-20 20:35:05 +0000   master-replica-0                   <web>     0.5333    0.4444    0.4848        18
INFO    2020-01-20 20:35:05 +0000   master-replica-0                  <note>     0.3509    0.2353    0.2817       170
INFO    2020-01-20 20:35:05 +0000   master-replica-0           <affiliation>     0.6944    0.6711    0.6826       298
INFO    2020-01-20 20:35:05 +0000   master-replica-0            <dedication>     1.0000    1.0000    1.0000         1
INFO    2020-01-20 20:35:05 +0000   master-replica-0             <copyright>     0.7419    0.7188    0.7302        32
INFO    2020-01-20 20:35:05 +0000   master-replica-0                <author>     0.7642    0.7292    0.7463       240
INFO    2020-01-20 20:35:05 +0000   master-replica-0               <address>     0.7983    0.7364    0.7661       258
INFO    2020-01-20 20:35:05 +0000   master-replica-0                 <title>     0.7733    0.6960    0.7326       250
INFO    2020-01-20 20:35:05 +0000   master-replica-0            <submission>     0.8056    0.7838    0.7945        37
INFO    2020-01-20 20:35:05 +0000   master-replica-0             submission>     0.0000    0.0000    0.0000         2
INFO    2020-01-20 20:35:05 +0000   master-replica-0               <keyword>     0.9211    0.9211    0.9211        38
INFO    2020-01-20 20:35:05 +0000   master-replica-0                 <grant>     0.1250    0.1667    0.1429         6
INFO    2020-01-20 20:35:05 +0000   master-replica-0                <degree>     0.7500    0.5000    0.6000         6
INFO    2020-01-20 20:35:05 +0000   master-replica-0             <reference>     0.4688    0.3947    0.4286        76
INFO    2020-01-20 20:35:05 +0000   master-replica-0                 <intro>     0.3913    0.4286    0.4091        42
INFO    2020-01-20 20:35:05 +0000   master-replica-0        all (micro avg.)     0.7066    0.6501    0.6772      1849
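
For reference, the micro-averaged f1 in this report is the harmonic mean of the micro-averaged precision and recall, which can be checked from the last row (a small sketch using the numbers printed above):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Micro-averaged precision and recall from the "all (micro avg.)" row above.
micro_f1 = f1_score(0.7066, 0.6501)
```

Rounded to four decimals this gives the 0.6772 shown in the row, i.e. the "f1 (micro): 67.72" headline.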

Evaluation

Evaluation:
    f1 (micro): 83.00
                  precision    recall  f1-score   support

        <author>     0.9412    0.9412    0.9412        34
       <address>     0.7097    0.6875    0.6984        32
         <grant>     0.5000    0.5000    0.5000         2
         <title>     0.9615    0.9615    0.9615        26
          <note>     0.5000    0.1667    0.2500         6
          <date>     1.0000    0.8571    0.9231         7
         <intro>     1.0000    1.0000    1.0000         3
       <keyword>     1.0000    1.0000    1.0000         2
         <email>     0.7917    0.7600    0.7755        25
   <affiliation>     0.7879    0.7647    0.7761        34
    <submission>     0.0000    0.0000    0.0000         1
           <web>     1.0000    1.0000    1.0000         3
         <phone>     1.0000    0.6667    0.8000         3
        <pubnum>     0.7500    0.7500    0.7500         4
      <abstract>     0.9545    0.9545    0.9545        22

all (micro avg.)     0.8513    0.8137    0.8321       204

Features 9-30, no feature embedding (epoch 50, eval f1 85.50)

model config

{
    "embeddings_name": "glove.840B.300d",
    "recurrent_dropout": 0.5,
    "word_embedding_size": 300,
    "num_word_lstm_units": 100,
    "max_char_length": 30,
    "max_feature_size": 77,
    "case_vocab_size": 8,
    "fold_number": 1,
    "num_char_lstm_units": 25,
    "case_embedding_size": 5,
    "feature_embedding_size": 0,
    "use_crf": true,
    "char_vocab_size": 305,
    "model_name": "header",
    "char_embedding_size": 25,
    "max_sequence_length": 500,
    "use_BERT": false,
    "batch_size": 10,
    "use_char_feature": true,
    "dropout": 0.5,
    "model_type": "CustomBidLSTM_CRF",
    "use_ELMo": false,
    "feature_indices": [
        9,
        10,
        11,
        12,
        13,
        14,
        15,
        16,
        17,
        18,
        19,
        20,
        21,
        22,
        23,
        24,
        25,
        26,
        27,
        28,
        29,
        30
    ],
    "use_features": true
}

keras model summary

INFO    2020-01-20 12:46:07 +0000   master-replica-0        2055 train sequences
INFO    2020-01-20 12:46:07 +0000   master-replica-0        229 validation sequences
INFO    2020-01-20 12:46:07 +0000   master-replica-0        254 evaluation sequences
INFO    2020-01-20 12:46:07 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:07 +0000   master-replica-0        Layer (type)                    Output Shape         Param #     Connected to                     
INFO    2020-01-20 12:46:07 +0000   master-replica-0        ==================================================================================================
INFO    2020-01-20 12:46:07 +0000   master-replica-0        char_input (InputLayer)         (None, None, 30)     0                                            
INFO    2020-01-20 12:46:07 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:07 +0000   master-replica-0        char_embeddings (TimeDistribute (None, None, 30, 25) 7625        char_input[0][0]                 
INFO    2020-01-20 12:46:07 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:07 +0000   master-replica-0        word_input (InputLayer)         (None, None, 300)    0                                            
INFO    2020-01-20 12:46:07 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:07 +0000   master-replica-0        char_lstm (TimeDistributed)     (None, None, 50)     10200       char_embeddings[0][0]            
INFO    2020-01-20 12:46:07 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:07 +0000   master-replica-0        features_input (InputLayer)     (None, None, 77)     0                                            
INFO    2020-01-20 12:46:07 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:07 +0000   master-replica-0        concatenate_1 (Concatenate)     (None, None, 427)    0           word_input[0][0]                 
INFO    2020-01-20 12:46:07 +0000   master-replica-0                                                                         char_lstm[0][0]                  
INFO    2020-01-20 12:46:07 +0000   master-replica-0                                                                         features_input[0][0]             
INFO    2020-01-20 12:46:07 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:07 +0000   master-replica-0        dropout_1 (Dropout)             (None, None, 427)    0           concatenate_1[0][0]              
INFO    2020-01-20 12:46:07 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:07 +0000   master-replica-0        bidirectional_2 (Bidirectional) (None, None, 200)    422400      dropout_1[0][0]                  
INFO    2020-01-20 12:46:07 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:07 +0000   master-replica-0        dropout_2 (Dropout)             (None, None, 200)    0           bidirectional_2[0][0]            
INFO    2020-01-20 12:46:07 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:07 +0000   master-replica-0        dense_1 (Dense)                 (None, None, 100)    20100       dropout_2[0][0]                  
INFO    2020-01-20 12:46:07 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:07 +0000   master-replica-0        dense_ntags (Dense)             (None, None, 43)     4343        dense_1[0][0]                    
INFO    2020-01-20 12:46:07 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:07 +0000   master-replica-0        chain_crf_1 (ChainCRF)          (None, None, 43)     1935        dense_ntags[0][0]                
INFO    2020-01-20 12:46:07 +0000   master-replica-0        ==================================================================================================
INFO    2020-01-20 12:46:07 +0000   master-replica-0        Total params: 466,603
INFO    2020-01-20 12:46:07 +0000   master-replica-0        Trainable params: 466,603
INFO    2020-01-20 12:46:07 +0000   master-replica-0        Non-trainable params: 0
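
Compared with the run without features, only the main BiLSTM grows, because its input widens from 350 to 427 once the 77 one-hot feature dimensions are concatenated. A short sketch of that arithmetic, using the standard LSTM parameter formula:

```python
def bilstm_params(input_dim: int, units: int) -> int:
    """Bidirectional LSTM: two standard LSTMs, each 4 * (units * (input_dim + units) + units)."""
    return 2 * 4 * (units * (input_dim + units) + units)


without_features = bilstm_params(300 + 50, 100)      # word embedding + char_lstm output
with_features = bilstm_params(300 + 50 + 77, 100)    # plus 77 one-hot feature dimensions
extra = with_features - without_features
```

The difference of 61,600 parameters is exactly the gap between the two reported totals (466,603 vs 405,003).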

training summary

INFO    2020-01-20 20:19:14 +0000   master-replica-0        training runtime: 27184.473 seconds 
INFO    2020-01-20 20:19:14 +0000   master-replica-0        Evaluation:
INFO    2020-01-20 20:19:14 +0000   master-replica-0            f1 (micro): 75.51
INFO    2020-01-20 20:19:14 +0000   master-replica-0                          precision    recall  f1-score   support
INFO    2020-01-20 20:19:14 +0000   master-replica-0                  <note>     0.4722    0.4000    0.4331       170
INFO    2020-01-20 20:19:14 +0000   master-replica-0              <abstract>     0.8302    0.8462    0.8381       156
INFO    2020-01-20 20:19:14 +0000   master-replica-0                  <date>     0.8548    0.7910    0.8217        67
INFO    2020-01-20 20:19:14 +0000   master-replica-0                 <email>     0.8431    0.8515    0.8473       101
INFO    2020-01-20 20:19:14 +0000   master-replica-0            <submission>     0.8108    0.8108    0.8108        37
INFO    2020-01-20 20:19:14 +0000   master-replica-0                 <phone>     0.0000    0.0000    0.0000         3
INFO    2020-01-20 20:19:14 +0000   master-replica-0                 <intro>     0.4318    0.4524    0.4419        42
INFO    2020-01-20 20:19:14 +0000   master-replica-0                <pubnum>     0.7111    0.6667    0.6882        48
INFO    2020-01-20 20:19:14 +0000   master-replica-0               <address>     0.8259    0.7907    0.8079       258
INFO    2020-01-20 20:19:14 +0000   master-replica-0                <degree>     0.6000    0.5000    0.5455         6
INFO    2020-01-20 20:19:14 +0000   master-replica-0            <dedication>     1.0000    1.0000    1.0000         1
INFO    2020-01-20 20:19:14 +0000   master-replica-0             submission>     0.0000    0.0000    0.0000         2
INFO    2020-01-20 20:19:14 +0000   master-replica-0                 <grant>     0.2727    0.5000    0.3529         6
INFO    2020-01-20 20:19:14 +0000   master-replica-0                 <title>     0.8408    0.8240    0.8323       250
INFO    2020-01-20 20:19:14 +0000   master-replica-0             <reference>     0.6212    0.5395    0.5775        76
INFO    2020-01-20 20:19:14 +0000   master-replica-0                   <web>     0.5556    0.5556    0.5556        18
INFO    2020-01-20 20:19:14 +0000   master-replica-0             <copyright>     0.7273    0.7500    0.7385        32
INFO    2020-01-20 20:19:14 +0000   master-replica-0               <keyword>     0.8537    0.9211    0.8861        38
INFO    2020-01-20 20:19:14 +0000   master-replica-0           <affiliation>     0.7778    0.7517    0.7645       298
INFO    2020-01-20 20:19:14 +0000   master-replica-0                <author>     0.8684    0.8250    0.8462       240
INFO    2020-01-20 20:19:14 +0000   master-replica-0        all (micro avg.)     0.7704    0.7404    0.7551      1849

Evaluation

Evaluation:
    f1 (micro): 85.50
                  precision    recall  f1-score   support

         <email>     0.8261    0.7600    0.7917        25
         <phone>     1.0000    0.6667    0.8000         3
         <grant>     0.5000    0.5000    0.5000         2
        <author>     0.9697    0.9412    0.9552        34
       <keyword>     1.0000    1.0000    1.0000         2
          <note>     0.6667    0.3333    0.4444         6
          <date>     1.0000    1.0000    1.0000         7
         <title>     1.0000    1.0000    1.0000        26
   <affiliation>     0.7879    0.7647    0.7761        34
        <pubnum>     0.7500    0.7500    0.7500         4
       <address>     0.7419    0.7188    0.7302        32
         <intro>     1.0000    1.0000    1.0000         3
    <submission>     1.0000    1.0000    1.0000         1
           <web>     1.0000    1.0000    1.0000         3
      <abstract>     0.9545    0.9545    0.9545        22

all (micro avg.)     0.8769    0.8382    0.8571       204

Features 9-30, feature embedding 50 (epoch 36, eval f1 85.00)

model config

{
    "num_char_lstm_units": 25,
    "use_crf": true,
    "max_sequence_length": 500,
    "word_embedding_size": 300,
    "batch_size": 10,
    "use_BERT": false,
    "case_embedding_size": 5,
    "fold_number": 1,
    "feature_indices": [
        9,
        10,
        11,
        12,
        13,
        14,
        15,
        16,
        17,
        18,
        19,
        20,
        21,
        22,
        23,
        24,
        25,
        26,
        27,
        28,
        29,
        30
    ],
    "max_feature_size": 77,
    "use_features": true,
    "use_ELMo": false,
    "use_char_feature": true,
    "dropout": 0.5,
    "embeddings_name": "glove.840B.300d",
    "num_word_lstm_units": 100,
    "model_type": "CustomBidLSTM_CRF",
    "model_name": "header",
    "max_char_length": 30,
    "recurrent_dropout": 0.5,
    "char_vocab_size": 305,
    "case_vocab_size": 8,
    "char_embedding_size": 25,
    "feature_embedding_size": 50
}
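
For readers unfamiliar with the config: `feature_indices` is simply columns 9 to 30 of each feature row. The sketch below shows one way such columns could be selected and one-hot encoded after a first pass over the data. It is an illustration only and not necessarily how the trainer implements it; `fit_vocab`, `one_hot` and the toy rows are hypothetical, and the 77 in `max_feature_size` is not derived here.

```python
from typing import Dict, List, Sequence, Tuple

# Same values as "feature_indices" in the config above.
FEATURE_INDICES = list(range(9, 31))


def fit_vocab(rows: Sequence[Sequence[str]], indices: Sequence[int]) -> Dict[Tuple[int, str], int]:
    """First pass: give every (column, value) pair seen in the data its own slot."""
    vocab: Dict[Tuple[int, str], int] = {}
    for row in rows:
        for i in indices:
            vocab.setdefault((i, row[i]), len(vocab))
    return vocab


def one_hot(row: Sequence[str], indices: Sequence[int], vocab: Dict[Tuple[int, str], int]) -> List[int]:
    """Second pass: set one slot per selected column."""
    vector = [0] * len(vocab)
    for i in indices:
        slot = vocab.get((i, row[i]))
        if slot is not None:  # values not seen by fit_vocab stay all zeros
            vector[slot] = 1
    return vector


# Toy rows with 31 columns each; the values are invented for illustration.
rows = [
    ["tok"] * 9 + ["LINESTART"] + ["0"] * 21,
    ["tok"] * 9 + ["LINEIN"] + ["0"] * 21,
]
vocab = fit_vocab(rows, FEATURE_INDICES)
encoded = [one_hot(row, FEATURE_INDICES, vocab) for row in rows]
print(len(vocab), encoded[0])
```

With real data the vector width would be the number of distinct (column, value) pairs, which is why features with many values make the input grow quickly.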

keras model summary

INFO    2020-01-20 12:46:33 +0000   master-replica-0        2055 train sequences
INFO    2020-01-20 12:46:33 +0000   master-replica-0        229 validation sequences
INFO    2020-01-20 12:46:33 +0000   master-replica-0        254 evaluation sequences
INFO    2020-01-20 12:46:33 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:33 +0000   master-replica-0        Layer (type)                    Output Shape         Param #     Connected to                     
INFO    2020-01-20 12:46:33 +0000   master-replica-0        ==================================================================================================
INFO    2020-01-20 12:46:33 +0000   master-replica-0        char_input (InputLayer)         (None, None, 30)     0                                            
INFO    2020-01-20 12:46:33 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:33 +0000   master-replica-0        char_embeddings (TimeDistribute (None, None, 30, 25) 7625        char_input[0][0]                 
INFO    2020-01-20 12:46:33 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:33 +0000   master-replica-0        features_input (InputLayer)     (None, None, 77)     0                                            
INFO    2020-01-20 12:46:33 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:33 +0000   master-replica-0        word_input (InputLayer)         (None, None, 300)    0                                            
INFO    2020-01-20 12:46:33 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:33 +0000   master-replica-0        char_lstm (TimeDistributed)     (None, None, 50)     10200       char_embeddings[0][0]            
INFO    2020-01-20 12:46:33 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:33 +0000   master-replica-0        feature_embeddings (TimeDistrib (None, None, 50)     3900        features_input[0][0]             
INFO    2020-01-20 12:46:33 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:33 +0000   master-replica-0        concatenate_1 (Concatenate)     (None, None, 400)    0           word_input[0][0]                 
INFO    2020-01-20 12:46:33 +0000   master-replica-0                                                                         char_lstm[0][0]                  
INFO    2020-01-20 12:46:33 +0000   master-replica-0                                                                         feature_embeddings[0][0]         
INFO    2020-01-20 12:46:33 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:33 +0000   master-replica-0        dropout_1 (Dropout)             (None, None, 400)    0           concatenate_1[0][0]              
INFO    2020-01-20 12:46:33 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:33 +0000   master-replica-0        bidirectional_2 (Bidirectional) (None, None, 200)    400800      dropout_1[0][0]                  
INFO    2020-01-20 12:46:33 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:33 +0000   master-replica-0        dropout_2 (Dropout)             (None, None, 200)    0           bidirectional_2[0][0]            
INFO    2020-01-20 12:46:33 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:33 +0000   master-replica-0        dense_1 (Dense)                 (None, None, 100)    20100       dropout_2[0][0]                  
INFO    2020-01-20 12:46:33 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:33 +0000   master-replica-0        dense_ntags (Dense)             (None, None, 43)     4343        dense_1[0][0]                    
INFO    2020-01-20 12:46:33 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:46:33 +0000   master-replica-0        chain_crf_1 (ChainCRF)          (None, None, 43)     1935        dense_ntags[0][0]                
INFO    2020-01-20 12:46:33 +0000   master-replica-0        ==================================================================================================
INFO    2020-01-20 12:46:33 +0000   master-replica-0        Total params: 448,903
INFO    2020-01-20 12:46:33 +0000   master-replica-0        Trainable params: 448,903
INFO    2020-01-20 12:46:33 +0000   master-replica-0        Non-trainable params: 0
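
The parameter counts in the summary above can be checked by hand, which also shows where the feature embedding size enters. The sketch below uses the usual Dense and LSTM parameter formulas. The CRF formula (a square transition matrix plus two boundary vectors) is an assumption that happens to match the 1935 reported, and all function names are made up for illustration.

```python
def dense_params(n_in, n_out):
    # Weight matrix plus bias.
    return n_in * n_out + n_out


def bilstm_params(n_in, units):
    # 4 gates, each with input weights, recurrent weights and a bias,
    # doubled for the forward and backward directions.
    return 2 * 4 * (units * (n_in + units) + units)


def chain_crf_params(n_tags):
    # Assumption: transition matrix plus two boundary vectors.
    return n_tags * n_tags + 2 * n_tags


def header_model_params(feature_embedding_size, max_feature_size=77, n_tags=43):
    """Per-layer parameter counts for the architecture in the summary above."""
    # word input (300) + char BiLSTM output (2 * 25) + feature embedding
    concat_width = 300 + 2 * 25 + feature_embedding_size
    return {
        "char_embeddings": 305 * 25,  # char_vocab_size * char_embedding_size
        "char_lstm": bilstm_params(25, 25),
        "feature_embeddings": dense_params(max_feature_size, feature_embedding_size),
        "bidirectional_2": bilstm_params(concat_width, 100),
        "dense_1": dense_params(200, 100),
        "dense_ntags": dense_params(100, n_tags),
        "chain_crf_1": chain_crf_params(n_tags),
    }


layers = header_model_params(50)
for name, count in layers.items():
    print(name, count)
print("total", sum(layers.values()))
```

Only `feature_embeddings` and `bidirectional_2` depend on the feature embedding size, so changing it from 50 to 30 only moves those two counts.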

training summary

INFO    2020-01-20 22:30:18 +0000   master-replica-0        training runtime: 20800.898 seconds 
INFO    2020-01-20 22:30:18 +0000   master-replica-0        Evaluation:
INFO    2020-01-20 22:30:18 +0000   master-replica-0            f1 (micro): 75.01
INFO    2020-01-20 22:30:18 +0000   master-replica-0                          precision    recall  f1-score   support
INFO    2020-01-20 22:30:18 +0000   master-replica-0                <pubnum>     0.6400    0.6667    0.6531        48
INFO    2020-01-20 22:30:18 +0000   master-replica-0             <copyright>     0.7333    0.6875    0.7097        32
INFO    2020-01-20 22:30:18 +0000   master-replica-0                <author>     0.8448    0.8167    0.8305       240
INFO    2020-01-20 22:30:18 +0000   master-replica-0               <keyword>     0.8947    0.8947    0.8947        38
INFO    2020-01-20 22:30:18 +0000   master-replica-0             <reference>     0.6029    0.5395    0.5694        76
INFO    2020-01-20 22:30:18 +0000   master-replica-0                <degree>     0.4286    0.5000    0.4615         6
INFO    2020-01-20 22:30:18 +0000   master-replica-0                 <grant>     0.4444    0.6667    0.5333         6
INFO    2020-01-20 22:30:18 +0000   master-replica-0                 <email>     0.8367    0.8119    0.8241       101
INFO    2020-01-20 22:30:18 +0000   master-replica-0           <affiliation>     0.7705    0.7550    0.7627       298
INFO    2020-01-20 22:30:18 +0000   master-replica-0            <submission>     0.7895    0.8108    0.8000        37
INFO    2020-01-20 22:30:18 +0000   master-replica-0             submission>     0.0000    0.0000    0.0000         2
INFO    2020-01-20 22:30:18 +0000   master-replica-0                 <phone>     0.0000    0.0000    0.0000         3
INFO    2020-01-20 22:30:18 +0000   master-replica-0                 <title>     0.8537    0.8400    0.8468       250
INFO    2020-01-20 22:30:18 +0000   master-replica-0                   <web>     0.4348    0.5556    0.4878        18
INFO    2020-01-20 22:30:18 +0000   master-replica-0              <abstract>     0.8250    0.8462    0.8354       156
INFO    2020-01-20 22:30:18 +0000   master-replica-0                 <intro>     0.5000    0.4762    0.4878        42
INFO    2020-01-20 22:30:18 +0000   master-replica-0               <address>     0.8105    0.7791    0.7945       258
INFO    2020-01-20 22:30:18 +0000   master-replica-0            <dedication>     1.0000    1.0000    1.0000         1
INFO    2020-01-20 22:30:18 +0000   master-replica-0                  <note>     0.4853    0.3882    0.4314       170
INFO    2020-01-20 22:30:18 +0000   master-replica-0                  <date>     0.8387    0.7761    0.8062        67
INFO    2020-01-20 22:30:18 +0000   master-replica-0        all (micro avg.)     0.7646    0.7361    0.7501      1849

Evaluation

Evaluation:
    f1 (micro): 85.00
                  precision    recall  f1-score   support

          <date>     0.8333    0.7143    0.7692         7
       <keyword>     1.0000    1.0000    1.0000         2
       <address>     0.7419    0.7188    0.7302        32
        <pubnum>     1.0000    0.7500    0.8571         4
         <email>     0.8261    0.7600    0.7917        25
           <web>     0.7500    1.0000    0.8571         3
      <abstract>     1.0000    1.0000    1.0000        22
         <grant>     0.5000    0.5000    0.5000         2
         <phone>     1.0000    0.6667    0.8000         3
          <note>     1.0000    0.5000    0.6667         6
        <author>     0.9412    0.9412    0.9412        34
   <affiliation>     0.7879    0.7647    0.7761        34
    <submission>     0.0000    0.0000    0.0000         1
         <title>     1.0000    1.0000    1.0000        26
         <intro>     1.0000    1.0000    1.0000         3

all (micro avg.)     0.8718    0.8333    0.8521       204

Features 9-30, feature embedding 30 (epoch 41, eval f1 83.21)

model config

{
    "use_features": true,
    "char_vocab_size": 305,
    "num_word_lstm_units": 100,
    "char_embedding_size": 25,
    "use_BERT": false,
    "model_name": "header",
    "embeddings_name": "glove.840B.300d",
    "dropout": 0.5,
    "batch_size": 10,
    "word_embedding_size": 300,
    "max_feature_size": 77,
    "case_embedding_size": 5,
    "num_char_lstm_units": 25,
    "feature_embedding_size": 30,
    "recurrent_dropout": 0.5,
    "model_type": "CustomBidLSTM_CRF",
    "feature_indices": [
        9,
        10,
        11,
        12,
        13,
        14,
        15,
        16,
        17,
        18,
        19,
        20,
        21,
        22,
        23,
        24,
        25,
        26,
        27,
        28,
        29,
        30
    ],
    "case_vocab_size": 8,
    "use_ELMo": false,
    "fold_number": 1,
    "max_char_length": 30,
    "max_sequence_length": 500,
    "use_crf": true,
    "use_char_feature": true
}

keras model summary

INFO    2020-01-20 12:47:06 +0000   master-replica-0        2055 train sequences
INFO    2020-01-20 12:47:06 +0000   master-replica-0        229 validation sequences
INFO    2020-01-20 12:47:06 +0000   master-replica-0        254 evaluation sequences
INFO    2020-01-20 12:47:06 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:06 +0000   master-replica-0        Layer (type)                    Output Shape         Param #     Connected to                     
INFO    2020-01-20 12:47:06 +0000   master-replica-0        ==================================================================================================
INFO    2020-01-20 12:47:06 +0000   master-replica-0        char_input (InputLayer)         (None, None, 30)     0                                            
INFO    2020-01-20 12:47:06 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:06 +0000   master-replica-0        char_embeddings (TimeDistribute (None, None, 30, 25) 7625        char_input[0][0]                 
INFO    2020-01-20 12:47:06 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:06 +0000   master-replica-0        features_input (InputLayer)     (None, None, 77)     0                                            
INFO    2020-01-20 12:47:06 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:06 +0000   master-replica-0        word_input (InputLayer)         (None, None, 300)    0                                            
INFO    2020-01-20 12:47:06 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:06 +0000   master-replica-0        char_lstm (TimeDistributed)     (None, None, 50)     10200       char_embeddings[0][0]            
INFO    2020-01-20 12:47:06 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:06 +0000   master-replica-0        feature_embeddings (TimeDistrib (None, None, 30)     2340        features_input[0][0]             
INFO    2020-01-20 12:47:06 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:06 +0000   master-replica-0        concatenate_1 (Concatenate)     (None, None, 380)    0           word_input[0][0]                 
INFO    2020-01-20 12:47:06 +0000   master-replica-0                                                                         char_lstm[0][0]                  
INFO    2020-01-20 12:47:06 +0000   master-replica-0                                                                         feature_embeddings[0][0]         
INFO    2020-01-20 12:47:06 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:06 +0000   master-replica-0        dropout_1 (Dropout)             (None, None, 380)    0           concatenate_1[0][0]              
INFO    2020-01-20 12:47:06 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:06 +0000   master-replica-0        bidirectional_2 (Bidirectional) (None, None, 200)    384800      dropout_1[0][0]                  
INFO    2020-01-20 12:47:06 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:06 +0000   master-replica-0        dropout_2 (Dropout)             (None, None, 200)    0           bidirectional_2[0][0]            
INFO    2020-01-20 12:47:06 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:06 +0000   master-replica-0        dense_1 (Dense)                 (None, None, 100)    20100       dropout_2[0][0]                  
INFO    2020-01-20 12:47:06 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:06 +0000   master-replica-0        dense_ntags (Dense)             (None, None, 43)     4343        dense_1[0][0]                    
INFO    2020-01-20 12:47:06 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:06 +0000   master-replica-0        chain_crf_1 (ChainCRF)          (None, None, 43)     1935        dense_ntags[0][0]                
INFO    2020-01-20 12:47:06 +0000   master-replica-0        ==================================================================================================
INFO    2020-01-20 12:47:06 +0000   master-replica-0        Total params: 431,343
INFO    2020-01-20 12:47:06 +0000   master-replica-0        Trainable params: 431,343
INFO    2020-01-20 12:47:06 +0000   master-replica-0        Non-trainable params: 0

training summary

INFO    2020-01-20 18:57:40 +0000   master-replica-0        training runtime: 22236.897 seconds 
INFO    2020-01-20 18:57:40 +0000   master-replica-0        Evaluation:
INFO    2020-01-20 18:57:40 +0000   master-replica-0            f1 (micro): 75.27
INFO    2020-01-20 18:57:40 +0000   master-replica-0                          precision    recall  f1-score   support
INFO    2020-01-20 18:57:40 +0000   master-replica-0              <abstract>     0.8428    0.8590    0.8508       156
INFO    2020-01-20 18:57:40 +0000   master-replica-0                  <note>     0.4437    0.3706    0.4038       170
INFO    2020-01-20 18:57:40 +0000   master-replica-0           <affiliation>     0.7864    0.7785    0.7825       298
INFO    2020-01-20 18:57:40 +0000   master-replica-0               <keyword>     0.9211    0.9211    0.9211        38
INFO    2020-01-20 18:57:40 +0000   master-replica-0               <address>     0.8086    0.8023    0.8054       258
INFO    2020-01-20 18:57:40 +0000   master-replica-0                <degree>     0.3333    0.3333    0.3333         6
INFO    2020-01-20 18:57:40 +0000   master-replica-0                <author>     0.8462    0.8250    0.8354       240
INFO    2020-01-20 18:57:40 +0000   master-replica-0                  <date>     0.7681    0.7910    0.7794        67
INFO    2020-01-20 18:57:40 +0000   master-replica-0                 <intro>     0.4468    0.5000    0.4719        42
INFO    2020-01-20 18:57:40 +0000   master-replica-0                 <phone>     0.0000    0.0000    0.0000         3
INFO    2020-01-20 18:57:40 +0000   master-replica-0             <reference>     0.5672    0.5000    0.5315        76
INFO    2020-01-20 18:57:40 +0000   master-replica-0                 <title>     0.8571    0.8400    0.8485       250
INFO    2020-01-20 18:57:40 +0000   master-replica-0            <submission>     0.7568    0.7568    0.7568        37
INFO    2020-01-20 18:57:40 +0000   master-replica-0             submission>     0.0000    0.0000    0.0000         2
INFO    2020-01-20 18:57:40 +0000   master-replica-0                 <grant>     0.5000    0.6667    0.5714         6
INFO    2020-01-20 18:57:40 +0000   master-replica-0             <copyright>     0.8000    0.7500    0.7742        32
INFO    2020-01-20 18:57:40 +0000   master-replica-0                 <email>     0.7900    0.7822    0.7861       101
INFO    2020-01-20 18:57:40 +0000   master-replica-0                   <web>     0.5417    0.7222    0.6190        18
INFO    2020-01-20 18:57:40 +0000   master-replica-0                <pubnum>     0.7292    0.7292    0.7292        48
INFO    2020-01-20 18:57:40 +0000   master-replica-0            <dedication>     1.0000    1.0000    1.0000         1
INFO    2020-01-20 18:57:40 +0000   master-replica-0        all (micro avg.)     0.7612    0.7447    0.7529      1849

Evaluation

Evaluation:
    f1 (micro): 83.21
                  precision    recall  f1-score   support

    <submission>     1.0000    1.0000    1.0000         1
      <abstract>     0.9545    0.9545    0.9545        22
           <web>     0.7500    1.0000    0.8571         3
         <grant>     0.5000    0.5000    0.5000         2
         <phone>     0.5000    0.3333    0.4000         3
       <address>     0.7188    0.7188    0.7188        32
          <date>     0.8571    0.8571    0.8571         7
        <pubnum>     1.0000    0.7500    0.8571         4
        <author>     0.9697    0.9412    0.9552        34
         <title>     0.9615    0.9615    0.9615        26
       <keyword>     1.0000    1.0000    1.0000         2
          <note>     1.0000    0.1667    0.2857         6
         <intro>     1.0000    1.0000    1.0000         3
   <affiliation>     0.7879    0.7647    0.7761        34
         <email>     0.7826    0.7200    0.7500        25

all (micro avg.)     0.8557    0.8137    0.8342       204

Features 9-30, feature embedding 30 (epoch 31, eval f1 82.91) (repeat run: feature embedding size 30 was accidentally used again)

model config

{
    "char_embedding_size": 25,
    "fold_number": 1,
    "embeddings_name": "glove.840B.300d",
    "feature_indices": [
        9,
        10,
        11,
        12,
        13,
        14,
        15,
        16,
        17,
        18,
        19,
        20,
        21,
        22,
        23,
        24,
        25,
        26,
        27,
        28,
        29,
        30
    ],
    "max_char_length": 30,
    "use_crf": true,
    "batch_size": 10,
    "use_BERT": false,
    "max_sequence_length": 500,
    "model_type": "CustomBidLSTM_CRF",
    "recurrent_dropout": 0.5,
    "char_vocab_size": 305,
    "word_embedding_size": 300,
    "num_char_lstm_units": 25,
    "max_feature_size": 77,
    "model_name": "header",
    "dropout": 0.5,
    "case_embedding_size": 5,
    "num_word_lstm_units": 100,
    "use_char_feature": true,
    "use_features": true,
    "use_ELMo": false,
    "feature_embedding_size": 30,
    "case_vocab_size": 8
}

keras model summary

INFO    2020-01-20 12:47:34 +0000   master-replica-0        2055 train sequences
INFO    2020-01-20 12:47:34 +0000   master-replica-0        229 validation sequences
INFO    2020-01-20 12:47:34 +0000   master-replica-0        254 evaluation sequences
INFO    2020-01-20 12:47:34 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:34 +0000   master-replica-0        Layer (type)                    Output Shape         Param #     Connected to                     
INFO    2020-01-20 12:47:34 +0000   master-replica-0        ==================================================================================================
INFO    2020-01-20 12:47:34 +0000   master-replica-0        char_input (InputLayer)         (None, None, 30)     0                                            
INFO    2020-01-20 12:47:34 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:34 +0000   master-replica-0        char_embeddings (TimeDistribute (None, None, 30, 25) 7625        char_input[0][0]                 
INFO    2020-01-20 12:47:34 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:34 +0000   master-replica-0        features_input (InputLayer)     (None, None, 77)     0                                            
INFO    2020-01-20 12:47:34 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:34 +0000   master-replica-0        word_input (InputLayer)         (None, None, 300)    0                                            
INFO    2020-01-20 12:47:34 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:34 +0000   master-replica-0        char_lstm (TimeDistributed)     (None, None, 50)     10200       char_embeddings[0][0]            
INFO    2020-01-20 12:47:34 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:34 +0000   master-replica-0        feature_embeddings (TimeDistrib (None, None, 30)     2340        features_input[0][0]             
INFO    2020-01-20 12:47:34 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:34 +0000   master-replica-0        concatenate_1 (Concatenate)     (None, None, 380)    0           word_input[0][0]                 
INFO    2020-01-20 12:47:34 +0000   master-replica-0                                                                         char_lstm[0][0]                  
INFO    2020-01-20 12:47:34 +0000   master-replica-0                                                                         feature_embeddings[0][0]         
INFO    2020-01-20 12:47:34 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:34 +0000   master-replica-0        dropout_1 (Dropout)             (None, None, 380)    0           concatenate_1[0][0]              
INFO    2020-01-20 12:47:34 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:34 +0000   master-replica-0        bidirectional_2 (Bidirectional) (None, None, 200)    384800      dropout_1[0][0]                  
INFO    2020-01-20 12:47:34 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:34 +0000   master-replica-0        dropout_2 (Dropout)             (None, None, 200)    0           bidirectional_2[0][0]            
INFO    2020-01-20 12:47:34 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:34 +0000   master-replica-0        dense_1 (Dense)                 (None, None, 100)    20100       dropout_2[0][0]                  
INFO    2020-01-20 12:47:34 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:34 +0000   master-replica-0        dense_ntags (Dense)             (None, None, 43)     4343        dense_1[0][0]                    
INFO    2020-01-20 12:47:34 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-20 12:47:34 +0000   master-replica-0        chain_crf_1 (ChainCRF)          (None, None, 43)     1935        dense_ntags[0][0]                
INFO    2020-01-20 12:47:34 +0000   master-replica-0        ==================================================================================================
INFO    2020-01-20 12:47:34 +0000   master-replica-0        Total params: 431,343
INFO    2020-01-20 12:47:34 +0000   master-replica-0        Trainable params: 431,343
INFO    2020-01-20 12:47:34 +0000   master-replica-0        Non-trainable params: 0

training summary

INFO    2020-01-20 22:02:02 +0000   master-replica-0        training runtime: 14670.097 seconds 
INFO    2020-01-20 22:02:02 +0000   master-replica-0        Evaluation:
INFO    2020-01-20 22:02:02 +0000   master-replica-0            f1 (micro): 73.29
INFO    2020-01-20 22:02:02 +0000   master-replica-0                          precision    recall  f1-score   support
INFO    2020-01-20 22:02:02 +0000   master-replica-0                <author>     0.8502    0.8042    0.8266       240
INFO    2020-01-20 22:02:02 +0000   master-replica-0                 <email>     0.8061    0.7822    0.7940       101
INFO    2020-01-20 22:02:02 +0000   master-replica-0            <submission>     0.7941    0.7297    0.7606        37
INFO    2020-01-20 22:02:02 +0000   master-replica-0               <address>     0.7967    0.7442    0.7695       258
INFO    2020-01-20 22:02:02 +0000   master-replica-0                  <note>     0.4367    0.4059    0.4207       170
INFO    2020-01-20 22:02:02 +0000   master-replica-0             submission>     0.0000    0.0000    0.0000         2
INFO    2020-01-20 22:02:02 +0000   master-replica-0                 <grant>     0.4286    0.5000    0.4615         6
INFO    2020-01-20 22:02:02 +0000   master-replica-0             <copyright>     0.6667    0.6250    0.6452        32
INFO    2020-01-20 22:02:02 +0000   master-replica-0                  <date>     0.7812    0.7463    0.7634        67
INFO    2020-01-20 22:02:02 +0000   master-replica-0               <keyword>     0.8718    0.8947    0.8831        38
INFO    2020-01-20 22:02:02 +0000   master-replica-0                <pubnum>     0.6522    0.6250    0.6383        48
INFO    2020-01-20 22:02:02 +0000   master-replica-0             <reference>     0.6780    0.5263    0.5926        76
INFO    2020-01-20 22:02:02 +0000   master-replica-0              <abstract>     0.8323    0.8590    0.8454       156
INFO    2020-01-20 22:02:02 +0000   master-replica-0            <dedication>     1.0000    1.0000    1.0000         1
INFO    2020-01-20 22:02:02 +0000   master-replica-0                <degree>     0.4000    0.3333    0.3636         6
INFO    2020-01-20 22:02:02 +0000   master-replica-0                 <phone>     0.0000    0.0000    0.0000         3
INFO    2020-01-20 22:02:02 +0000   master-replica-0                   <web>     0.4545    0.5556    0.5000        18
INFO    2020-01-20 22:02:02 +0000   master-replica-0                 <intro>     0.5128    0.4762    0.4938        42
INFO    2020-01-20 22:02:02 +0000   master-replica-0                 <title>     0.8436    0.8200    0.8316       250
INFO    2020-01-20 22:02:02 +0000   master-replica-0           <affiliation>     0.7599    0.7114    0.7348       298
INFO    2020-01-20 22:02:02 +0000   master-replica-0        all (micro avg.)     0.7527    0.7144    0.7331      1849

Evaluation

Evaluation:
    f1 (micro): 82.91
                  precision    recall  f1-score   support

           <web>     1.0000    1.0000    1.0000         3
      <abstract>     0.9545    0.9545    0.9545        22
    <submission>     1.0000    1.0000    1.0000         1
          <note>     0.5000    0.1667    0.2500         6
          <date>     0.8571    0.8571    0.8571         7
       <address>     0.7188    0.7188    0.7188        32
        <author>     0.9412    0.9412    0.9412        34
         <email>     0.7619    0.6400    0.6957        25
         <title>     0.9615    0.9615    0.9615        26
         <intro>     1.0000    1.0000    1.0000         3
       <keyword>     1.0000    1.0000    1.0000         2
         <phone>     1.0000    0.6667    0.8000         3
        <pubnum>     1.0000    0.7500    0.8571         4
   <affiliation>     0.7879    0.7647    0.7761        34
         <grant>     0.5000    0.5000    0.5000         2

all (micro avg.)     0.8549    0.8088    0.8312       204
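
Since the run headings follow a fixed pattern, they can be collected into a small comparison instead of scrolling through the logs. A rough sketch, with the heading strings copied from the three runs above (trailing remarks dropped); `parse_heading` is a hypothetical helper.

```python
import re

# Headings of the three runs above, without any trailing remark.
HEADINGS = [
    "Features 9-30, feature embedding 50 (epoch 36, eval f1 85.00)",
    "Features 9-30, feature embedding 30 (epoch 41, eval f1 83.21)",
    "Features 9-30, feature embedding 30 (epoch 31, eval f1 82.91)",
]

PATTERN = re.compile(r"feature embedding (\d+) \(epoch (\d+), eval f1 ([\d.]+)\)")


def parse_heading(heading):
    """Return (feature_embedding_size, epoch, eval_f1) from a run heading."""
    match = PATTERN.search(heading)
    if match is None:
        raise ValueError(f"unexpected heading: {heading!r}")
    size, epoch, f1 = match.groups()
    return int(size), int(epoch), float(f1)


runs = [parse_heading(heading) for heading in HEADINGS]
# Best eval f1 first.
for size, epoch, f1 in sorted(runs, key=lambda run: run[2], reverse=True):
    print(f"embedding={size:3d} epoch={epoch:3d} eval_f1={f1:.2f}")
```

The two runs with embedding size 30 differ by 0.3 in eval f1, which gives a rough idea of the run-to-run noise to keep in mind when comparing sizes.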

Features 9-30, feature embedding 80 (epoch 40, eval f1 85.00)

model config

{
    "embeddings_name": "glove.840B.300d",
    "word_embedding_size": 300,
    "char_vocab_size": 305,
    "feature_indices": [
        9,
        10,
        11,
        12,
        13,
        14,
        15,
        16,
        17,
        18,
        19,
        20,
        21,
        22,
        23,
        24,
        25,
        26,
        27,
        28,
        29,
        30
    ],
    "use_BERT": false,
    "model_name": "header",
    "use_crf": true,
    "num_char_lstm_units": 25,
    "case_vocab_size": 8,
    "case_embedding_size": 5,
    "batch_size": 10,
    "feature_embedding_size": 80,
    "char_embedding_size": 25,
    "num_word_lstm_units": 100,
    "fold_number": 1,
    "dropout": 0.5,
    "max_feature_size": 77,
    "model_type": "CustomBidLSTM_CRF",
    "use_char_feature": true,
    "max_sequence_length": 500,
    "recurrent_dropout": 0.5,
    "max_char_length": 30,
    "use_ELMo": false,
    "use_features": true
}

keras model summary

INFO    2020-01-21 10:38:39 +0000   master-replica-0        2055 train sequences
INFO    2020-01-21 10:38:39 +0000   master-replica-0        229 validation sequences
INFO    2020-01-21 10:38:39 +0000   master-replica-0        254 evaluation sequences
INFO    2020-01-21 10:38:39 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-21 10:38:39 +0000   master-replica-0        Layer (type)                    Output Shape         Param #     Connected to                     
INFO    2020-01-21 10:38:39 +0000   master-replica-0        ==================================================================================================
INFO    2020-01-21 10:38:39 +0000   master-replica-0        char_input (InputLayer)         (None, None, 30)     0                                            
INFO    2020-01-21 10:38:39 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-21 10:38:39 +0000   master-replica-0        char_embeddings (TimeDistribute (None, None, 30, 25) 7625        char_input[0][0]                 
INFO    2020-01-21 10:38:39 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-21 10:38:39 +0000   master-replica-0        features_input (InputLayer)     (None, None, 77)     0                                            
INFO    2020-01-21 10:38:39 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-21 10:38:39 +0000   master-replica-0        word_input (InputLayer)         (None, None, 300)    0                                            
INFO    2020-01-21 10:38:39 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-21 10:38:39 +0000   master-replica-0        char_lstm (TimeDistributed)     (None, None, 50)     10200       char_embeddings[0][0]            
INFO    2020-01-21 10:38:39 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-21 10:38:39 +0000   master-replica-0        feature_embeddings (TimeDistrib (None, None, 80)     6240        features_input[0][0]             
INFO    2020-01-21 10:38:39 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-21 10:38:39 +0000   master-replica-0        concatenate_1 (Concatenate)     (None, None, 430)    0           word_input[0][0]                 
INFO    2020-01-21 10:38:39 +0000   master-replica-0                                                                         char_lstm[0][0]                  
INFO    2020-01-21 10:38:39 +0000   master-replica-0                                                                         feature_embeddings[0][0]         
INFO    2020-01-21 10:38:39 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-21 10:38:39 +0000   master-replica-0        dropout_1 (Dropout)             (None, None, 430)    0           concatenate_1[0][0]              
INFO    2020-01-21 10:38:39 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-21 10:38:39 +0000   master-replica-0        bidirectional_2 (Bidirectional) (None, None, 200)    424800      dropout_1[0][0]                  
INFO    2020-01-21 10:38:39 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-21 10:38:39 +0000   master-replica-0        dropout_2 (Dropout)             (None, None, 200)    0           bidirectional_2[0][0]            
INFO    2020-01-21 10:38:39 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-21 10:38:39 +0000   master-replica-0        dense_1 (Dense)                 (None, None, 100)    20100       dropout_2[0][0]                  
INFO    2020-01-21 10:38:39 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-21 10:38:39 +0000   master-replica-0        dense_ntags (Dense)             (None, None, 43)     4343        dense_1[0][0]                    
INFO    2020-01-21 10:38:39 +0000   master-replica-0        __________________________________________________________________________________________________
INFO    2020-01-21 10:38:39 +0000   master-replica-0        chain_crf_1 (ChainCRF)          (None, None, 43)     1935        dense_ntags[0][0]                
INFO    2020-01-21 10:38:39 +0000   master-replica-0        ==================================================================================================
INFO    2020-01-21 10:38:39 +0000   master-replica-0        Total params: 475,243
INFO    2020-01-21 10:38:39 +0000   master-replica-0        Trainable params: 475,243
INFO    2020-01-21 10:38:39 +0000   master-replica-0        Non-trainable params: 0
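
The parameter counts in the summary above can be cross-checked by hand. The sketch below uses the standard LSTM and dense-layer parameter formulas together with sizes from the model config. Two points are assumptions that merely happen to match the reported numbers: that `feature_embeddings` is a dense layer with bias, and that the CRF holds a tag-by-tag transition matrix plus two boundary vectors.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


def lstm_params(n_in: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (n_in + units) + units)


def bilstm_params(n_in: int, units: int) -> int:
    """A bidirectional wrapper holds two independent LSTMs."""
    return 2 * lstm_params(n_in, units)


n_tags = 43  # output width of dense_ntags in the summary

layers = {
    "char_embeddings": 305 * 25,                 # char_vocab_size * char_embedding_size
    "char_lstm": bilstm_params(25, 25),          # num_char_lstm_units = 25
    "feature_embeddings": dense_params(77, 80),  # assumed dense layer with bias
}

# Word embedding (300) + char BiLSTM output (2 * 25) + feature embedding (80).
concat_width = 300 + 2 * 25 + 80

layers["bidirectional_2"] = bilstm_params(concat_width, 100)
layers["dense_1"] = dense_params(2 * 100, 100)
layers["dense_ntags"] = dense_params(100, n_tags)
# Assumed CRF layout: transition matrix plus start and end boundary vectors.
layers["chain_crf_1"] = n_tags * n_tags + 2 * n_tags

total_params = sum(layers.values())
print(concat_width, total_params)
```

Under these assumptions every layer matches the logged value, and the concatenated input width comes out at 430, of which the layout features contribute 80.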

training summary

INFO    2020-01-21 16:50:30 +0000   master-replica-0        training runtime: 22309.499 seconds 
INFO    2020-01-21 16:50:30 +0000   master-replica-0        Evaluation:
INFO    2020-01-21 16:50:30 +0000   master-replica-0            f1 (micro): 75.57
INFO    2020-01-21 16:50:30 +0000   master-replica-0                          precision    recall  f1-score   support
INFO    2020-01-21 16:50:30 +0000   master-replica-0             <copyright>     0.7273    0.7500    0.7385        32
INFO    2020-01-21 16:50:30 +0000   master-replica-0                 <intro>     0.4898    0.5714    0.5275        42
INFO    2020-01-21 16:50:30 +0000   master-replica-0                 <phone>     0.0000    0.0000    0.0000         3
INFO    2020-01-21 16:50:30 +0000   master-replica-0                <pubnum>     0.7500    0.6875    0.7174        48
INFO    2020-01-21 16:50:30 +0000   master-replica-0                <author>     0.8504    0.8292    0.8397       240
INFO    2020-01-21 16:50:30 +0000   master-replica-0              <abstract>     0.8758    0.8590    0.8673       156
INFO    2020-01-21 16:50:30 +0000   master-replica-0            <dedication>     1.0000    1.0000    1.0000         1
INFO    2020-01-21 16:50:30 +0000   master-replica-0                <degree>     0.5000    0.3333    0.4000         6
INFO    2020-01-21 16:50:30 +0000   master-replica-0                  <note>     0.4080    0.4176    0.4128       170
INFO    2020-01-21 16:50:30 +0000   master-replica-0           <affiliation>     0.7819    0.7819    0.7819       298
INFO    2020-01-21 16:50:30 +0000   master-replica-0                 <grant>     0.5000    0.6667    0.5714         6
INFO    2020-01-21 16:50:30 +0000   master-replica-0             submission>     0.0000    0.0000    0.0000         2
INFO    2020-01-21 16:50:30 +0000   master-replica-0             <reference>     0.6364    0.5526    0.5915        76
INFO    2020-01-21 16:50:30 +0000   master-replica-0                 <title>     0.8150    0.8280    0.8214       250
INFO    2020-01-21 16:50:30 +0000   master-replica-0                 <email>     0.8125    0.7723    0.7919       101
INFO    2020-01-21 16:50:30 +0000   master-replica-0                  <date>     0.7910    0.7910    0.7910        67
INFO    2020-01-21 16:50:30 +0000   master-replica-0               <keyword>     0.8974    0.9211    0.9091        38
INFO    2020-01-21 16:50:30 +0000   master-replica-0               <address>     0.8440    0.8178    0.8307       258
INFO    2020-01-21 16:50:30 +0000   master-replica-0                   <web>     0.5263    0.5556    0.5405        18
INFO    2020-01-21 16:50:30 +0000   master-replica-0            <submission>     0.7778    0.7568    0.7671        37
INFO    2020-01-21 16:50:30 +0000   master-replica-0        all (micro avg.)     0.7603    0.7512    0.7557      1849
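
The micro-averaged F1 is the harmonic mean of the micro-averaged precision and recall, so the `all (micro avg.)` row can be checked against the reported `f1 (micro)` value. A minimal sketch, with `f1_score` being an illustrative helper rather than a DeLFT function:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall from the "all (micro avg.)" row of the training summary.
micro_f1 = f1_score(0.7603, 0.7512)
print(round(micro_f1 * 100, 2))
```

Because the inputs are already rounded to four decimals, the recomputed value can differ from a logged one in the last digit.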

Evaluation

Evaluation:
    f1 (micro): 85.00
                  precision    recall  f1-score   support

          <date>     0.8571    0.8571    0.8571         7
        <pubnum>     1.0000    0.7500    0.8571         4
      <abstract>     0.9545    0.9545    0.9545        22
         <email>     0.8636    0.7600    0.8085        25
         <intro>     1.0000    1.0000    1.0000         3
    <submission>     1.0000    1.0000    1.0000         1
         <phone>     1.0000    0.6667    0.8000         3
       <address>     0.6970    0.7188    0.7077        32
       <keyword>     1.0000    1.0000    1.0000         2
        <author>     0.9697    0.9412    0.9552        34
           <web>     1.0000    1.0000    1.0000         3
   <affiliation>     0.7879    0.7647    0.7761        34
         <title>     1.0000    1.0000    1.0000        26
         <grant>     0.5000    0.5000    0.5000         2
          <note>     0.6667    0.3333    0.4444         6

all (micro avg.)     0.8718    0.8333    0.8521       204

lfoppiano commented 4 years ago

@de-code thanks! There is a 2% gain in certain cases... mmm interesting

de-code commented 4 years ago

> @de-code thanks! There is a 2% gain in certain cases... mmm interesting

Yes, though I am not sure how good the header test set is. It seems relatively small.

Since I have the models, all of the checkpoints and the logs saved, I could run the evaluation again on a different test set. I would just need it in the DeLFT format.

kermitt2 / delft