Thank you! Yes, it's something we definitely want to add; layout features should bring some improvements and make the models competitive with the current CRF models that are using them (and maybe even better).
Just had a quick look at your implementation, great work I think! We can ignore the lexical features like prefix/suffix (the first 8), but also more beyond 9, like shadow number and so on (which have many values), because we already have a character input channel in every architecture. Just one-hot encoding the layout features as you are doing is probably enough, I think (note: there is a `dense_to_one_hot()` function in `preprocess.py`, already used for the case features of one model).
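To make the one-hot idea concrete, here is a minimal sketch in plain Python. The helper name `one_hot_encode_features` and the sample feature values are hypothetical and only for illustration; this is not the `dense_to_one_hot()` function from `preprocess.py`.

```python
def one_hot_encode_features(rows):
    """One-hot encode each categorical column and concatenate the results.

    `rows` holds one list of categorical feature values per token.
    Hypothetical helper, for illustration only.
    """
    n_columns = len(rows[0])
    # Build one vocabulary (value -> position) per feature column.
    vocabs = []
    for col in range(n_columns):
        values = sorted({row[col] for row in rows})
        vocabs.append({value: i for i, value in enumerate(values)})

    encoded = []
    for row in rows:
        vector = []
        for col, vocab in enumerate(vocabs):
            one_hot = [0] * len(vocab)
            one_hot[vocab[row[col]]] = 1
            vector.extend(one_hot)  # concatenate the per-feature one-hot vectors
        encoded.append(vector)
    return encoded


# Made-up layout features per token: (line position, font status, bold flag)
tokens = [
    ["LINESTART", "NEWFONT", "1"],
    ["LINEIN", "SAMEFONT", "1"],
    ["LINEEND", "SAMEFONT", "0"],
]
vectors = one_hot_encode_features(tokens)
```

Each token ends up with a vector whose width is the sum of the per-feature vocabulary sizes, which is why features with many distinct values quickly become expensive.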
The "gazetteers" features could help, but they typically don't help NER models, so that might also be the case here.
We could imagine a first pass in the reader to see the number of values for each feature and use that to select which one will be used for one-hot encoding and concatenation.
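A first pass of this kind could look roughly like the sketch below. The function name and the sample rows are hypothetical; the default threshold of 12 mirrors the value quoted later in this discussion and is otherwise an assumption.

```python
def select_low_cardinality_features(rows, max_cardinality=12):
    """Return the indices of feature columns with fewer than
    `max_cardinality` distinct values in the given rows.

    Hypothetical helper, for illustration only.
    """
    distinct = [set() for _ in rows[0]]
    for row in rows:
        for index, value in enumerate(row):
            distinct[index].add(value)
    return [
        index
        for index, values in enumerate(distinct)
        if len(values) < max_cardinality
    ]


# Made-up rows: column 0 is token-like (many values), columns 1 and 2 are layout-like.
rows = [
    ["token%d" % i, "LINESTART" if i % 5 == 0 else "LINEIN", str(i % 2)]
    for i in range(20)
]
selected = select_low_cardinality_features(rows)
```

Here only the two low-cardinality columns are kept; the token-like column is skipped because it has 20 distinct values.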
Thanks a lot for the great contribution! (and sorry for not being very responsive currently)
Finally got around to doing some end-to-end evaluation.
Not yet using GROBID's evaluation (the first attempt failed and it doesn't quite fit into my workflow yet).
I haven't done it on the full PMC sample 1943 dataset but rather on a random sample of 390.
On our author submitted dataset (not trained on yet) this looks about:
(Since 0.5.5 it has been failing to convert a number of those manuscripts, which I will need to investigate; they are just ignored rather than counted negatively.)
Implementation question: do we want to add a command-line parameter such as `--use_features` or `--ignore-features`?
I noticed that @de-code implemented a long list of command-line parameters in https://github.com/elifesciences/sciencebeam-trainer-delft, which might be hard to navigate, but is useful for quick scripting. What's your opinion?
The reason I'm asking is that right now I would like to run a quick test to see whether the features have an impact; a command-line parameter would allow me to run the command twice without touching anything.
We probably want yes :)
Likely a `--ignore-features` I guess, because when features are available I think they will likely improve the results for many grobid models (because they capture layout information), so using them should be the default.
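A minimal `argparse` sketch of such an opt-out flag is shown below. Only the flag name `--ignore-features` comes from this discussion; the parser setup, help text, and variable names are assumptions and do not describe the actual `grobidTagger.py` options.

```python
import argparse


def build_parser():
    # Hypothetical parser; not the real grobidTagger.py command line.
    parser = argparse.ArgumentParser(description="Sketch of a training CLI")
    parser.add_argument(
        "--ignore-features",
        action="store_true",
        help="do not use the layout features, even if the input provides them",
    )
    return parser


# Features are used by default; the flag switches them off.
args = build_parser().parse_args(["--ignore-features"])
use_features = not args.ignore_features
```

With this shape, the same command can be run twice, once with and once without the flag, to compare both settings.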
Thanks! I'm also wondering whether we should pass the features to the CRF layer in some way, explicitly?
No, the CRF layer just acts as an activation function before the output, to compute the probability distributions of the possible labels from the last neuron layer. It's the role of the previous layers to "digest" these additional input features.
I see.
Now in the implementation, I select a feature only when the cardinality of its values appearing in the training data is below the feature max length (12).
@de-code implemented an additional parameter that allows the user to select explicitly which features to include. I think that would make the approach more resilient, for example by preventing low-variability, (potentially) useless features from being included.
I think all the features with cardinality more than 12 are useless because they are all character-based patterns (prefix, suffix, word shape, ...) and the DL architectures have a character input channel already specifically dedicated to this. So these features would very likely be redundant and could actually rather degrade the training (via usual overfitting problems because they are very specific).
Of course features with cardinality less than 12 can also be useless, typically casing and gazetteer are not helping, so a feature selection mechanism certainly makes sense too, though these features might just be ignored during training - something interesting to benchmark!
OK, indeed.
Since @de-code already implemented something working, I would just integrate it as a list (including ranges). If not specified, the system will try automatic selection as I've implemented now.
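As an illustration of a "list including ranges", the sketch below parses a specification such as `9-30` into explicit indices. The syntax (comma-separated items, `start-end` for inclusive ranges) and the function name are assumptions, not the format actually implemented.

```python
def parse_feature_indices(spec):
    """Parse a spec like "9-12,15" into [9, 10, 11, 12, 15].

    Hypothetical syntax: comma-separated integers, with "start-end"
    denoting an inclusive range.
    """
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate empty items such as a trailing comma
        if "-" in part:
            start, end = part.split("-", 1)
            indices.extend(range(int(start), int(end) + 1))
        else:
            indices.append(int(part))
    return indices


feature_indices = parse_feature_indices("9-30")
```

An empty specification yields an empty list, which a caller could treat as "fall back to automatic selection".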
Just some random thoughts on the feature indices:
* I wasn't sure how fixed the features are across the models, including GROBID submodules. Could they potentially provide different features?
Yes, it's up to the model designer / design
* In general it probably doesn't make much sense to provide the first 8 features or so to DeLFT, as it should be the responsibility of the model to create those on the fly. But it's probably just easier to keep them the same as what is currently used for Wapiti.
Yes, indeed.
* Keeping the feature indices internally has the advantage that they can be stored in the model config for visibility, and that an existing model can be loaded with those indices even after the default has changed
Very good point.
* I personally like being able to do as much hyper parameter "tuning" (hacking) via the command line as possible
I see, for this I'm not trying to change the current approach at the moment
- I personally like being able to do as much hyper parameter "tuning" (hacking) via the command line as possible
There are too many different hyper parameters for each model, often several per layer, plus plenty of possible training parameters, I think it's not manageable with command line. What is often done in libraries supporting several architectures is to have dedicated config files, one for each architecture with the different hyper parameters (a bit like the current config file associated to each produced model).
The question is maybe how much these models make sense outside Grobid. In my original intent, DeLFT was not supposed to provide its own controls over the Grobid models: only Grobid, with delft interfaced via JEP, trains, evals and runs models, because only Grobid can generate the training data with features and the data to be labeled with features (because features and tokens are usually derived from the PDF). So ideally the `grobidTagger.py` file was not supposed to stay, or only as a way to debug models.
However, I had a problem training in DeLFT from Grobid, because the Python training process and its “stdout” output often get stuck when interfaced with JEP and never end. I didn't find enough time to solve the problem, so I added a training method that just calls `grobidTagger.py train` in an external process. It works of course, but it's more of a hack; it should normally also use JEP.
Honestly I don't feel very good building too much stuff to manage grobid training data and models in DeLFT, because it is natural to drive that from Grobid and it would be redundant, painful to maintain. At least having exactly the same input files for both Wapiti and DeLFT is a must I think to keep things minimally simple and transparent.
I don't know if I am very clear with my original design idea, but of course if grobid models as such, independently from Grobid (maybe the date or person parsing models), are useful, this effort could be justified.
Training via GROBID has the advantage that it is familiar to someone who has trained Wapiti before.
But for me personally, it doesn't work very well.
The main reason is that it requires a full GROBID setup, whereas I am trying to run the training on a separate machine on demand. I do not own a GPU but borrow one from the cloud for a short period. There are tools to do that from a Python code base, but also needing the GROBID setup (which isn't just a library or CLI call) would be a significant roadblock. And running the training from DeLFT directly, I can run training in parallel.
There is also no reason why a machine learning expert shouldn't be able to just improve the model via the Python code / DeLFT.
As for CLI parameters vs config file:
Google, for example, offers hyperparameter tuning (which I haven't used), but as I understand it, it would also pass parameters to the CLI.
A config file is certainly better than having to make code changes. I would think of config files as something more persistent. For example, we generate a config file as part of the model or the GROBID configuration. A config file could describe the default arguments, while command-line arguments could allow overriding the defaults. The command-line parameters could be scoped and generated based on the shared "tuning parameters" available via a config file.
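One way to read this suggestion is sketched below: defaults come from a JSON config, and each default becomes a command-line option that can override it. The function name `resolve_config` and the option naming are assumptions; the three sample values are taken from the model configs posted later in this thread. The sketch only handles numeric and string defaults (booleans would need separate handling).

```python
import argparse
import json

# Sample defaults; in practice this would be read from a config file.
DEFAULT_CONFIG_JSON = (
    '{"batch_size": 10, "dropout": 0.5, "max_sequence_length": 500}'
)


def resolve_config(config_json, argv):
    """Build a parser from the config defaults and apply CLI overrides.

    Hypothetical helper: every key "some_name" becomes "--some-name".
    """
    defaults = json.loads(config_json)
    parser = argparse.ArgumentParser()
    for name, default in defaults.items():
        parser.add_argument(
            "--" + name.replace("_", "-"),
            type=type(default),  # int, float or str, inferred from the default
            default=default,
        )
    return vars(parser.parse_args(argv))


config = resolve_config(DEFAULT_CONFIG_JSON, ["--batch-size", "20"])
```

The config file stays the persistent description of the model, while one-off experiments only need a different command line.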
The models themselves could probably be considered more generic. But maybe it makes sense for the CLI to be grobid-specific until we have other use cases?
There is also no reason why a machine learning expert shouldn't be able to just improve the model via the Python code / DeLFT.
Ok indeed, good to have the possibility to train in DeLFT/python, I agree!
So the only remaining problem is that DeLFT alone cannot generate the training/eval files with the features (the `.train` and `.test` files generated by Grobid). We could think about a mechanism to generate those files in Grobid and place them automatically in the `data/` directory of DeLFT, so that it is then easy to switch to Python for training/tuning/evaluating/etc.
We could add a dropwizard command in the grobid service to handle the integration with delft, such as generating the data and, if needed, other stuff.
I would love if we could easily generate the training data. Even more so if we could parallelise it (e.g. via a cluster). A service would work well for that.
Here are the evaluation results using my implementation using different parameters...
All of them use the training and test data generated from GROBID 0.5.6.
Apart from the mentioned parameters it is using common parameters, such as:
name | value |
---|---|
embeddings | glove.840B.300d |
word_lstm_units | 100 |
action | train_eval |
shuffle-input | True |
random-seed | 42 |
By feature embedding below, I mean a Dense layer after the feature input.
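To see what the extra Dense layer costs, the parameter counts can be worked out by hand. The sketch below uses the standard parameter-count formulas for a Dense layer and a bidirectional LSTM; the sizes (77 one-hot feature values, 300-dimensional word embeddings, a 50-dimensional character LSTM output, 100 word LSTM units, feature embedding size 50) are the ones reported in the configs and model summaries further down. The function names are made up for this example.

```python
def dense_param_count(input_size, units):
    # One weight per (input, unit) pair plus one bias per unit.
    return input_size * units + units


def bilstm_param_count(input_size, units):
    # Each direction has 4 gates, each with input weights,
    # recurrent weights and a bias; two directions in total.
    return 2 * 4 * units * (input_size + units + 1)


WORD_SIZE = 300      # word embedding size
CHAR_LSTM_SIZE = 50  # output size of the character LSTM
FEATURE_SIZE = 77    # width of the one-hot feature input
WORD_LSTM_UNITS = 100

# Features concatenated directly (feature_embedding_size = 0).
direct_bilstm = bilstm_param_count(
    WORD_SIZE + CHAR_LSTM_SIZE + FEATURE_SIZE, WORD_LSTM_UNITS
)

# Features passed through a Dense layer of size 50 first.
feature_embedding = dense_param_count(FEATURE_SIZE, 50)
embedded_bilstm = bilstm_param_count(
    WORD_SIZE + CHAR_LSTM_SIZE + 50, WORD_LSTM_UNITS
)
```

These give 422,400 parameters for the BiLSTM with directly concatenated features, versus 3,900 + 400,800 with the Dense layer, so the projection slightly reduces the total model size.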
I am currently running the evaluation on the last epoch trained, although in some cases the f1 score for that epoch was going down, so probably should use the one with the highest score.
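Picking the epoch with the highest score instead of the last one is a small change in principle. The sketch below assumes a list of per-epoch validation f1 scores is available; the function name and the scores are made up for illustration.

```python
def best_epoch(f1_by_epoch):
    """Return (epoch_index, f1) for the highest f1 score.

    On ties, the earliest epoch wins because max() keeps the first maximum.
    """
    index = max(range(len(f1_by_epoch)), key=lambda i: f1_by_epoch[i])
    return index, f1_by_epoch[index]


# Made-up validation scores where the last epoch is not the best one.
validation_f1 = [0.61, 0.70, 0.74, 0.73, 0.72]
epoch, score = best_epoch(validation_f1)
```

In a real training loop the model weights of that epoch would also need to be saved, for example by checkpointing whenever the score improves.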
{ "recurrent_dropout": 0.5, "max_sequence_length": 500, "embeddings_name": "glove.840B.300d", "batch_size": 10, "num_char_lstm_units": 25, "case_embedding_size": 5, "case_vocab_size": 8, "num_word_lstm_units": 100, "max_char_length": 30, "use_features": false, "model_name": "header", "char_vocab_size": 305, "feature_indices": [], "use_ELMo": false, "fold_number": 1, "feature_embedding_size": 0, "dropout": 0.5, "max_feature_size": 123581, "model_type": "CustomBidLSTM_CRF", "char_embedding_size": 25, "use_char_feature": true, "use_crf": true, "word_embedding_size": 300, "use_BERT": false }
Log from master-replica-0, 2020-01-20 12:46:11 +0000: 2055 train sequences, 229 validation sequences, 254 evaluation sequences.

Layer (type) | Output Shape | Param # | Connected to |
---|---|---|---|
char_input (InputLayer) | (None, None, 30) | 0 | |
char_embeddings (TimeDistribute | (None, None, 30, 25) | 7625 | char_input[0][0] |
word_input (InputLayer) | (None, None, 300) | 0 | |
char_lstm (TimeDistributed) | (None, None, 50) | 10200 | char_embeddings[0][0] |
concatenate_1 (Concatenate) | (None, None, 350) | 0 | word_input[0][0], char_lstm[0][0] |
dropout_1 (Dropout) | (None, None, 350) | 0 | concatenate_1[0][0] |
bidirectional_2 (Bidirectional) | (None, None, 200) | 360800 | dropout_1[0][0] |
dropout_2 (Dropout) | (None, None, 200) | 0 | bidirectional_2[0][0] |
dense_1 (Dense) | (None, None, 100) | 20100 | dropout_2[0][0] |
dense_ntags (Dense) | (None, None, 43) | 4343 | dense_1[0][0] |
chain_crf_1 (ChainCRF) | (None, None, 43) | 1935 | dense_ntags[0][0] |

Total params: 405,003. Trainable params: 405,003. Non-trainable params: 0.
Log from master-replica-0, 2020-01-20 20:35:05 +0000: training runtime: 28134.038 seconds.

Evaluation: f1 (micro): 67.72

label | precision | recall | f1-score | support |
---|---|---|---|---|
`<date>` | 0.7581 | 0.7015 | 0.7287 | 67 |
`<phone>` | 0.0000 | 0.0000 | 0.0000 | 3 |
`<email>` | 0.8020 | 0.8020 | 0.8020 | 101 |
`<abstract>` | 0.8224 | 0.8013 | 0.8117 | 156 |
`<pubnum>` | 0.4490 | 0.4583 | 0.4536 | 48 |
`<web>` | 0.5333 | 0.4444 | 0.4848 | 18 |
`<note>` | 0.3509 | 0.2353 | 0.2817 | 170 |
`<affiliation>` | 0.6944 | 0.6711 | 0.6826 | 298 |
`<dedication>` | 1.0000 | 1.0000 | 1.0000 | 1 |
`<copyright>` | 0.7419 | 0.7188 | 0.7302 | 32 |
`<author>` | 0.7642 | 0.7292 | 0.7463 | 240 |
`<address>` | 0.7983 | 0.7364 | 0.7661 | 258 |
`<title>` | 0.7733 | 0.6960 | 0.7326 | 250 |
`<submission>` | 0.8056 | 0.7838 | 0.7945 | 37 |
`submission>` | 0.0000 | 0.0000 | 0.0000 | 2 |
`<keyword>` | 0.9211 | 0.9211 | 0.9211 | 38 |
`<grant>` | 0.1250 | 0.1667 | 0.1429 | 6 |
`<degree>` | 0.7500 | 0.5000 | 0.6000 | 6 |
`<reference>` | 0.4688 | 0.3947 | 0.4286 | 76 |
`<intro>` | 0.3913 | 0.4286 | 0.4091 | 42 |
all (micro avg.) | 0.7066 | 0.6501 | 0.6772 | 1849 |
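The micro-averaged row above can be sanity-checked by hand: the f1 score is the harmonic mean of precision and recall. A minimal sketch, with a made-up function name, using the precision and recall reported in that row:

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Micro-averaged precision and recall from the evaluation above.
micro_f1 = f1_from_precision_recall(0.7066, 0.6501)
```

The result is approximately 0.677, in line with the reported f1 (micro) of 67.72.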
Evaluation: f1 (micro): 83.00

label | precision | recall | f1-score | support |
---|---|---|---|---|
`<author>` | 0.9412 | 0.9412 | 0.9412 | 34 |
`<address>` | 0.7097 | 0.6875 | 0.6984 | 32 |
`<grant>` | 0.5000 | 0.5000 | 0.5000 | 2 |
`<title>` | 0.9615 | 0.9615 | 0.9615 | 26 |
`<note>` | 0.5000 | 0.1667 | 0.2500 | 6 |
`<date>` | 1.0000 | 0.8571 | 0.9231 | 7 |
`<intro>` | 1.0000 | 1.0000 | 1.0000 | 3 |
`<keyword>` | 1.0000 | 1.0000 | 1.0000 | 2 |
`<email>` | 0.7917 | 0.7600 | 0.7755 | 25 |
`<affiliation>` | 0.7879 | 0.7647 | 0.7761 | 34 |
`<submission>` | 0.0000 | 0.0000 | 0.0000 | 1 |
`<web>` | 1.0000 | 1.0000 | 1.0000 | 3 |
`<phone>` | 1.0000 | 0.6667 | 0.8000 | 3 |
`<pubnum>` | 0.7500 | 0.7500 | 0.7500 | 4 |
`<abstract>` | 0.9545 | 0.9545 | 0.9545 | 22 |
all (micro avg.) | 0.8513 | 0.8137 | 0.8321 | 204 |
{ "embeddings_name": "glove.840B.300d", "recurrent_dropout": 0.5, "word_embedding_size": 300, "num_word_lstm_units": 100, "max_char_length": 30, "max_feature_size": 77, "case_vocab_size": 8, "fold_number": 1, "num_char_lstm_units": 25, "case_embedding_size": 5, "feature_embedding_size": 0, "use_crf": true, "char_vocab_size": 305, "model_name": "header", "char_embedding_size": 25, "max_sequence_length": 500, "use_BERT": false, "batch_size": 10, "use_char_feature": true, "dropout": 0.5, "model_type": "CustomBidLSTM_CRF", "use_ELMo": false, "feature_indices": [ 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30 ], "use_features": true }
Log from master-replica-0, 2020-01-20 12:46:07 +0000: 2055 train sequences, 229 validation sequences, 254 evaluation sequences.

Layer (type) | Output Shape | Param # | Connected to |
---|---|---|---|
char_input (InputLayer) | (None, None, 30) | 0 | |
char_embeddings (TimeDistribute | (None, None, 30, 25) | 7625 | char_input[0][0] |
word_input (InputLayer) | (None, None, 300) | 0 | |
char_lstm (TimeDistributed) | (None, None, 50) | 10200 | char_embeddings[0][0] |
features_input (InputLayer) | (None, None, 77) | 0 | |
concatenate_1 (Concatenate) | (None, None, 427) | 0 | word_input[0][0], char_lstm[0][0], features_input[0][0] |
dropout_1 (Dropout) | (None, None, 427) | 0 | concatenate_1[0][0] |
bidirectional_2 (Bidirectional) | (None, None, 200) | 422400 | dropout_1[0][0] |
dropout_2 (Dropout) | (None, None, 200) | 0 | bidirectional_2[0][0] |
dense_1 (Dense) | (None, None, 100) | 20100 | dropout_2[0][0] |
dense_ntags (Dense) | (None, None, 43) | 4343 | dense_1[0][0] |
chain_crf_1 (ChainCRF) | (None, None, 43) | 1935 | dense_ntags[0][0] |

Total params: 466,603. Trainable params: 466,603. Non-trainable params: 0.
Log from master-replica-0, 2020-01-20 20:19:14 +0000: training runtime: 27184.473 seconds.

Evaluation: f1 (micro): 75.51

label | precision | recall | f1-score | support |
---|---|---|---|---|
`<note>` | 0.4722 | 0.4000 | 0.4331 | 170 |
`<abstract>` | 0.8302 | 0.8462 | 0.8381 | 156 |
`<date>` | 0.8548 | 0.7910 | 0.8217 | 67 |
`<email>` | 0.8431 | 0.8515 | 0.8473 | 101 |
`<submission>` | 0.8108 | 0.8108 | 0.8108 | 37 |
`<phone>` | 0.0000 | 0.0000 | 0.0000 | 3 |
`<intro>` | 0.4318 | 0.4524 | 0.4419 | 42 |
`<pubnum>` | 0.7111 | 0.6667 | 0.6882 | 48 |
`<address>` | 0.8259 | 0.7907 | 0.8079 | 258 |
`<degree>` | 0.6000 | 0.5000 | 0.5455 | 6 |
`<dedication>` | 1.0000 | 1.0000 | 1.0000 | 1 |
`submission>` | 0.0000 | 0.0000 | 0.0000 | 2 |
`<grant>` | 0.2727 | 0.5000 | 0.3529 | 6 |
`<title>` | 0.8408 | 0.8240 | 0.8323 | 250 |
`<reference>` | 0.6212 | 0.5395 | 0.5775 | 76 |
`<web>` | 0.5556 | 0.5556 | 0.5556 | 18 |
`<copyright>` | 0.7273 | 0.7500 | 0.7385 | 32 |
`<keyword>` | 0.8537 | 0.9211 | 0.8861 | 38 |
`<affiliation>` | 0.7778 | 0.7517 | 0.7645 | 298 |
`<author>` | 0.8684 | 0.8250 | 0.8462 | 240 |
all (micro avg.) | 0.7704 | 0.7404 | 0.7551 | 1849 |
Evaluation: f1 (micro): 85.50

label | precision | recall | f1-score | support |
---|---|---|---|---|
`<email>` | 0.8261 | 0.7600 | 0.7917 | 25 |
`<phone>` | 1.0000 | 0.6667 | 0.8000 | 3 |
`<grant>` | 0.5000 | 0.5000 | 0.5000 | 2 |
`<author>` | 0.9697 | 0.9412 | 0.9552 | 34 |
`<keyword>` | 1.0000 | 1.0000 | 1.0000 | 2 |
`<note>` | 0.6667 | 0.3333 | 0.4444 | 6 |
`<date>` | 1.0000 | 1.0000 | 1.0000 | 7 |
`<title>` | 1.0000 | 1.0000 | 1.0000 | 26 |
`<affiliation>` | 0.7879 | 0.7647 | 0.7761 | 34 |
`<pubnum>` | 0.7500 | 0.7500 | 0.7500 | 4 |
`<address>` | 0.7419 | 0.7188 | 0.7302 | 32 |
`<intro>` | 1.0000 | 1.0000 | 1.0000 | 3 |
`<submission>` | 1.0000 | 1.0000 | 1.0000 | 1 |
`<web>` | 1.0000 | 1.0000 | 1.0000 | 3 |
`<abstract>` | 0.9545 | 0.9545 | 0.9545 | 22 |
all (micro avg.) | 0.8769 | 0.8382 | 0.8571 | 204 |
{ "num_char_lstm_units": 25, "use_crf": true, "max_sequence_length": 500, "word_embedding_size": 300, "batch_size": 10, "use_BERT": false, "case_embedding_size": 5, "fold_number": 1, "feature_indices": [ 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30 ], "max_feature_size": 77, "use_features": true, "use_ELMo": false, "use_char_feature": true, "dropout": 0.5, "embeddings_name": "glove.840B.300d", "num_word_lstm_units": 100, "model_type": "CustomBidLSTM_CRF", "model_name": "header", "max_char_length": 30, "recurrent_dropout": 0.5, "char_vocab_size": 305, "case_vocab_size": 8, "char_embedding_size": 25, "feature_embedding_size": 50 }
Log from master-replica-0, 2020-01-20 12:46:33 +0000: 2055 train sequences, 229 validation sequences, 254 evaluation sequences.

Layer (type) | Output Shape | Param # | Connected to |
---|---|---|---|
char_input (InputLayer) | (None, None, 30) | 0 | |
char_embeddings (TimeDistribute | (None, None, 30, 25) | 7625 | char_input[0][0] |
features_input (InputLayer) | (None, None, 77) | 0 | |
word_input (InputLayer) | (None, None, 300) | 0 | |
char_lstm (TimeDistributed) | (None, None, 50) | 10200 | char_embeddings[0][0] |
feature_embeddings (TimeDistrib | (None, None, 50) | 3900 | features_input[0][0] |
concatenate_1 (Concatenate) | (None, None, 400) | 0 | word_input[0][0], char_lstm[0][0], feature_embeddings[0][0] |
dropout_1 (Dropout) | (None, None, 400) | 0 | concatenate_1[0][0] |
bidirectional_2 (Bidirectional) | (None, None, 200) | 400800 | dropout_1[0][0] |
dropout_2 (Dropout) | (None, None, 200) | 0 | bidirectional_2[0][0] |
dense_1 (Dense) | (None, None, 100) | 20100 | dropout_2[0][0] |
dense_ntags (Dense) | (None, None, 43) | 4343 | dense_1[0][0] |
chain_crf_1 (ChainCRF) | (None, None, 43) | 1935 | dense_ntags[0][0] |

Total params: 448,903. Trainable params: 448,903. Non-trainable params: 0.
Log from master-replica-0, 2020-01-20 22:30:18 +0000: training runtime: 20800.898 seconds.

Evaluation: f1 (micro): 75.01

label | precision | recall | f1-score | support |
---|---|---|---|---|
`<pubnum>` | 0.6400 | 0.6667 | 0.6531 | 48 |
`<copyright>` | 0.7333 | 0.6875 | 0.7097 | 32 |
`<author>` | 0.8448 | 0.8167 | 0.8305 | 240 |
`<keyword>` | 0.8947 | 0.8947 | 0.8947 | 38 |
`<reference>` | 0.6029 | 0.5395 | 0.5694 | 76 |
`<degree>` | 0.4286 | 0.5000 | 0.4615 | 6 |
`<grant>` | 0.4444 | 0.6667 | 0.5333 | 6 |
`<email>` | 0.8367 | 0.8119 | 0.8241 | 101 |
`<affiliation>` | 0.7705 | 0.7550 | 0.7627 | 298 |
`<submission>` | 0.7895 | 0.8108 | 0.8000 | 37 |
`submission>` | 0.0000 | 0.0000 | 0.0000 | 2 |
`<phone>` | 0.0000 | 0.0000 | 0.0000 | 3 |
`<title>` | 0.8537 | 0.8400 | 0.8468 | 250 |
`<web>` | 0.4348 | 0.5556 | 0.4878 | 18 |
`<abstract>` | 0.8250 | 0.8462 | 0.8354 | 156 |
`<intro>` | 0.5000 | 0.4762 | 0.4878 | 42 |
`<address>` | 0.8105 | 0.7791 | 0.7945 | 258 |
`<dedication>` | 1.0000 | 1.0000 | 1.0000 | 1 |
`<note>` | 0.4853 | 0.3882 | 0.4314 | 170 |
`<date>` | 0.8387 | 0.7761 | 0.8062 | 67 |
all (micro avg.) | 0.7646 | 0.7361 | 0.7501 | 1849 |
Evaluation: f1 (micro): 85.00

label | precision | recall | f1-score | support |
---|---|---|---|---|
`<date>` | 0.8333 | 0.7143 | 0.7692 | 7 |
`<keyword>` | 1.0000 | 1.0000 | 1.0000 | 2 |
`<address>` | 0.7419 | 0.7188 | 0.7302 | 32 |
`<pubnum>` | 1.0000 | 0.7500 | 0.8571 | 4 |
`<email>` | 0.8261 | 0.7600 | 0.7917 | 25 |
`<web>` | 0.7500 | 1.0000 | 0.8571 | 3 |
`<abstract>` | 1.0000 | 1.0000 | 1.0000 | 22 |
`<grant>` | 0.5000 | 0.5000 | 0.5000 | 2 |
`<phone>` | 1.0000 | 0.6667 | 0.8000 | 3 |
`<note>` | 1.0000 | 0.5000 | 0.6667 | 6 |
`<author>` | 0.9412 | 0.9412 | 0.9412 | 34 |
`<affiliation>` | 0.7879 | 0.7647 | 0.7761 | 34 |
`<submission>` | 0.0000 | 0.0000 | 0.0000 | 1 |
`<title>` | 1.0000 | 1.0000 | 1.0000 | 26 |
`<intro>` | 1.0000 | 1.0000 | 1.0000 | 3 |
all (micro avg.) | 0.8718 | 0.8333 | 0.8521 | 204 |
{ "use_features": true, "char_vocab_size": 305, "num_word_lstm_units": 100, "char_embedding_size": 25, "use_BERT": false, "model_name": "header", "embeddings_name": "glove.840B.300d", "dropout": 0.5, "batch_size": 10, "word_embedding_size": 300, "max_feature_size": 77, "case_embedding_size": 5, "num_char_lstm_units": 25, "feature_embedding_size": 30, "recurrent_dropout": 0.5, "model_type": "CustomBidLSTM_CRF", "feature_indices": [ 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30 ], "case_vocab_size": 8, "use_ELMo": false, "fold_number": 1, "max_char_length": 30, "max_sequence_length": 500, "use_crf": true, "use_char_feature": true }
INFO 2020-01-20 12:47:06 +0000 master-replica-0 2055 train sequences
INFO 2020-01-20 12:47:06 +0000 master-replica-0 229 validation sequences
INFO 2020-01-20 12:47:06 +0000 master-replica-0 254 evaluation sequences
INFO 2020-01-20 12:47:06 +0000 master-replica-0 Layer (type)                      Output Shape          Param #   Connected to
INFO 2020-01-20 12:47:06 +0000 master-replica-0 char_input (InputLayer)           (None, None, 30)      0
INFO 2020-01-20 12:47:06 +0000 master-replica-0 char_embeddings (TimeDistribute   (None, None, 30, 25)  7625      char_input[0][0]
INFO 2020-01-20 12:47:06 +0000 master-replica-0 features_input (InputLayer)       (None, None, 77)      0
INFO 2020-01-20 12:47:06 +0000 master-replica-0 word_input (InputLayer)           (None, None, 300)     0
INFO 2020-01-20 12:47:06 +0000 master-replica-0 char_lstm (TimeDistributed)       (None, None, 50)      10200     char_embeddings[0][0]
INFO 2020-01-20 12:47:06 +0000 master-replica-0 feature_embeddings (TimeDistrib   (None, None, 30)      2340      features_input[0][0]
INFO 2020-01-20 12:47:06 +0000 master-replica-0 concatenate_1 (Concatenate)       (None, None, 380)     0         word_input[0][0]
INFO 2020-01-20 12:47:06 +0000 master-replica-0                                                                   char_lstm[0][0]
INFO 2020-01-20 12:47:06 +0000 master-replica-0                                                                   feature_embeddings[0][0]
INFO 2020-01-20 12:47:06 +0000 master-replica-0 dropout_1 (Dropout)               (None, None, 380)     0         concatenate_1[0][0]
INFO 2020-01-20 12:47:06 +0000 master-replica-0 bidirectional_2 (Bidirectional)   (None, None, 200)     384800    dropout_1[0][0]
INFO 2020-01-20 12:47:06 +0000 master-replica-0 dropout_2 (Dropout)               (None, None, 200)     0         bidirectional_2[0][0]
INFO 2020-01-20 12:47:06 +0000 master-replica-0 dense_1 (Dense)                   (None, None, 100)     20100     dropout_2[0][0]
INFO 2020-01-20 12:47:06 +0000 master-replica-0 dense_ntags (Dense)               (None, None, 43)      4343      dense_1[0][0]
INFO 2020-01-20 12:47:06 +0000 master-replica-0 chain_crf_1 (ChainCRF)            (None, None, 43)      1935      dense_ntags[0][0]
INFO 2020-01-20 12:47:06 +0000 master-replica-0 Total params: 431,343
INFO 2020-01-20 12:47:06 +0000 master-replica-0 Trainable params: 431,343
INFO 2020-01-20 12:47:06 +0000 master-replica-0 Non-trainable params: 0
INFO 2020-01-20 18:57:40 +0000 master-replica-0 training runtime: 22236.897 seconds
INFO 2020-01-20 18:57:40 +0000 master-replica-0 Evaluation:
INFO 2020-01-20 18:57:40 +0000 master-replica-0 f1 (micro): 75.27
INFO 2020-01-20 18:57:40 +0000 master-replica-0                   precision  recall     f1-score   support
INFO 2020-01-20 18:57:40 +0000 master-replica-0 <abstract>        0.8428     0.8590     0.8508     156
INFO 2020-01-20 18:57:40 +0000 master-replica-0 <note>            0.4437     0.3706     0.4038     170
INFO 2020-01-20 18:57:40 +0000 master-replica-0 <affiliation>     0.7864     0.7785     0.7825     298
INFO 2020-01-20 18:57:40 +0000 master-replica-0 <keyword>         0.9211     0.9211     0.9211     38
INFO 2020-01-20 18:57:40 +0000 master-replica-0 <address>         0.8086     0.8023     0.8054     258
INFO 2020-01-20 18:57:40 +0000 master-replica-0 <degree>          0.3333     0.3333     0.3333     6
INFO 2020-01-20 18:57:40 +0000 master-replica-0 <author>          0.8462     0.8250     0.8354     240
INFO 2020-01-20 18:57:40 +0000 master-replica-0 <date>            0.7681     0.7910     0.7794     67
INFO 2020-01-20 18:57:40 +0000 master-replica-0 <intro>           0.4468     0.5000     0.4719     42
INFO 2020-01-20 18:57:40 +0000 master-replica-0 <phone>           0.0000     0.0000     0.0000     3
INFO 2020-01-20 18:57:40 +0000 master-replica-0 <reference>       0.5672     0.5000     0.5315     76
INFO 2020-01-20 18:57:40 +0000 master-replica-0 <title>           0.8571     0.8400     0.8485     250
INFO 2020-01-20 18:57:40 +0000 master-replica-0 <submission>      0.7568     0.7568     0.7568     37
INFO 2020-01-20 18:57:40 +0000 master-replica-0 submission>       0.0000     0.0000     0.0000     2
INFO 2020-01-20 18:57:40 +0000 master-replica-0 <grant>           0.5000     0.6667     0.5714     6
INFO 2020-01-20 18:57:40 +0000 master-replica-0 <copyright>       0.8000     0.7500     0.7742     32
INFO 2020-01-20 18:57:40 +0000 master-replica-0 <email>           0.7900     0.7822     0.7861     101
INFO 2020-01-20 18:57:40 +0000 master-replica-0 <web>             0.5417     0.7222     0.6190     18
INFO 2020-01-20 18:57:40 +0000 master-replica-0 <pubnum>          0.7292     0.7292     0.7292     48
INFO 2020-01-20 18:57:40 +0000 master-replica-0 <dedication>      1.0000     1.0000     1.0000     1
INFO 2020-01-20 18:57:40 +0000 master-replica-0 all (micro avg.)  0.7612     0.7447     0.7529     1849
Evaluation:
f1 (micro): 83.21
                  precision  recall     f1-score   support
<submission>      1.0000     1.0000     1.0000     1
<abstract>        0.9545     0.9545     0.9545     22
<web>             0.7500     1.0000     0.8571     3
<grant>           0.5000     0.5000     0.5000     2
<phone>           0.5000     0.3333     0.4000     3
<address>         0.7188     0.7188     0.7188     32
<date>            0.8571     0.8571     0.8571     7
<pubnum>          1.0000     0.7500     0.8571     4
<author>          0.9697     0.9412     0.9552     34
<title>           0.9615     0.9615     0.9615     26
<keyword>         1.0000     1.0000     1.0000     2
<note>            1.0000     0.1667     0.2857     6
<intro>           1.0000     1.0000     1.0000     3
<affiliation>     0.7879     0.7647     0.7761     34
<email>           0.7826     0.7200     0.7500     25
all (micro avg.)  0.8557     0.8137     0.8342     204
{
  "char_embedding_size": 25,
  "fold_number": 1,
  "embeddings_name": "glove.840B.300d",
  "feature_indices": [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
  "max_char_length": 30,
  "use_crf": true,
  "batch_size": 10,
  "use_BERT": false,
  "max_sequence_length": 500,
  "model_type": "CustomBidLSTM_CRF",
  "recurrent_dropout": 0.5,
  "char_vocab_size": 305,
  "word_embedding_size": 300,
  "num_char_lstm_units": 25,
  "max_feature_size": 77,
  "model_name": "header",
  "dropout": 0.5,
  "case_embedding_size": 5,
  "num_word_lstm_units": 100,
  "use_char_feature": true,
  "use_features": true,
  "use_ELMo": false,
  "feature_embedding_size": 30,
  "case_vocab_size": 8
}
INFO 2020-01-20 12:47:34 +0000 master-replica-0 2055 train sequences
INFO 2020-01-20 12:47:34 +0000 master-replica-0 229 validation sequences
INFO 2020-01-20 12:47:34 +0000 master-replica-0 254 evaluation sequences
INFO 2020-01-20 12:47:34 +0000 master-replica-0 Layer (type)                      Output Shape          Param #   Connected to
INFO 2020-01-20 12:47:34 +0000 master-replica-0 char_input (InputLayer)           (None, None, 30)      0
INFO 2020-01-20 12:47:34 +0000 master-replica-0 char_embeddings (TimeDistribute   (None, None, 30, 25)  7625      char_input[0][0]
INFO 2020-01-20 12:47:34 +0000 master-replica-0 features_input (InputLayer)       (None, None, 77)      0
INFO 2020-01-20 12:47:34 +0000 master-replica-0 word_input (InputLayer)           (None, None, 300)     0
INFO 2020-01-20 12:47:34 +0000 master-replica-0 char_lstm (TimeDistributed)       (None, None, 50)      10200     char_embeddings[0][0]
INFO 2020-01-20 12:47:34 +0000 master-replica-0 feature_embeddings (TimeDistrib   (None, None, 30)      2340      features_input[0][0]
INFO 2020-01-20 12:47:34 +0000 master-replica-0 concatenate_1 (Concatenate)       (None, None, 380)     0         word_input[0][0]
INFO 2020-01-20 12:47:34 +0000 master-replica-0                                                                   char_lstm[0][0]
INFO 2020-01-20 12:47:34 +0000 master-replica-0                                                                   feature_embeddings[0][0]
INFO 2020-01-20 12:47:34 +0000 master-replica-0 dropout_1 (Dropout)               (None, None, 380)     0         concatenate_1[0][0]
INFO 2020-01-20 12:47:34 +0000 master-replica-0 bidirectional_2 (Bidirectional)   (None, None, 200)     384800    dropout_1[0][0]
INFO 2020-01-20 12:47:34 +0000 master-replica-0 dropout_2 (Dropout)               (None, None, 200)     0         bidirectional_2[0][0]
INFO 2020-01-20 12:47:34 +0000 master-replica-0 dense_1 (Dense)                   (None, None, 100)     20100     dropout_2[0][0]
INFO 2020-01-20 12:47:34 +0000 master-replica-0 dense_ntags (Dense)               (None, None, 43)      4343      dense_1[0][0]
INFO 2020-01-20 12:47:34 +0000 master-replica-0 chain_crf_1 (ChainCRF)            (None, None, 43)      1935      dense_ntags[0][0]
INFO 2020-01-20 12:47:34 +0000 master-replica-0 Total params: 431,343
INFO 2020-01-20 12:47:34 +0000 master-replica-0 Trainable params: 431,343
INFO 2020-01-20 12:47:34 +0000 master-replica-0 Non-trainable params: 0
INFO 2020-01-20 22:02:02 +0000 master-replica-0 training runtime: 14670.097 seconds
INFO 2020-01-20 22:02:02 +0000 master-replica-0 Evaluation:
INFO 2020-01-20 22:02:02 +0000 master-replica-0 f1 (micro): 73.29
INFO 2020-01-20 22:02:02 +0000 master-replica-0                   precision  recall     f1-score   support
INFO 2020-01-20 22:02:02 +0000 master-replica-0 <author>          0.8502     0.8042     0.8266     240
INFO 2020-01-20 22:02:02 +0000 master-replica-0 <email>           0.8061     0.7822     0.7940     101
INFO 2020-01-20 22:02:02 +0000 master-replica-0 <submission>      0.7941     0.7297     0.7606     37
INFO 2020-01-20 22:02:02 +0000 master-replica-0 <address>         0.7967     0.7442     0.7695     258
INFO 2020-01-20 22:02:02 +0000 master-replica-0 <note>            0.4367     0.4059     0.4207     170
INFO 2020-01-20 22:02:02 +0000 master-replica-0 submission>       0.0000     0.0000     0.0000     2
INFO 2020-01-20 22:02:02 +0000 master-replica-0 <grant>           0.4286     0.5000     0.4615     6
INFO 2020-01-20 22:02:02 +0000 master-replica-0 <copyright>       0.6667     0.6250     0.6452     32
INFO 2020-01-20 22:02:02 +0000 master-replica-0 <date>            0.7812     0.7463     0.7634     67
INFO 2020-01-20 22:02:02 +0000 master-replica-0 <keyword>         0.8718     0.8947     0.8831     38
INFO 2020-01-20 22:02:02 +0000 master-replica-0 <pubnum>          0.6522     0.6250     0.6383     48
INFO 2020-01-20 22:02:02 +0000 master-replica-0 <reference>       0.6780     0.5263     0.5926     76
INFO 2020-01-20 22:02:02 +0000 master-replica-0 <abstract>        0.8323     0.8590     0.8454     156
INFO 2020-01-20 22:02:02 +0000 master-replica-0 <dedication>      1.0000     1.0000     1.0000     1
INFO 2020-01-20 22:02:02 +0000 master-replica-0 <degree>          0.4000     0.3333     0.3636     6
INFO 2020-01-20 22:02:02 +0000 master-replica-0 <phone>           0.0000     0.0000     0.0000     3
INFO 2020-01-20 22:02:02 +0000 master-replica-0 <web>             0.4545     0.5556     0.5000     18
INFO 2020-01-20 22:02:02 +0000 master-replica-0 <intro>           0.5128     0.4762     0.4938     42
INFO 2020-01-20 22:02:02 +0000 master-replica-0 <title>           0.8436     0.8200     0.8316     250
INFO 2020-01-20 22:02:02 +0000 master-replica-0 <affiliation>     0.7599     0.7114     0.7348     298
INFO 2020-01-20 22:02:02 +0000 master-replica-0 all (micro avg.)  0.7527     0.7144     0.7331     1849
Evaluation:
f1 (micro): 82.91
                  precision  recall     f1-score   support
<web>             1.0000     1.0000     1.0000     3
<abstract>        0.9545     0.9545     0.9545     22
<submission>      1.0000     1.0000     1.0000     1
<note>            0.5000     0.1667     0.2500     6
<date>            0.8571     0.8571     0.8571     7
<address>         0.7188     0.7188     0.7188     32
<author>          0.9412     0.9412     0.9412     34
<email>           0.7619     0.6400     0.6957     25
<title>           0.9615     0.9615     0.9615     26
<intro>           1.0000     1.0000     1.0000     3
<keyword>         1.0000     1.0000     1.0000     2
<phone>           1.0000     0.6667     0.8000     3
<pubnum>          1.0000     0.7500     0.8571     4
<affiliation>     0.7879     0.7647     0.7761     34
<grant>           0.5000     0.5000     0.5000     2
all (micro avg.)  0.8549     0.8088     0.8312     204
{
  "embeddings_name": "glove.840B.300d",
  "word_embedding_size": 300,
  "char_vocab_size": 305,
  "feature_indices": [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
  "use_BERT": false,
  "model_name": "header",
  "use_crf": true,
  "num_char_lstm_units": 25,
  "case_vocab_size": 8,
  "case_embedding_size": 5,
  "batch_size": 10,
  "feature_embedding_size": 80,
  "char_embedding_size": 25,
  "num_word_lstm_units": 100,
  "fold_number": 1,
  "dropout": 0.5,
  "max_feature_size": 77,
  "model_type": "CustomBidLSTM_CRF",
  "use_char_feature": true,
  "max_sequence_length": 500,
  "recurrent_dropout": 0.5,
  "max_char_length": 30,
  "use_ELMo": false,
  "use_features": true
}
INFO 2020-01-21 10:38:39 +0000 master-replica-0 2055 train sequences
INFO 2020-01-21 10:38:39 +0000 master-replica-0 229 validation sequences
INFO 2020-01-21 10:38:39 +0000 master-replica-0 254 evaluation sequences
INFO 2020-01-21 10:38:39 +0000 master-replica-0 Layer (type)                      Output Shape          Param #   Connected to
INFO 2020-01-21 10:38:39 +0000 master-replica-0 char_input (InputLayer)           (None, None, 30)      0
INFO 2020-01-21 10:38:39 +0000 master-replica-0 char_embeddings (TimeDistribute   (None, None, 30, 25)  7625      char_input[0][0]
INFO 2020-01-21 10:38:39 +0000 master-replica-0 features_input (InputLayer)       (None, None, 77)      0
INFO 2020-01-21 10:38:39 +0000 master-replica-0 word_input (InputLayer)           (None, None, 300)     0
INFO 2020-01-21 10:38:39 +0000 master-replica-0 char_lstm (TimeDistributed)       (None, None, 50)      10200     char_embeddings[0][0]
INFO 2020-01-21 10:38:39 +0000 master-replica-0 feature_embeddings (TimeDistrib   (None, None, 80)      6240      features_input[0][0]
INFO 2020-01-21 10:38:39 +0000 master-replica-0 concatenate_1 (Concatenate)       (None, None, 430)     0         word_input[0][0]
INFO 2020-01-21 10:38:39 +0000 master-replica-0                                                                   char_lstm[0][0]
INFO 2020-01-21 10:38:39 +0000 master-replica-0                                                                   feature_embeddings[0][0]
INFO 2020-01-21 10:38:39 +0000 master-replica-0 dropout_1 (Dropout)               (None, None, 430)     0         concatenate_1[0][0]
INFO 2020-01-21 10:38:39 +0000 master-replica-0 bidirectional_2 (Bidirectional)   (None, None, 200)     424800    dropout_1[0][0]
INFO 2020-01-21 10:38:39 +0000 master-replica-0 dropout_2 (Dropout)               (None, None, 200)     0         bidirectional_2[0][0]
INFO 2020-01-21 10:38:39 +0000 master-replica-0 dense_1 (Dense)                   (None, None, 100)     20100     dropout_2[0][0]
INFO 2020-01-21 10:38:39 +0000 master-replica-0 dense_ntags (Dense)               (None, None, 43)      4343      dense_1[0][0]
INFO 2020-01-21 10:38:39 +0000 master-replica-0 chain_crf_1 (ChainCRF)            (None, None, 43)      1935      dense_ntags[0][0]
INFO 2020-01-21 10:38:39 +0000 master-replica-0 Total params: 475,243
INFO 2020-01-21 10:38:39 +0000 master-replica-0 Trainable params: 475,243
INFO 2020-01-21 10:38:39 +0000 master-replica-0 Non-trainable params: 0
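As a sanity check on the model summaries above, the parameter counts can be reproduced from the config values with the usual Keras formulas. The sketch below is my own reading of the architecture, not code from DeLFT: I am assuming that `char_lstm` is a bidirectional LSTM (25 units per direction, hence the output size of 50), that `feature_embeddings` is a dense layer with bias over the 77-wide feature input, and that the ChainCRF holds a tag transition matrix plus two boundary vectors. Those assumptions happen to match the reported numbers.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    # Embedding lookup table, no bias.
    return vocab_size * dim


def dense_params(n_in: int, n_out: int) -> int:
    # Weight matrix plus bias vector.
    return n_in * n_out + n_out


def bilstm_params(n_in: int, units: int) -> int:
    # Two directions, four gates each: input weights, recurrent weights, bias.
    return 2 * 4 * (units * (n_in + units) + units)


def chain_crf_params(n_tags: int) -> int:
    # Assumption: transition matrix plus left and right boundary vectors.
    return n_tags * n_tags + 2 * n_tags


def header_model_params(feature_embedding_size: int) -> int:
    """Total parameters for the header model, using values from the configs above."""
    char_embeddings = embedding_params(305, 25)                # char_vocab_size x char_embedding_size
    char_lstm = bilstm_params(25, 25)                          # num_char_lstm_units = 25
    feature_embeddings = dense_params(77, feature_embedding_size)  # max_feature_size = 77
    concat_size = 300 + 2 * 25 + feature_embedding_size        # word + char + feature inputs
    word_lstm = bilstm_params(concat_size, 100)                # num_word_lstm_units = 100
    dense_1 = dense_params(2 * 100, 100)
    dense_ntags = dense_params(100, 43)
    crf = chain_crf_params(43)
    return (char_embeddings + char_lstm + feature_embeddings
            + word_lstm + dense_1 + dense_ntags + crf)


for size in (30, 80):
    print(size, header_model_params(size))
```

So going from `feature_embedding_size` 30 to 80 adds parameters in two places only: the feature projection itself and the input weights of the word-level BiLSTM.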
INFO 2020-01-21 16:50:30 +0000 master-replica-0 training runtime: 22309.499 seconds
INFO 2020-01-21 16:50:30 +0000 master-replica-0 Evaluation:
INFO 2020-01-21 16:50:30 +0000 master-replica-0 f1 (micro): 75.57
INFO 2020-01-21 16:50:30 +0000 master-replica-0                   precision  recall     f1-score   support
INFO 2020-01-21 16:50:30 +0000 master-replica-0 <copyright>       0.7273     0.7500     0.7385     32
INFO 2020-01-21 16:50:30 +0000 master-replica-0 <intro>           0.4898     0.5714     0.5275     42
INFO 2020-01-21 16:50:30 +0000 master-replica-0 <phone>           0.0000     0.0000     0.0000     3
INFO 2020-01-21 16:50:30 +0000 master-replica-0 <pubnum>          0.7500     0.6875     0.7174     48
INFO 2020-01-21 16:50:30 +0000 master-replica-0 <author>          0.8504     0.8292     0.8397     240
INFO 2020-01-21 16:50:30 +0000 master-replica-0 <abstract>        0.8758     0.8590     0.8673     156
INFO 2020-01-21 16:50:30 +0000 master-replica-0 <dedication>      1.0000     1.0000     1.0000     1
INFO 2020-01-21 16:50:30 +0000 master-replica-0 <degree>          0.5000     0.3333     0.4000     6
INFO 2020-01-21 16:50:30 +0000 master-replica-0 <note>            0.4080     0.4176     0.4128     170
INFO 2020-01-21 16:50:30 +0000 master-replica-0 <affiliation>     0.7819     0.7819     0.7819     298
INFO 2020-01-21 16:50:30 +0000 master-replica-0 <grant>           0.5000     0.6667     0.5714     6
INFO 2020-01-21 16:50:30 +0000 master-replica-0 submission>       0.0000     0.0000     0.0000     2
INFO 2020-01-21 16:50:30 +0000 master-replica-0 <reference>       0.6364     0.5526     0.5915     76
INFO 2020-01-21 16:50:30 +0000 master-replica-0 <title>           0.8150     0.8280     0.8214     250
INFO 2020-01-21 16:50:30 +0000 master-replica-0 <email>           0.8125     0.7723     0.7919     101
INFO 2020-01-21 16:50:30 +0000 master-replica-0 <date>            0.7910     0.7910     0.7910     67
INFO 2020-01-21 16:50:30 +0000 master-replica-0 <keyword>         0.8974     0.9211     0.9091     38
INFO 2020-01-21 16:50:30 +0000 master-replica-0 <address>         0.8440     0.8178     0.8307     258
INFO 2020-01-21 16:50:30 +0000 master-replica-0 <web>             0.5263     0.5556     0.5405     18
INFO 2020-01-21 16:50:30 +0000 master-replica-0 <submission>      0.7778     0.7568     0.7671     37
INFO 2020-01-21 16:50:30 +0000 master-replica-0 all (micro avg.)  0.7603     0.7512     0.7557     1849
Evaluation:
f1 (micro): 85.00
                  precision  recall     f1-score   support
<date>            0.8571     0.8571     0.8571     7
<pubnum>          1.0000     0.7500     0.8571     4
<abstract>        0.9545     0.9545     0.9545     22
<email>           0.8636     0.7600     0.8085     25
<intro>           1.0000     1.0000     1.0000     3
<submission>      1.0000     1.0000     1.0000     1
<phone>           1.0000     0.6667     0.8000     3
<address>         0.6970     0.7188     0.7077     32
<keyword>         1.0000     1.0000     1.0000     2
<author>          0.9697     0.9412     0.9552     34
<web>             1.0000     1.0000     1.0000     3
<affiliation>     0.7879     0.7647     0.7761     34
<title>           1.0000     1.0000     1.0000     26
<grant>           0.5000     0.5000     0.5000     2
<note>            0.6667     0.3333     0.4444     6
all (micro avg.)  0.8718     0.8333     0.8521     204
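For anyone comparing the runs, the f1-score in each "all (micro avg.)" row is just the harmonic mean of that row's precision and recall, so it can be recomputed from the logs. A small sketch follows, using the four micro-average rows with support 1849 copied from the training-job logs above; the run labels are only the log timestamps and the variable names are mine.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (log timestamp, precision, recall, reported f1) from the micro avg. rows above.
micro_avg_rows = [
    ("2020-01-20 22:30:18", 0.7646, 0.7361, 0.7501),
    ("2020-01-20 18:57:40", 0.7612, 0.7447, 0.7529),
    ("2020-01-20 22:02:02", 0.7527, 0.7144, 0.7331),
    ("2020-01-21 16:50:30", 0.7603, 0.7512, 0.7557),
]

recomputed = {run: f1_score(p, r) for run, p, r, _ in micro_avg_rows}
for run, p, r, reported in micro_avg_rows:
    # Inputs are rounded to four decimals, so allow a small tolerance.
    status = "ok" if abs(recomputed[run] - reported) < 1e-4 else "mismatch"
    print(f"{run}  recomputed={recomputed[run]:.4f}  reported={reported:.4f}  {status}")

# Spread between the best and worst reported micro F1 across these runs.
reported_values = [row[3] for row in micro_avg_rows]
spread_points = (max(reported_values) - min(reported_values)) * 100
print(f"spread: {spread_points:.2f} points")
```

Because the logged precision and recall are already rounded, the recomputed value can differ from the logged one in the last digit, which is why the check uses a tolerance rather than exact equality.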
@de-code thanks! There is a 2% gain in certain cases... mmm interesting
Yes, I am not sure how good the header test set is though. It seems relatively small.
Since I have the models and all of the checkpoints saved (and the logs), I could run the evaluation again on a different test set; I just need it in the DeLFT format.
Hi @kermitt2
This is something you are already well aware of, but I thought it would be good to have an issue to record the discussion around it. I am not sure whether you have already experimented with adding layout features.
I've started doing it and implemented something here: https://github.com/elifesciences/sciencebeam-trainer-delft/pull/16
Maybe you'll find some of it useful. (I don't want to flood you with too many PRs)