Segmentation Model - Githubissues

Hi @uneetsingh

By default, the segmentation model runs with CRF using custom features. This model works at the line level, not at the token level like the other models.

You can train and use the segmentation model with an RNN using DeLFT (see https://github.com/kermitt2/grobid/issues/964), but for the moment it is less accurate than CRF. I didn't upload this model to DeLFT, but you can retrain it with:

python3 delft/applications/grobidTagger.py segmentation train_eval --architecture BidLSTM_CRF_FEATURES --input data/sequenceLabelling/grobid/segmentation/segmentation-110322.train

Four years ago, I created a branch with Grobid supporting docx as input; see https://github.com/kermitt2/grobid/pull/515

It simply used Apache POI to parse docx files and convert them to PDF. This conversion gave poor results (lots of docx parsing failures), so the docx -> pdf -> grobid route is not promising. I wanted to try docx4j, but I lost interest in the topic :)

I think a docx -> xml converter that keeps some layout information would be the best way to support docx. The segmentation model would then need to be revisited/retrained to support this new input and the new layout features.
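
To make the docx -> xml idea a bit more concrete, here is a rough sketch using only the Python standard library. It is not Grobid code and not what the branch above did. The function name `docx_to_layout_xml` and the output element and attribute names (`document`, `line`, `style`, `align`, `bold`) are made up for illustration, and it treats each docx paragraph as one "line", which is only an approximation since a docx file stores no rendered line breaks.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def docx_to_layout_xml(docx_file):
    """Hypothetical converter: one <line> element per non-empty docx paragraph.

    `docx_file` is a path or a binary file-like object. The output schema is
    an assumption for illustration, not a format Grobid understands.
    """
    # A .docx file is a zip archive; the main text lives in word/document.xml
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    out = ET.Element("document")
    for para in root.iter(f"{{{W_NS}}}p"):
        # Concatenate the text of all runs in the paragraph
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        if not text.strip():
            continue  # skip empty paragraphs
        line = ET.SubElement(out, "line")
        line.text = text

        # Keep a few layout hints that could later become model features
        style = para.find("w:pPr/w:pStyle", NS)
        if style is not None:
            line.set("style", style.get(f"{{{W_NS}}}val", ""))
        align = para.find("w:pPr/w:jc", NS)
        if align is not None:
            line.set("align", align.get(f"{{{W_NS}}}val", ""))
        # Simplification: any <w:b> in a run counts as bold,
        # even though <w:b w:val="false"/> would switch bold off
        if para.find("w:r/w:rPr/w:b", NS) is not None:
            line.set("bold", "true")
    return out
```

Calling `ET.tostring(docx_to_layout_xml("paper.docx"), encoding="unicode")` (with `paper.docx` as a placeholder path) would give an XML string with one element per paragraph. Positions, fonts and page breaks are not handled here, and those are probably the layout features that would matter most for retraining the segmentation model.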

kermitt2 / grobid

Segmentation Model #1042