kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Segmentation Model #1042

Open uneetsingh opened 1 year ago

uneetsingh commented 1 year ago

Hi, thank you for building such a good open-source product.

For my use case, I am looking for the architecture details and training process of the segmentation model. DeLFT has models for the other cases (header, citation, etc.) but not for segmentation.

My use case is building a solution for docx. The docx -> pdf -> grobid route wasn't promising because of the limitations of pdfalto and other OCR tools.

If you can share or point me to the documentation for the segmentation model, that would be very helpful.

kermitt2 commented 1 year ago

Hi @uneetsingh

By default, the segmentation model runs with CRF using custom features. This model works at the line level, not at the token level like the others.

You can also train and use the segmentation model with an RNN via DeLFT (see https://github.com/kermitt2/grobid/issues/964), but for the moment it is less accurate than CRF. I didn't upload this model to DeLFT, but you can retrain it with:
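
To make "line level" concrete, here is a minimal sketch of what one feature vector per line could look like. The feature names and the `line_features` / `document_features` helpers are hypothetical and chosen for illustration only; they are not GROBID's actual feature set, which also relies on layout information from the PDF.

```python
def line_features(line, index, total):
    """Illustrative features for one text line (hypothetical, not GROBID's real features)."""
    tokens = line.split()
    first = tokens[0] if tokens else ""
    digits = sum(ch.isdigit() for ch in line)
    return {
        "first_token": first.lower(),
        "n_tokens": len(tokens),
        "starts_upper": first[:1].isupper(),
        "all_upper": line.isupper(),
        # Share of digit characters, e.g. high for page numbers or reference lists.
        "digit_ratio": round(digits / len(line), 2) if line else 0.0,
        # 0.0 for the first non-empty line, 1.0 for the last one.
        "relative_position": round(index / max(total - 1, 1), 2),
    }


def document_features(text):
    """Return one feature dict per non-empty line, i.e. one sequence item per line."""
    lines = [l for l in text.splitlines() if l.strip()]
    return [line_features(l, i, len(lines)) for i, l in enumerate(lines)]
```

A sequence labeller would then assign one label per line (for example header, body, or references) instead of one label per token.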

python3 delft/applications/grobidTagger.py segmentation train_eval --architecture BidLSTM_CRF_FEATURES --input data/sequenceLabelling/grobid/segmentation/segmentation-110322.train

Four years ago, I created a branch with Grobid supporting docx as input, see https://github.com/kermitt2/grobid/pull/515

It simply used Apache POI to parse docx files and convert them to PDF. This conversion gave poor results (lots of docx parsing failures), so indeed the docx -> pdf -> grobid route is not promising. I wanted to try docx4j, but I lost interest in the topic :)

A kind of docx -> xml converter that keeps some layout information would be the best way to support docx, I think. The segmentation model would then need to be revisited and retrained to support this new input and its new layout features.
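
As a rough illustration of such a converter, the sketch below reads `word/document.xml` directly from the docx archive using only the Python standard library and emits one `<block>` element per non-empty paragraph with a few layout attributes. The output schema (`document`, `block`, `style`, `align`, `bold`, `size_pt`) and the function name `docx_to_layout_xml` are assumptions made up for this example; this is not an existing Grobid input format, and it only inspects the first run of each paragraph.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used inside word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def docx_to_layout_xml(path_or_file):
    """Convert a .docx into a small XML tree keeping some layout hints (hypothetical schema)."""
    with zipfile.ZipFile(path_or_file) as zf:
        body = ET.fromstring(zf.read("word/document.xml"))

    root = ET.Element("document")
    for p in body.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        if not text.strip():
            continue  # skip empty paragraphs
        style = p.find("w:pPr/w:pStyle", NS)
        align = p.find("w:pPr/w:jc", NS)
        # Simplification: only the first run is checked, and <w:b w:val="0"/> is not handled.
        bold = p.find("w:r/w:rPr/w:b", NS) is not None
        size = p.find("w:r/w:rPr/w:sz", NS)  # w:sz is expressed in half-points
        block = ET.SubElement(root, "block", {
            "style": style.get(f"{{{W}}}val", "") if style is not None else "",
            "align": align.get(f"{{{W}}}val", "") if align is not None else "",
            "bold": "true" if bold else "false",
            "size_pt": str(int(size.get(f"{{{W}}}val")) / 2) if size is not None else "",
        })
        block.text = text
    return root
```

Calling `ET.tostring(docx_to_layout_xml("paper.docx"), encoding="unicode")` (with `paper.docx` as a placeholder path) would give a serialized form from which line- or block-level layout features could be derived for retraining.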