kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Remove batch mode #184

Open kermitt2 opened 7 years ago

kermitt2 commented 7 years ago

Batch mode appears to be used quite a lot; however, it is far less efficient and flexible than the service mode, which can automatically exploit multithreading and offers more options.

Unfortunately, some users select the batch mode to evaluate the tool and report benchmarks, or even worse, relaunch the batch command for each file, although the tool is not designed at all for a cold start per PDF (a cold start means reloading all the models and lexical resources each time, which can take 15-20s). The benefit of fast processing (under a second to extract a header from a PDF, and ~4s for complete full text structuring, which is for instance faster than PDFBox simply parsing the PDF) is then lost.

The idea would be to replace the batch mode with a bash script or a node.js process that would start the service mode, and then recursively send, in parallel, the PDF files present in directories, writing back the TEI extractions. This would offer the simplicity of the command line and the performance of the service mode.
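To make the idea concrete, here is a minimal sketch in Python (rather than bash or node.js) of the client part only: it walks a directory recursively, sends PDFs in parallel to a service that is assumed to be already running, and writes the TEI results into a mirrored output tree. The URL (default port 8070, `processFulltextDocument` endpoint, `input` form field), the helper names `post_pdf` and `process_tree`, and the `.tei.xml` output suffix are assumptions for illustration, not part of any existing GROBID client.

```python
import uuid
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumption: a GROBID service is already running locally on its default port.
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"


def post_pdf(pdf_path, url=GROBID_URL, timeout=120):
    """POST one PDF as multipart/form-data (field 'input') and return the response text."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="input"; filename="{pdf_path.name}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_path.read_bytes(),
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def process_tree(input_dir, output_dir, send=post_pdf, workers=4):
    """Send every *.pdf under input_dir in parallel; mirror the tree under output_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    pdfs = sorted(input_dir.rglob("*.pdf"))

    def handle(pdf):
        # Hypothetical naming scheme: paper.pdf -> paper.tei.xml
        target = (output_dir / pdf.relative_to(input_dir)).with_suffix(".tei.xml")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(send(pdf), encoding="utf-8")
        return target

    # Threads are enough here: the client mostly waits on the service.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle, pdfs))
```

Calling `process_tree("pdfs/", "tei/", workers=8)` would then process a whole directory against one warm service instance, so the models are loaded only once. The `send` parameter is there so the upload step can be swapped out or stubbed.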

lfoppiano commented 7 years ago

I have a similar solution, maybe simpler than having to deal with node.js/bash. We could just turn the "batch engine" into a simple client, a wrapper that calls the service from the command line. This way the client will require the server to be running. We would then ensure, before any processing happens, that the user is not reloading models or doing any other sort of dirty stuff.

If the service is down the client would just remind the user to start it ;-)
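A rough sketch of that "remind the user" behaviour, written in Python purely for illustration (the actual client would more likely live in the Java code base). It assumes the service exposes a health check at `/api/isalive` on the default port 8070; `service_is_up` and `main` are hypothetical names.

```python
import sys
import urllib.error
import urllib.request

# Assumption: default local GROBID service address; adjust for your setup.
ISALIVE_URL = "http://localhost:8070/api/isalive"


def service_is_up(url=ISALIVE_URL, timeout=3, opener=urllib.request.urlopen):
    """Return True if the service answers the health check with HTTP 200."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: treat all as "service down".
        return False


def main(argv, is_up=service_is_up):
    """Return a process exit code: 1 if the service is down, 0 otherwise."""
    if not is_up():
        print("GROBID service is not running; please start it first.", file=sys.stderr)
        return 1
    # The PDF paths in argv would be handed over to the service at this point.
    return 0
```

A command-line entry point would simply call `sys.exit(main(sys.argv[1:]))`. The client itself never loads models, so a cold start per file cannot happen by construction.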

kermitt2 commented 7 years ago

Ah yes, this would avoid the portability issues of bash and having to install node.js, which is often a pain.

jjlee commented 5 years ago

Hi, very useful looking software! (have not tried it yet because of batch mode having been removed -- by the way, the docs still mention it, should I raise an issue about that?)

I can't speak for other people but the reasons I would like some form of batch mode are:

  1. I don't want the program in memory or running when I'm not using it (which is almost all the time)
  2. I'm fine with it taking some time to load up, because I would only be running it occasionally, maybe from cron

So:

  a. Your original suggestion of a script that starts and stops the server would solve the problem for me
  b. The issues you mention with batch mode are not a problem for me
  c. @lfoppiano's suggested solution addresses a different problem than the one I have

jjlee commented 5 years ago

To clarify: ideal for me would be: a command line tool that deals with the starting then stopping, and the talking to the web service for a batch of files, without an install step separate from unpacking the grobid build .zip file. Er, and doesn't use docker, because I still haven't figured out how one does firewalling together with docker without me screwing it up :-)

jjlee commented 5 years ago

To clarify again: by no separate install step I mean no step that has to install files outside of the unpacked .zip directory: I'm fine with things getting installed in that directory, because the problem with installs for me is the pain of uninstalling them again (so some equivalent of ./configure's --prefix argument would also work for me, because I can use that with GNU stow).

entitled-ly y'rs ;-)

kermitt2 commented 5 years ago

Hello @jjlee !

Batch mode has not been removed and the doc is in sync (otherwise I guess this issue would have been closed!). Just try the tool.

In any case, if we remove the batch mode, there will be a similar command line usage, transparent, with no real change for the end-user. The service will start in the background, do the stuff and shut down. It's just that for us it's less code to maintain, and the service is much more efficient, so it's a bit of a pity to see people using the batch mode so much if they are concerned about performance.
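The "start in background, do the stuff, shut down" lifecycle could look roughly like the Python sketch below. This is only an illustration under assumptions, not the planned implementation: `running_service` is a hypothetical helper, and the launch command and readiness check are passed in by the caller rather than hard-coded.

```python
import subprocess
import time
from contextlib import contextmanager


@contextmanager
def running_service(command, is_ready, startup_timeout=120.0, poll_interval=1.0):
    """Start the service, wait until is_ready() is true, and stop it on exit."""
    process = subprocess.Popen(command)
    try:
        deadline = time.monotonic() + startup_timeout
        while not is_ready():
            if process.poll() is not None:
                raise RuntimeError("service exited during startup")
            if time.monotonic() > deadline:
                raise TimeoutError("service did not become ready in time")
            time.sleep(poll_interval)
        yield process
    finally:
        # Always stop the service, even if the processing step failed.
        process.terminate()
        try:
            process.wait(timeout=30)
        except subprocess.TimeoutExpired:
            process.kill()
            process.wait()
```

A wrapper command would then run the whole batch inside `with running_service(["./gradlew", "run"], is_ready=...):`, where the launch command and the readiness probe are assumptions to adapt to the actual setup. One caveat: terminating a launcher process does not necessarily stop child processes it spawned, so a real tool may need extra care there.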

There is no zip unpacking involved in GROBID for standard usage. There is no system install, it's not C/C++, and uninstalling is just deleting the grobid directory :)