kermitt2 / grobid_client_python

Python client for GROBID Web services
Apache License 2.0

GROBID server does not appear up and running, the connection to the server failed #76

Closed: zlh-source closed this issue 3 months ago

zlh-source commented 3 months ago

Hello, I'm trying to run "example.py" from grobid_client_python on Linux.

The error reported is: GROBID server does not appear up and running, the connection to the server failed.

lfoppiano commented 3 months ago

@zlh-source are you running a GROBID service yourself, or pointing the client at one that is already running? As stated in the documentation:

You need first a running grobid service, latest stable version, see the [documentation](http://grobid.readthedocs.io/) for installation. By default, it is assumed that the server will run on the address http://localhost:8070. You can change the server address by editing the file config.json, see below.
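
To check whether the client can reach the server at all, a minimal sketch along these lines may help. It assumes the GROBID service exposes its health check at `/api/isalive` and uses the default address quoted above; the helper names `isalive_url` and `is_grobid_alive` are hypothetical and not part of grobid_client_python.

```python
import urllib.error
import urllib.request


def isalive_url(server: str) -> str:
    """Build the health-check URL for a GROBID server base address."""
    return server.rstrip("/") + "/api/isalive"


def is_grobid_alive(server: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if the server answers the health check with HTTP 200.

    Any connection problem (refused, timed out, DNS failure, HTTP error)
    is reported as False instead of raising.
    """
    try:
        with urllib.request.urlopen(isalive_url(server), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    if is_grobid_alive():
        print("GROBID is reachable")
    else:
        print("GROBID is not reachable; start the service or fix the address in config.json")
```

If this reports that the server is not reachable, the problem is most likely the service or the configured address rather than the client itself.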
zlh-source commented 3 months ago

Thank you for your reply. I have it running successfully. However, I found two problems:

  1. GROBID seems to only support English? I tried to analyze PDFs in Chinese, but the results were terrible.
  2. GROBID seems to work well only on academic papers in standard formats. Results are very poor for some PDFs even though their layout is highly standardized. For example, I tried to extract papers with line numbers (blind-review versions of ACL conference papers) and found that only about 1/4 of the content was correctly identified.

lfoppiano commented 3 months ago

Hi @zlh-source:

  1. Grobid mainly supports English, but also French and German, and in general it works for other European languages. Chinese, Korean, and Japanese characters are supported, but there is no training data for these languages, so poor results are expected (see https://github.com/kermitt2/grobid/issues/1049#issuecomment-1712554109).
  2. Grobid mainly supports scholarly scientific papers and patents. Articles with line numbers are also supported (in general, but there are always exceptions 😅). However, if you send the whole proceedings, it may well not be fully or correctly extracted. If you can share some example documents, that would help us understand the specific issue, if any; and if they are CC-licensed, we can collect them and use them at some point to add more training data.

zlh-source commented 3 months ago

Thank you for your reply. Below are my PDF and the results.

error case.zip

lfoppiano commented 3 months ago

@zlh-source I finally found time to check this example. The line numbers cause some trouble: pdfalto normally handles them, but not always. In addition, the article has a two-column layout. From a quick look, the line numbers in the first column are handled well, but those in the second column are not.

The segmentation of the document itself is fine: the header ends where the introduction begins. For multi-column papers, however, extraction can only succeed when the reading order of the text is preserved. In this case the columns are merged (e.g. part of the introduction from page one, column 2 ends up in the abstract), so there is nothing we can do on the Grobid side.
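
To illustrate why reading order matters, here is a toy sketch. It is not how pdfalto or Grobid work internally; the `Block` type, both functions, and the coordinates are made up for illustration. It only shows how ordering text blocks purely top to bottom interleaves two columns, while a column-aware ordering keeps each column together.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    text: str
    x: float  # left edge of the block on the page
    y: float  # top edge of the block, growing downward


def naive_order(blocks: List[Block]) -> List[str]:
    """Order blocks top to bottom, then left to right (merges columns)."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def column_order(blocks: List[Block], page_width: float) -> List[str]:
    """Read the whole left column first, then the whole right column."""
    middle = page_width / 2
    left = sorted((b for b in blocks if b.x < middle), key=lambda b: b.y)
    right = sorted((b for b in blocks if b.x >= middle), key=lambda b: b.y)
    return [b.text for b in left + right]


# Hypothetical two-column page: abstract on the left, introduction on the right.
page = [
    Block("Abstract line 1", 50, 100),
    Block("Intro line 3", 320, 100),
    Block("Abstract line 2", 50, 120),
    Block("Intro line 4", 320, 120),
]

print(naive_order(page))
# ['Abstract line 1', 'Intro line 3', 'Abstract line 2', 'Intro line 4']
print(column_order(page, page_width=600))
# ['Abstract line 1', 'Abstract line 2', 'Intro line 3', 'Intro line 4']
```

In the first ordering, introduction text lands in the middle of the abstract, which is the kind of mix-up seen in this error case.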

I think this is a good error case for pdfalto. I've opened an issue there.