kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Let Rest return images #887

Closed Jacob-Jan closed 2 years ago

Jacob-Jan commented 2 years ago

Hello,

I would like to host the Grobid service on a domain, call the API to process full text, and retrieve the images. Is it possible to get the images returned in the response, perhaps as base64? Locally the images are saved to PdfAssetPath, but that's not really useful when the service is hosted somewhere else. How would I go about this?

Thank you in advance!

Jacob

kermitt2 commented 2 years ago

Hi @Jacob-Jan

Yes, it's possible with the REST service too: processFulltextAssetDocument returns a ZIP containing the TEI and the images. See https://grobid.readthedocs.io/en/latest/Frequently-asked-questions/#i-would-also-like-to-extract-images-from-pdfs
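For readers who want the images as base64, as asked in the original question, the ZIP returned by processFulltextAssetDocument can be unpacked in memory on the client side. The following is a minimal Python sketch, not part of GROBID: the function names build_multipart, fetch_asset_zip, and unpack_assets are hypothetical, the form field name input and the localhost:8070 URL are taken from the curl commands in this thread, and it assumes the archive holds TEI as an .xml entry with every other entry being an image.

```python
import base64
import io
import os
import urllib.request
import uuid
import zipfile


def build_multipart(field, filename, payload, boundary):
    """Encode one file as a multipart/form-data body (what curl --form sends)."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail


def fetch_asset_zip(pdf_path,
                    url="http://localhost:8070/api/processFulltextAssetDocument"):
    """POST a PDF to the service and return the raw ZIP bytes of the response."""
    boundary = uuid.uuid4().hex
    with open(pdf_path, "rb") as handle:
        body = build_multipart("input", os.path.basename(pdf_path),
                               handle.read(), boundary)
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def unpack_assets(zip_bytes):
    """Return (tei_xml, {entry_name: base64_string}) without touching the disk."""
    tei_xml = ""
    images = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            data = archive.read(name)
            if name.lower().endswith(".xml"):
                # Assumption: .xml entries are the TEI output.
                tei_xml = data.decode("utf-8")
            else:
                # Assumption: every remaining entry is an image asset.
                images[name] = base64.b64encode(data).decode("ascii")
    return tei_xml, images
```

A call such as `tei, images = unpack_assets(fetch_asset_zip("paper.pdf"))` (with a hypothetical paper.pdf and a running service) would then give the TEI text and the base64 strings keyed by file name.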

Jacob-Jan commented 2 years ago

Awesome! I totally overlooked that. Thanks!

Jacob-Jan commented 2 years ago

Hello @kermitt2,

I can't get it to work. I use exactly the same request as for processFulltextDocument (which works), but processFulltextAssetDocument returns status code 500: [BAD_INPUT_DATA] PDF to XML conversion failed with error code: 1.

As far as I can see in the Java code it doesn't require different input. What am I missing?

kermitt2 commented 2 years ago

Hi @Jacob-Jan

I haven't tested it in a while (my plan is to replace it with something better I am working on, hence the "deprecated" label on this service), but it still seems to work on my side:

lopez@work:~/$ curl --form input=@/home/lopez/Downloads/AHMT-15946-life-course-origins-of-the-ages-of-menarche-and-menopause_011713.pdf localhost:8070/api/processFulltextAssetDocument > out.zip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2000k    0  632k  100 1368k   139k   302k  0:00:04  0:00:04 --:--:--  442k
lopez@work:~/$ unzip -l out.zip 
Archive:  /home/lopez/out.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
   301851  2022-01-24 12:30   tei.xml
   107985  2022-01-24 12:30   image-2.png
   495645  2022-01-24 12:30   image-1.png
---------                     -------
   905481                     3 files

I am working on Ubuntu.

Could you try running the pdfalto command directly? The command is slightly different when we want to dump the embedded images, because that adds a bit more runtime to the PDF parsing process. Something like this:

/home/lopez/grobid/grobid-home/pdfalto/lin-64/pdfalto_server -fullFontName -noLineNumbers -annotation  -filesLimit 2000 /home/lopez/Downloads/AHMT-15946-life-course-origins-of-the-ages-of-menarche-and-menopause_011713.pdf out.xml

Images should be under out.xml_data/ (including vector images converted to SVG).
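To check what the pdfalto run actually dumped, the contents of that directory can be listed by file type. This is a small Python sketch with a hypothetical helper name, list_dumped_assets; it only assumes the naming stated above, namely that the assets land in a directory named after the XML output with _data appended.

```python
from collections import defaultdict
from pathlib import Path


def list_dumped_assets(xml_output):
    """Group the files in '<xml_output>_data' by lowercase file extension."""
    data_dir = Path(str(xml_output) + "_data")
    if not data_dir.is_dir():
        # Nothing was dumped, or pdfalto failed before writing assets.
        return {}
    by_extension = defaultdict(list)
    for entry in sorted(data_dir.iterdir()):
        if entry.is_file():
            by_extension[entry.suffix.lower()].append(entry.name)
    return dict(by_extension)
```

Calling `list_dumped_assets("out.xml")` after the command above returns an empty dict when no out.xml_data/ directory exists, which is a quick way to tell a failed conversion from a PDF that simply contains no images.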

Jacob-Jan commented 2 years ago

Okay, I figured it out. I mainly work on Windows, where my curl command looked like this: curl --form input=@C:\Users\mosse\Downloads\cyp042.pdf http://mydomain:8070/api/processFulltextAssetDocument

On Linux I got it to work with forward slashes, and I assumed that on Windows I had to use backslashes. Apparently I had to use forward slashes there too. The strange thing is that Windows did not complain about being unable to locate the file, so I assumed the path was correct, which was not the case.
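When the curl call is assembled by a script, the path can be normalized to forward slashes before it is handed over. The sketch below uses Python's standard pathlib; the helper names to_forward_slashes and curl_form_arg are hypothetical, and the thread does not establish why the backslash form failed, so this only reproduces the fix described above.

```python
from pathlib import PureWindowsPath


def to_forward_slashes(windows_path):
    """Rewrite a Windows path with forward slashes, e.g. C:/Users/..."""
    return PureWindowsPath(windows_path).as_posix()


def curl_form_arg(windows_path, field="input"):
    """Build the value passed to curl --form, using the forward-slash path."""
    return f"{field}=@{to_forward_slashes(windows_path)}"
```

For the path from this thread, `curl_form_arg(r"C:\Users\mosse\Downloads\cyp042.pdf")` yields `input=@C:/Users/mosse/Downloads/cyp042.pdf`; a path that already uses forward slashes passes through unchanged.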

Thanks for your help!