Closed Jacob-Jan closed 2 years ago
Hi @Jacob-Jan
Yes, it's possible with the REST service too: processFulltextAssetDocument returns a ZIP with the TEI and the images, see https://grobid.readthedocs.io/en/latest/Frequently-asked-questions/#i-would-also-like-to-extract-images-from-pdfs
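For anyone calling the service from code rather than curl, here is a minimal Python sketch of the same idea: post the PDF to processFulltextAssetDocument, then unpack the returned ZIP in memory and base64-encode the images. The function names, the default base URL, and the rule "anything ending in .xml is the TEI, everything else is an image" are assumptions for illustration; the form field name `input` is taken from the curl commands in this thread.

```python
import base64
import io
import urllib.request
import uuid
import zipfile


def fetch_asset_zip(pdf_path, base_url="http://localhost:8070"):
    """POST a PDF as multipart form field 'input' and return the raw ZIP bytes."""
    boundary = uuid.uuid4().hex
    with open(pdf_path, "rb") as handle:
        pdf_bytes = handle.read()
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="input"; filename="input.pdf"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    request = urllib.request.Request(
        f"{base_url}/api/processFulltextAssetDocument",
        data=head + pdf_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def unpack_asset_zip(zip_bytes):
    """Split the ZIP payload into TEI text and base64-encoded images.

    Assumption: entries ending in .xml hold the TEI, all other files are images.
    """
    result = {"tei": None, "images": {}}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            data = archive.read(name)
            if name.lower().endswith(".xml"):
                result["tei"] = data.decode("utf-8")
            else:
                result["images"][name] = base64.b64encode(data).decode("ascii")
    return result
```

Something like `unpack_asset_zip(fetch_asset_zip("paper.pdf"))` would then give a dict that can be serialised to JSON, with nothing written to disk on the client side.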
Awesome! I totally overlooked that. Thanks!
Hello @kermitt2,
I can't get it to work. I use exactly the same request as for processFulltextDocument (which works), but processFulltextAssetDocument returns status code 500: [BAD_INPUT_DATA] PDF to XML conversion failed with error code: 1.
As far as I can see in the Java code it doesn't require different input. What am I missing?
Hi @Jacob-Jan
I haven't tested this in a while (my plan is to replace it with something better that I am working on, hence the "deprecated" label on this service), but it still seems to work on my side:
lopez@work:~/$ curl --form input=@/home/lopez/Downloads/AHMT-15946-life-course-origins-of-the-ages-of-menarche-and-menopause_011713.pdf localhost:8070/api/processFulltextAssetDocument > out.zip
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2000k 0 632k 100 1368k 139k 302k 0:00:04 0:00:04 --:--:-- 442k
lopez@work:~/$ unzip -l out.zip
Archive: /home/lopez/out.zip
Length Date Time Name
--------- ---------- ----- ----
301851 2022-01-24 12:30 tei.xml
107985 2022-01-24 12:30 image-2.png
495645 2022-01-24 12:30 image-1.png
--------- -------
905481 3 files
I am working on Ubuntu.
Can you maybe try the pdfalto command directly? The command is slightly different when we want a dump of the embedded images, because that adds a bit more runtime to the PDF parsing process. Something like this:
/home/lopez/grobid/grobid-home/pdfalto/lin-64/pdfalto_server -fullFontName -noLineNumbers -annotation -filesLimit 2000 /home/lopez/Downloads/AHMT-15946-life-course-origins-of-the-ages-of-menarche-and-menopause_011713.pdf out.xml
Images should be under out.xml_data/ (including vector images converted to SVG).
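If it helps to script this check, here is a small Python sketch that assembles the same pdfalto command line and lists whatever ends up in the `_data` directory next to the output XML. The flags are copied from the command above; the helper names are hypothetical, and the `<output>_data` directory name is taken from this thread rather than from the pdfalto documentation.

```python
from pathlib import Path


def pdfalto_command(pdfalto_bin, pdf_path, out_xml):
    """Return the argument list for the pdfalto call quoted in this thread."""
    return [
        str(pdfalto_bin),
        "-fullFontName",
        "-noLineNumbers",
        "-annotation",
        "-filesLimit",
        "2000",
        str(pdf_path),
        str(out_xml),
    ]


def list_dumped_assets(out_xml):
    """List file names under '<out_xml>_data', or [] if the directory is missing."""
    asset_dir = Path(str(out_xml) + "_data")
    if not asset_dir.is_dir():
        return []
    return sorted(p.name for p in asset_dir.iterdir() if p.is_file())
```

The list from `pdfalto_command(...)` can be passed to `subprocess.run(..., check=True)`; an empty result from `list_dumped_assets("out.xml")` afterwards would suggest the problem is in the PDF parsing step and not in the REST layer.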
Okay, I figured it out. I mainly work on Windows, where my curl command looked like this: curl --form input=@C:\Users\mosse\Downloads\cyp042.pdf http://mydomain:8070/api/processFulltextAssetDocument
On Linux I got it to work with forward slashes; on Windows I assumed I had to use backslashes, but apparently I have to use forward slashes there too. The strange thing is that Windows did not complain about being unable to locate the file, so I assumed the path was correct, which it was not.
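For scripted calls, a minimal Python sketch along these lines would avoid the same trap: it checks that the file exists before anything is sent and rewrites the path with forward slashes for curl's `--form` option. The helper names and the `out.zip` output name are hypothetical; that forward slashes are what the service call needs is taken from the observation above, not from the curl documentation.

```python
from pathlib import Path, PureWindowsPath


def curl_form_argument(pdf_path, field="input"):
    """Build the --form value, converting any backslashes to forward slashes."""
    return f"{field}=@{PureWindowsPath(pdf_path).as_posix()}"


def checked_curl_command(pdf_path, url):
    """Return a curl argument list, failing early if the PDF is not on disk."""
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")
    return [
        "curl",
        "--form",
        curl_form_argument(pdf_path),
        url,
        "--output",
        "out.zip",
    ]
```

Passing the list to `subprocess.run` instead of building one shell string also keeps the shell from reinterpreting the path.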
Thanks for your help!
Hello,
I would like to host the Grobid service on a domain and call the API to process fulltext and retrieve the images. Is it possible to get the images returned from the request? As base64 maybe? Locally it is saving the images to PdfAssetPath, but that's not really useful if hosted somewhere else. How would I go about it?
Thank you in advance!
Jacob