Open mariadelmarq opened 3 months ago
Hi @kermitt2 , any idea what might be the issue here? The README says that Elsevier should be supported, but I get the same output for all Elsevier papers:
<?xml version="1.0" encoding="UTF-8"?>
Hi @mariadelmarq, indeed there is an issue with some of the Elsevier files.
I tested pub2tei with these examples; however, comments within those files say "normalized for easier text mining", and I don't know exactly what that means.
@mariadelmarq could you check that those XMLs follow the same schema as the ones you have (which, I suppose, are not shareable)?
I wonder whether the Elsevier format has changed over time? 🤔
If the output is completely empty, it can be a sign of a namespace issue. Maybe you could upload a sample.
The XSL supports Elsevier when delivered in batch/archives (like for ISTEX). If I remember well, when the XML is obtained via their web API, an additional XML envelope is added around the same XML, with an additional namespace (namespaces and schemas accumulate like an onion :) ), and I think this is breaking the transformation at the start.
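One way to check for such an envelope is to look at the root element of a file and at its direct children. The sketch below uses only the Python standard library; the function names and the example namespaces are hypothetical and are not taken from Elsevier's schemas or from Pub2TEI.

```python
import xml.etree.ElementTree as ET


def split_tag(tag):
    """Split an ElementTree tag of the form "{namespace}local" into its parts."""
    if tag.startswith("{"):
        namespace, local_name = tag[1:].split("}", 1)
        return namespace, local_name
    return None, tag


def describe_root(xml_data):
    """Return (namespace, local_name) of the root element of an XML document."""
    return split_tag(ET.fromstring(xml_data).tag)


def describe_children(xml_data):
    """Return (namespace, local_name) for each direct child of the root element."""
    return [split_tag(child.tag) for child in ET.fromstring(xml_data)]


if __name__ == "__main__":
    # Hypothetical document: an API wrapper element around the actual article.
    sample = (
        '<wrapper xmlns="http://example.org/api">'
        '<article xmlns="http://example.org/article"/>'
        "</wrapper>"
    )
    print(describe_root(sample))
    print(describe_children(sample))
```

If a file from the API and a file from a batch delivery report different root elements or namespaces, with the batch root showing up as a child in the API version, that would be consistent with the extra envelope described above. For a file on disk, pass its bytes, for example `describe_root(open(path, "rb").read())`.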
@lfoppiano: I'll send through some examples by email.
@kermitt2 : yes, we are downloading the XML via the API, so that would very much explain it. How can I properly check for it, and is there an easy fix?
Thanks to @laurentromary for quickly fixing the XSLTs to support these files.
I've deployed a snapshot on huggingface. Please have a look and test it with a larger batch.
curl --location 'https://lfoppiano-pub2tei-dev.hf.space/service/processXML' --form 'input=@"path_to_xml_file"'
Wow, thanks, everyone! Apologies for my ignorance, would you be able to provide a little bit more detail on how to test the snapshot on my files? I normally use the docker container for pub2tei:
docker run --rm --gpus all --init --ulimit core=0 -p 8060:8060 grobid/pub2tei:0.2
and I then use the python client to run it on all xml files in a local folder:
cd Pub2TEI/client
python3 pub2tei_client.py --input <input_folder> --output <output_folder>
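When running the client on a whole folder, it can help to list which outputs came back empty, like the declaration-only output reported at the top of this thread. The following is a minimal sketch using the standard library; the `*.xml` pattern for output file names and the function names are assumptions, not part of the Pub2TEI client.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def is_empty_output(path):
    """Return True if the file has no root element, or a root with no children and no text."""
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        # A file holding only the XML declaration (or nothing) has no root element.
        return True
    return len(root) == 0 and not (root.text or "").strip()


def find_empty_outputs(output_folder, pattern="*.xml"):
    """Return the sorted names of files in output_folder that look empty.

    The "*.xml" pattern is an assumption about how the output files are named.
    """
    return sorted(
        p.name for p in Path(output_folder).glob(pattern) if is_empty_output(p)
    )
```

Calling `find_empty_outputs("<output_folder>")` after a run returns the names of the files that would need a closer look. Note that a file that fails to parse for any other reason is also reported as empty by this sketch.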
Also, when I follow that huggingface link I get this:
To use the huggingface service, you have to modify the file config.json and replace http://localhost:8060 with https://lfoppiano-pub2tei-dev.hf.space. Alternatively, you could run a different docker image locally: lfoppiano/pub2tei:latest-develop-fix_elsevier, which is going to have the update :-)
The response on huggingface (it's the same when you visit localhost:8060) is normal; it's just that the application does not have a page that responds on /.
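The config.json change mentioned above could also be scripted. This sketch does a plain text replacement of the server URL, so it makes no assumption about the key names inside config.json; the function name and the default path in the usage note are hypothetical.

```python
from pathlib import Path

LOCAL_URL = "http://localhost:8060"
REMOTE_URL = "https://lfoppiano-pub2tei-dev.hf.space"


def point_config_at(config_path, old_url=LOCAL_URL, new_url=REMOTE_URL):
    """Replace old_url with new_url in the config file.

    Returns the number of occurrences replaced; the file is only rewritten
    when at least one occurrence is found.
    """
    path = Path(config_path)
    text = path.read_text(encoding="utf-8")
    count = text.count(old_url)
    if count:
        path.write_text(text.replace(old_url, new_url), encoding="utf-8")
    return count
```

For example, `point_config_at("config.json")` run from the client folder would switch the client to the huggingface snapshot, and swapping the two URL arguments would switch it back. A return value of 0 means the local URL was not found and the file was left untouched.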
@lfoppiano : I finally got around to testing this using the different docker image, it seems to be working fine!
The remaining issue for us with Elsevier journals is that Pub2TEI misses the "conflict-of-interest" tag, e.g.:
Hi again,
I'm getting empty tei xml outputs from GROBID for all of our Elsevier xml files, wondering if anybody could take a look?