kermitt2 / Pub2TEI

Service for converting and enhancing heterogeneous publisher XML formats into TEI
Apache License 2.0

Empty tei output for Elsevier xml files #17

Open mariadelmarq opened 3 months ago

mariadelmarq commented 3 months ago

Hi again,

I'm getting empty TEI XML outputs from GROBID for all of our Elsevier XML files. Could anybody take a look?

mariadelmarq commented 1 month ago

Hi @kermitt2 , any idea what might be the issue here? The README says that Elsevier should be supported, but I get the same output for all Elsevier papers:

<?xml version="1.0" encoding="UTF-8"?>

lfoppiano commented 1 month ago

Hi @mariadelmarq, indeed there is an issue with some of the Elsevier files.

I tested pub2tei with these examples; however, comments within those files say "normalized for easier text mining", and I don't know exactly what that means.

@mariadelmarq could you check that those XMLs follow the same schema as the ones you have (which, I suppose, are not shareable)?

I wonder whether the Elsevier format has changed over time? 🤔

laurentromary commented 1 month ago

If the output is completely empty, it can be a sign of a namespace issue. Maybe you could upload a sample.

kermitt2 commented 1 month ago

The XSLT stylesheets support Elsevier XML when it is delivered in batches/archives (as for ISTEX). If I remember correctly, when the XML is obtained via their web API, an additional XML envelope is wrapped around the same XML, with an additional namespace (namespaces and schemas accumulate like an onion :) ), and I think this breaks the transformation at the start.

mariadelmarq commented 1 month ago

@lfoppiano: I'll send through some examples by email.

@kermitt2 : yes, we are downloading the XML via the API, so that would very much explain it. How can I properly check for it, and is there an easy fix?
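One way to check for such an envelope is to look at the root element of a downloaded file and at its direct children. The sketch below is only an illustration using Python's standard library. The sample documents, element names, and namespace URIs are invented and are not the real Elsevier ones. If the root of an API-delivered file is a wrapper element in one namespace and the article sits one level down in another, that matches the situation described above.

```python
import xml.etree.ElementTree as ET


def split_tag(tag):
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag


def describe_root(xml_data):
    """Return the root's (namespace, name) and those of its direct children."""
    root = ET.fromstring(xml_data)
    return split_tag(root.tag), [split_tag(child.tag) for child in root]


# Hypothetical samples: element names and namespace URIs are invented.
BARE = b'<article xmlns="http://example.org/article"><head/></article>'
WRAPPED = (
    b'<response xmlns="http://example.org/envelope">'
    b'<article xmlns="http://example.org/article"><head/></article>'
    b'</response>'
)

if __name__ == "__main__":
    for label, sample in (("bare", BARE), ("wrapped", WRAPPED)):
        print(label, describe_root(sample))
```

For a real file, passing the bytes read from disk to `describe_root` and comparing the result for an archive-delivered file with an API-delivered one should show whether an extra wrapper is present.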

lfoppiano commented 1 month ago

Thanks to @laurentromary for quickly fixing the XSLTs to support these files.

I've deployed a snapshot on huggingface. Please have a look and test it with a larger batch.

curl --location 'https://lfoppiano-pub2tei-dev.hf.space/service/processXML' --form 'input=@"path_to_xml_file"'

mariadelmarq commented 1 month ago

Wow, thanks, everyone! Apologies for my ignorance, but would you be able to provide a little more detail on how to test the snapshot on my files? I normally use the Docker container for pub2tei:

docker run --rm --gpus all --init --ulimit core=0 -p 8060:8060 grobid/pub2tei:0.2

and I then use the Python client to run it on all XML files in a local folder:

cd Pub2TEI/client
python3 pub2tei_client.py --input <input_folder> --output <output_folder>

Also, when I follow that huggingface link I get this:

image

lfoppiano commented 1 month ago

To use the huggingface service, you have to modify the file config.json and replace http://localhost:8060 with https://lfoppiano-pub2tei-dev.hf.space. Alternatively, you could run a different Docker image locally, lfoppiano/pub2tei:latest-develop-fix_elsevier, which has the update :-)
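The config.json change can also be scripted. The following is a minimal sketch that does a plain text substitution, so it makes no assumption about the key names inside the file. The helper name `switch_server` is hypothetical, and the location of config.json (presumably next to the client script) is an assumption.

```python
from pathlib import Path


def switch_server(config_path, old_url, new_url):
    """Replace every occurrence of old_url with new_url in the file.

    Returns the number of occurrences replaced; the file is left
    untouched when old_url does not appear in it.
    """
    path = Path(config_path)
    text = path.read_text(encoding="utf-8")
    count = text.count(old_url)
    if count:
        path.write_text(text.replace(old_url, new_url), encoding="utf-8")
    return count


# Example call (path is an assumption, adjust to where config.json lives):
# switch_server("config.json",
#               "http://localhost:8060",
#               "https://lfoppiano-pub2tei-dev.hf.space")
```

Calling it again with the two URLs swapped restores the local setting.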

The response on huggingface (it's the same when you visit localhost:8060) is normal; it's just that the application does not serve a page at the root path /.

mariadelmarq commented 1 week ago

@lfoppiano : I finally got around to testing this using the different Docker image, and it seems to be working fine!

The remaining issue for us with Elsevier journals is that Pub2TEI misses the "conflict-of-interest" tag, e.g.: image