softcite / tutorials

Tutorials for the Softcite tools

commandline vs in-notebook #1

Closed jameshowison closed 1 year ago

jameshowison commented 1 year ago

I'm working through the tutorial (thanks!) and I had a question or two.

This part

java -jar Samples/saxon9he.jar -s:/media/lopez/data/allofplos -xsl:Stylesheets/Publishers.xsl -o:/media/lopez/data/allofplos/tei -dtd:off -a:off -expand:off -t --parserFeature?uri=http%3A//apache.org/xml/features/nonvalidating/load-external-dtd:false 

If using the nlm2tei.py approach (i.e. inside a notebook or script), do we need to specify all those command line options within the config file? Or are those defaults?

actually looks at the code

Ah, there they are on https://github.com/kermitt2/article_dataset_builder/blob/master/article_dataset_builder/nlm2tei.py:86 so we don't need to specify them, just the path to

It doesn't have the --parserFeature?uri=http%3A//apache.org/xml/features/nonvalidating/load-external-dtd:false option, although I think that's taken care of by the empty .dtd files created as a workaround (discussed in the README for Pub2TEI and at https://github.com/kermitt2/Pub2TEI/issues/3). If that option is working, maybe it should be added to the nlm2tei.py file too?

kermitt2 commented 1 year ago

Ah, good catch, indeed it needs the --parserFeature?uri=http%3A//apache.org/xml/features/nonvalidating/load-external-dtd:false, I forgot to update this file. The empty DTD files should ensure it works without downloading the DTDs online for the moment, but if new JATS DTD files are introduced in the PMC XML files, that might not be enough, so the extra --parserFeature should be added.

I'll fix it in the article_dataset_builder package and update pypi with a new version.

kermitt2 commented 1 year ago

Note that everything is fine in the tutorial: there we work directly from the archive of JATS files from PLOS, without needing article_dataset_builder, so I only introduce the command line.

kermitt2 commented 1 year ago

So article-dataset-builder is updated to version 0.2.5, which now includes --parserFeature?uri=http%3A//apache.org/xml/features/nonvalidating/load-external-dtd:false in the "command line" and upgrades lmdb to version 1.4.1.

I also upgraded lmdb to 1.4.1 in software-mentions-client==0.1.8

Sorry for the mess :)

jameshowison commented 1 year ago

No mess at all, just what happens when code is used :)

This will help because I was running into some issues with a slightly different dtd (JATS-archivearticle1-3-mathml3.dtd) which wasn't dummy declared. I will update and see!
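To see which DTD names a set of downloaded files actually refers to, something like the following might help. It is only a sketch: the function name `referenced_dtds` and the `*.nxml` pattern are assumptions made for illustration, and the regular expression only handles double-quoted identifiers in the DOCTYPE declaration.

```python
import re
from collections import Counter
from pathlib import Path

# Captures the system identifier (the DTD file name or URL) of a DOCTYPE
# declaration. Only double-quoted identifiers are handled in this sketch.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+\S+\s+(?:PUBLIC\s+"[^"]*"|SYSTEM)\s+"([^"]+)"'
)


def referenced_dtds(data_dir, pattern="*.nxml"):
    """Count the DTDs named in DOCTYPE declarations under data_dir.

    The function name and the default pattern are hypothetical.
    """
    counts = Counter()
    for path in Path(data_dir).rglob(pattern):
        # The DOCTYPE sits near the top of the file, so the head is enough.
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            head = handle.read(4096)
        match = DOCTYPE_RE.search(head)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Comparing the returned names with the empty .dtd files that are already in place should show which ones, such as JATS-archivearticle1-3-mathml3.dtd, have no dummy declaration yet.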

jameshowison commented 1 year ago

Hmmm, I updated both packages via pip install ... --upgrade and that seemed to work, as confirmed by the output of:

from importlib.metadata import version
version('article_dataset_builder')
# ==> '0.2.5'

I'm still seeing errors caused by missing mathml dtd:

[Screenshot: Screen Shot 2023-07-13 at 12 57 10 PM]

That said, I don't know how to find out if they end up mattering or not :). Should I be using the following?

harvester.diagnostic(full=True)

jameshowison commented 1 year ago

Yeah, I'm seeing 886 transformations failed. There are a few "premature end" errors but also lots referring to the JATS-archivearticle1-3-mathml3.dtd file missing.

kermitt2 commented 1 year ago

You can check whether the file with extension .pub2tei.tei.xml was produced, with something like this:

ls ./data/00/3e/f5/48/003ef548-e101-4991-a4e2-f23804aaa8a2.pub2tei.tei.xml

kermitt2 commented 1 year ago

> Yeah, I'm seeing 886 transformations failed.

Ah, that's bad... The cause is not the "premature end" errors, which are to be expected with the empty DTD files. So it's the missing JATS-archivearticle1-3-mathml3.dtd :(

I was really expecting that the --parserFeature?uri=http%3A//apache.org/xml/features/nonvalidating/load-external-dtd:false would prevent checking this file. This is annoying.
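The per-file check with ls can also be run over the whole data directory. A minimal sketch, assuming that each source file ends in .nxml and that its TEI output is written next to it with the .pub2tei.tei.xml extension; the function name `missing_tei` and both suffix defaults are assumptions to adjust to the actual layout:

```python
from pathlib import Path


def missing_tei(data_dir, source_suffix=".nxml", tei_suffix=".pub2tei.tei.xml"):
    """Return the source files under data_dir that have no sibling TEI file.

    Hypothetical helper: it assumes the TEI output sits in the same
    directory as the source file and shares its base name.
    """
    missing = []
    for source in sorted(Path(data_dir).rglob("*" + source_suffix)):
        base_name = source.name[: -len(source_suffix)]
        expected = source.with_name(base_name + tei_suffix)
        if not expected.exists():
            missing.append(source)
    return missing
```

For example, `len(missing_tei("./data"))` gives the number of entries without a TEI file, and the returned paths can be passed one at a time to the command line to look at the error for each.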

kermitt2 commented 1 year ago

@jameshowison what happens if you directly use the command line on the JATS file?

java -jar Samples/saxon9he.jar -s:/home/jlh5498/data/00/3e/f5/48/003ef548-e101-4991-a4e2-f23804aaa8a2.nlm -xsl:Stylesheets/Publishers.xsl -o:/home/jlh5498/data/00/3e/f5/48/003ef548-e101-4991-a4e2-f23804aaa8a2.pub2tei.tei.xml -dtd:off -a:off -expand:off -t --parserFeature?uri=http%3A//apache.org/xml/features/nonvalidating/load-external-dtd:false

jameshowison commented 1 year ago

Good thought! After adjusting the path and filename just a little it seems to work without that error.

(notebook) jlh5498@1a3e5700d2d9:~/Pub2TEI$ java -jar Samples/saxon9he.jar -s:/home/jlh5498/data/00/3e/f5/48/003ef548-e101-4991-a4e2-f23804aaa8a2/003ef548-e101-4991-a4e2-f23804aaa8a2.nxml  -xsl:Stylesheets/Publishers.xsl -o:/home/jlh5498/data/00/3e/f5/48/003ef548-e101-4991-a4e2-f23804aaa8a2/003ef548-e101-4991-a4e2-f23804aaa8a2.pub2tei.tei.xml -dtd:off -a:off -expand:off -t --parserFeature?uri=http%3A//apache.org/xml/features/nonvalidating/load-external-dtd:false 
Saxon-HE 9.9.0.2J from Saxonica
Java version 11.0.19
URIResolver.resolve href="Imports.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Publishers.xsl"
URIResolver.resolve href="Default.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Imports.xsl"
URIResolver.resolve href="NameComponents.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Imports.xsl"
URIResolver.resolve href="JournalComponents.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Imports.xsl"
URIResolver.resolve href="BookComponents.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Imports.xsl"
URIResolver.resolve href="NamesDatesPlaces.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Imports.xsl"
URIResolver.resolve href="KeywordsAbstract.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Imports.xsl"
URIResolver.resolve href="FullTextTags.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Imports.xsl"
URIResolver.resolve href="ISOifiers.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Imports.xsl"
URIResolver.resolve href="Organisations.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Imports.xsl"
URIResolver.resolve href="Bibliography.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Imports.xsl"
URIResolver.resolve href="Tables.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Imports.xsl"
URIResolver.resolve href="Figures.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Imports.xsl"
URIResolver.resolve href="BMJ.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Publishers.xsl"
URIResolver.resolve href="EDPSArticle.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Publishers.xsl"
URIResolver.resolve href="EDPSedp-article.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Publishers.xsl"
URIResolver.resolve href="ScholarOne.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Publishers.xsl"
URIResolver.resolve href="NLM2TEI-article.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Publishers.xsl"
URIResolver.resolve href="Elsevier.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Publishers.xsl"
URIResolver.resolve href="ElsevierFormula.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Elsevier.xsl"
URIResolver.resolve href="Nature.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Publishers.xsl"
URIResolver.resolve href="ArticleSetNLMV2.0.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Publishers.xsl"
URIResolver.resolve href="Sage.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Publishers.xsl"
URIResolver.resolve href="IOP.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Publishers.xsl"
URIResolver.resolve href="SpringerCommon.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Publishers.xsl"
URIResolver.resolve href="SpringerStage2.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Publishers.xsl"
URIResolver.resolve href="SpringerStage3.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Publishers.xsl"
URIResolver.resolve href="SpringerBookChapter.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Publishers.xsl"
URIResolver.resolve href="RoyalChemicalSociety.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Publishers.xsl"
URIResolver.resolve href="Wiley.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Publishers.xsl"
URIResolver.resolve href="BookChapter.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Publishers.xsl"
URIResolver.resolve href="Duke.xsl" base="file:/home/jlh5498/Pub2TEI/Stylesheets/Publishers.xsl"
Stylesheet compilation time: 7.267641712s (7267.641712ms)
Processing file:/home/jlh5498/data/00/3e/f5/48/003ef548-e101-4991-a4e2-f23804aaa8a2/003ef548-e101-4991-a4e2-f23804aaa8a2.nxml
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for file:/home/jlh5498/data/00/3e/f5/48/003ef548-e101-4991-a4e2-f23804aaa8a2/003ef548-e101-4991-a4e2-f23804aaa8a2.nxml using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 15.181792ms
Tree size: 5951 nodes, 111427 characters, 1523 attributes
Converting an NLM 2.2 article
Current: pmc-release
Pubdate year: 2022
Current: ppub
Pubdate year: 2022
Unknown element: name: media - local-name: media -
            namespace-uri:  -
                xlink:href="mmc4.mp4" 
Unknown element: name: part-title - local-name: part-title -
            namespace-uri:  -

Unknown element: name: part-title - local-name: part-title -
            namespace-uri:  -

Unknown element: name: media - local-name: media -
            namespace-uri:  -
                xlink:href="mmc1.pdf" 
Unknown element: name: media - local-name: media -
            namespace-uri:  -
                xlink:href="mmc2.xlsx" 
Unknown element: name: media - local-name: media -
            namespace-uri:  -
                xlink:href="mmc3.xlsx" 
Unknown element: name: media - local-name: media -
            namespace-uri:  -
                xlink:href="mmc5.pdf" 
Unknown element: name: funding-source - local-name: funding-source -
            namespace-uri:  -
                id="gs1" 
Unknown element: name: funding-source - local-name: funding-source -
            namespace-uri:  -
                id="gs2" 
Unknown element: name: funding-source - local-name: funding-source -
            namespace-uri:  -
                id="gs3" 
Unknown element: name: funding-source - local-name: funding-source -
            namespace-uri:  -
                id="gs4" 
Unknown element: name: funding-source - local-name: funding-source -
            namespace-uri:  -
                id="gs5" 
Unknown element: name: funding-source - local-name: funding-source -
            namespace-uri:  -
                id="gs6" 
Unknown element: name: funding-source - local-name: funding-source -
            namespace-uri:  -
                id="gs7" 
Unknown element: name: funding-source - local-name: funding-source -
            namespace-uri:  -
                id="gs8" 
Unknown element: name: funding-source - local-name: funding-source -
            namespace-uri:  -
                id="gs9" 
Unknown element: name: funding-source - local-name: funding-source -
            namespace-uri:  -
                id="gs10" 
Unknown element: name: funding-source - local-name: funding-source -
            namespace-uri:  -
                id="gs11" 
Unknown element: name: funding-source - local-name: funding-source -
            namespace-uri:  -
                id="gs12" 
Unknown element: name: funding-source - local-name: funding-source -
            namespace-uri:  -
                id="gs13" 
Unknown element: name: funding-source - local-name: funding-source -
            namespace-uri:  -
                id="gs14" 
Execution time: 221.782208ms
Memory used: 353,674,240

jameshowison commented 1 year ago

The only difference I can see between https://github.com/kermitt2/article_dataset_builder/blob/master/article_dataset_builder/nlm2tei.py and the commandline version is the order of the params (-t is last in the code but second to last on the commandline). No idea if that matters :)

kermitt2 commented 1 year ago

> The only difference I can see between https://github.com/kermitt2/article_dataset_builder/blob/master/article_dataset_builder/nlm2tei.py and the commandline version is the order of the params (-t is last in the code but second to last on the commandline). No idea if that matters :)

It seems that it matters! I updated the article_dataset_builder package one more time with the different argument order; it is now version 0.2.6.

jameshowison commented 1 year ago

Yup, that worked! Weird, but there are a lot of steps in that code execution (At least python string --> commandline --> java --> Saxon argument parsing ...)
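For reference, here is a sketch of calling Saxon from Python with the arguments in the same order as the command line that worked, with -t before the --parserFeature option. This is not the code in nlm2tei.py: the names `build_saxon_command` and `run_saxon` are made up for illustration, and the jar and stylesheet paths are assumed to be relative to a Pub2TEI checkout, as on the command line.

```python
import subprocess

# The option exactly as used on the command line in this thread.
PARSER_FEATURE = (
    "--parserFeature?uri=http%3A//apache.org/xml/features/"
    "nonvalidating/load-external-dtd:false"
)


def build_saxon_command(source, output,
                        jar="Samples/saxon9he.jar",
                        stylesheet="Stylesheets/Publishers.xsl"):
    """Build the argument list in the order used on the command line.

    Hypothetical helper; the default paths assume the working directory
    is a Pub2TEI checkout.
    """
    return [
        "java", "-jar", jar,
        f"-s:{source}",
        f"-xsl:{stylesheet}",
        f"-o:{output}",
        "-dtd:off", "-a:off", "-expand:off",
        "-t",
        PARSER_FEATURE,  # kept last, after -t
    ]


def run_saxon(source, output, pub2tei_dir="."):
    """Run the transformation from pub2tei_dir and capture its messages."""
    return subprocess.run(
        build_saxon_command(source, output),
        cwd=pub2tei_dir,
        capture_output=True,
        text=True,
    )
```

Passing the arguments as a list means no shell is involved, which takes one step out of the chain above, though whether that would have changed anything here is not something this thread establishes.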

It's now showing 11 conversion errors, although it's a little difficult to see what they are given the output (lots of stuff about mathml and unknown elements). No messages about the DTD.

harvester.diagnostic(full=True)

shows

total entries: 1488
...
total entries with Pub2TEI TEI file: 1456

So a few more gremlins, but I think this particular one is sorted :)
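To keep an eye on those numbers across runs, the two counts can be pulled out of the diagnostic text. A rough sketch, assuming the output has been captured as a string containing the two lines shown above; `tei_coverage` is a made-up name, and how the text is captured from harvester.diagnostic is left open:

```python
import re


def tei_coverage(report):
    """Return (total, with_tei, without_tei) parsed from diagnostic text.

    Hypothetical helper: it expects the two labelled lines quoted in
    this thread, each on its own line.
    """
    def count(label):
        match = re.search(
            rf"^{re.escape(label)}:\s*(\d+)\s*$", report, flags=re.MULTILINE
        )
        if match is None:
            raise ValueError(f"line not found: {label!r}")
        return int(match.group(1))

    total = count("total entries")
    with_tei = count("total entries with Pub2TEI TEI file")
    return total, with_tei, total - with_tei


report = """total entries: 1488
total entries with Pub2TEI TEI file: 1456"""
print(tei_coverage(report))  # (1488, 1456, 32)
```

With the numbers above, 32 of the 1488 entries have no Pub2TEI TEI file.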