ncbi / PubReader

A new way to view journal articles

how to transform the nxml to the xml or html in order to use the pubreader codes. #6

Closed kwen94 closed 4 years ago

kwen94 commented 5 years ago

Hello, the file format in ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/00/00/ is nxml, but the test program you provided at https://github.com/ncbi/PubReader uses the xml format. I would appreciate it a lot if you could tell me how to transform the nxml to xml or html in order to use the PubReader code.

jats-laura commented 5 years ago

*.nxml files are the XML output of the PMC production process that ensures compliance with PMC style--they are XML. Please see the documentation on the FTP service: https://www.ncbi.nlm.nih.gov/pmc/tools/ftp/
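
To illustrate that point, here is a minimal sketch showing that .nxml content can be read with an ordinary XML parser. The sample document is made up for illustration and is not a real PMC article; real files also carry a DOCTYPE declaration, which this sample omits. The function name `read_article_title` is hypothetical.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up stand-in for the contents of a .nxml file (not a real PMC article).
NXML_SAMPLE = """<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>Example title</article-title>
      </title-group>
    </article-meta>
  </front>
  <body><p>Example paragraph.</p></body>
</article>"""


def read_article_title(nxml_text):
    """Parse .nxml content as plain XML and return the article title, if any."""
    root = ET.fromstring(nxml_text)
    title = root.find(".//article-title")
    if title is None:
        return None
    # itertext() also collects text nested inside inline markup such as <italic>.
    return "".join(title.itertext()).strip()


print(read_article_title(NXML_SAMPLE))
```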

rdmpage commented 5 years ago

Maybe it would be helpful if this repository made it explicit that the XML used is unique to this project and has nothing to do with the JATS XML you get from, say, PMC; see https://github.com/ncbi/PubReader/issues/4#issuecomment-371562762 . Simply saying "oh, you can't get from there to here" leaves the user stranded, and also wondering what exactly the point of this project is. Why not release some code (e.g., an XSLT file) to transform JATS XML into the XML used, or even directly into the HTML format for PubReader? It's a pity such a cool project is let down by making it super hard for people to use.

kolotev commented 5 years ago

Let me add my 2 cents by providing a simple recipe for the original question.

> how to transform the nxml to the xml or html in order to use the pubreader codes.

Here is the suggested recipe:

Thank you.

kwen94 commented 5 years ago

So according to what you said above, I think PubReader is not used for the nxml files from PMC, but for the xml in the test folder. If I want to transform the nxml in PMC to html, I should turn to other methods. Could I say that the JATS xml depends on the file at https://jats.nlm.nih.gov/archiving/1.1d1/JATS-archivearticle1.dtd? And should I implement the tag set like %journalmeta.ent mentioned in JATS-archivearticle1.dtd? I look forward to your kind advice. Thanks a lot.

kolotev commented 5 years ago

> PubReader is not used for the nxml files from PMC

That is a correct statement. PMC uses it on XHTML content generated from nxml files.

> If I want to transform the nxml in the PMC to html, I should turn to other methods.

That is correct. You can use any tool, or write your own, to convert a source nxml file into HTML or XHTML.
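
As a rough idea of what "write your own" could look like, here is a minimal sketch that walks the `<body>` of an nxml document and emits HTML. The tag mapping `TAG_MAP` and the function names are hypothetical and deliberately incomplete, and the output is plain HTML; it does not reproduce the markup structure that PubReader itself expects.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical, deliberately incomplete mapping from JATS element names to HTML tags.
TAG_MAP = {"sec": "section", "title": "h2", "p": "p", "italic": "i", "bold": "b"}


def to_html(elem):
    """Recursively render one element; unmapped elements fall back to <span>."""
    tag = TAG_MAP.get(elem.tag, "span")
    parts = [f"<{tag}>", escape(elem.text or "")]
    for child in elem:
        parts.append(to_html(child))
        # Text that follows a child element belongs to the parent's content.
        parts.append(escape(child.tail or ""))
    parts.append(f"</{tag}>")
    return "".join(parts)


def nxml_body_to_html(nxml_text):
    """Convert the children of <body> to an HTML fragment (empty string if no body)."""
    root = ET.fromstring(nxml_text)
    body = root.find("body")
    if body is None:
        return ""
    return "".join(to_html(child) for child in body)


# Made-up sample input, not a real PMC article.
SAMPLE = (
    "<article><body><sec><title>Intro</title>"
    "<p>Some <italic>text</italic> here.</p></sec></body></article>"
)
print(nxml_body_to_html(SAMPLE))
```

A real converter would also need to handle front matter, references, figures, tables, and math, which is why an existing tool or an XSLT stylesheet may be the more practical route.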

> Could I say that the JATS xml depends on the file in https://jats.nlm.nih.gov/archiving/1.1d1/JATS-archivearticle1.dtd?

The short answer is yes. You may find other versions of the JATS or NLM DTDs used by PMC in nxml files. You can find detailed documentation on the NLM and JATS DTDs at https://dtd.nlm.nih.gov/archiving/

> And I should implement the tag set like %journalmeta.ent mentioned in JATS-archivearticle1.dtd?

You should not implement it. It is just a reference to definitions in another file, which is part of that DTD. Keep in mind that it is a modular DTD (it consists of multiple files), and that the link (https://jats.nlm.nih.gov/archiving/1.1d1/JATS-archivearticle1.dtd) points to the main file (the entry point) of the DTD. The XML parser is supposed to pull in the whole DTD at the time of parsing/validation.
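
To make the "modular DTD" idea concrete, here is a small sketch that lists the external modules an entry-point DTD file refers to. The DTD fragment is illustrative only: it follows the general shape of an external parameter entity declaration and is not copied from the JATS DTD, and the identifiers in it are invented. The function name `list_modules` is hypothetical.

```python
import re

# Illustrative fragment in the style of a modular DTD entry file.
# The public and system identifiers are invented, not taken from the JATS DTD.
DTD_FRAGMENT = """
<!ENTITY % journalmeta.ent PUBLIC
  "-//EXAMPLE//Journal Metadata Elements//EN"
  "example-journalmeta.ent">
%journalmeta.ent;
"""

# Matches an external parameter entity declaration with PUBLIC and system identifiers.
ENTITY_DECL = re.compile(
    r'<!ENTITY\s+%\s+(?P<name>[\w.\-]+)\s+PUBLIC\s+'
    r'"(?P<public>[^"]*)"\s+"(?P<system>[^"]*)"\s*>'
)


def list_modules(dtd_text):
    """Return (entity name, system identifier) pairs for each declared module."""
    return [(m.group("name"), m.group("system")) for m in ENTITY_DECL.finditer(dtd_text)]


print(list_modules(DTD_FRAGMENT))
```

The declaration names a separate file, and the `%journalmeta.ent;` reference is where its contents are pulled in, so nothing needs to be implemented by the person writing or reading the XML.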

If you would like to download the whole DTD or the equivalent schemas to your local machine, you can go here: ftp://ftp.ncbi.nih.gov/pub/jats/archiving/1.1d1/

ncftp /pub/jats/archiving/1.1d1 > ls -la 
-r--r--r--   1 ftp      anonymous  4015766 Dec  6  2013 Archiving-1.1d1-TagLibrary.zip
-r--r--r--   1 ftp      anonymous    22188 Nov 14  2013 Archiving-Readme.txt
-r--r--r--   1 ftp      anonymous   278865 Dec  3  2013 JATS-Archiving-1.1d1-MathML2-DTD.zip
-r--r--r--   1 ftp      anonymous   278922 Dec  3  2013 JATS-Archiving-1.1d1-MathML3-DTD.zip
-r--r--r--   1 ftp      anonymous   309005 Dec  3  2013 JATS-Archiving-1.1d1-OASIS-MathML2-DTD.zip
-r--r--r--   1 ftp      anonymous   304667 Dec  3  2013 JATS-Archiving-1.1d1-OASIS-MathML3-DTD.zip
dr-xr-xr-x   2 ftp      anonymous     4096 Dec  9  2013 rng
dr-xr-xr-x   2 ftp      anonymous     4096 Jan  3  2014 xsd

One more aspect to take into account: PMC uses nxml files generated over a long period of time, so you may find different nxml files using different versions of the NLM or JATS DTDs.
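
One way to find out which DTD a given nxml file declares is to read its DOCTYPE. The sketch below uses Python's standard `xml.dom.minidom` for that; the sample document and its identifiers are invented for illustration, and the function name `doctype_ids` is hypothetical.

```python
from xml.dom import minidom

# Made-up sample; the public and system identifiers are invented for illustration.
SAMPLE = """<?xml version="1.0"?>
<!DOCTYPE article PUBLIC "-//EXAMPLE//DTD Example Article v1.0//EN" "example-article.dtd">
<article><body><p>Text</p></body></article>"""


def doctype_ids(nxml_text):
    """Return (public identifier, system identifier) from the DOCTYPE, or None."""
    doc = minidom.parseString(nxml_text)
    doctype = doc.doctype
    if doctype is None:
        return None
    return doctype.publicId, doctype.systemId


print(doctype_ids(SAMPLE))
```

Grouping files by the returned public identifier would show how many DTD versions a converter has to cope with.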

I hope these comments help you in your efforts.