microsimulation / ijm

A central place for general issues, documents, scripts and resources for the IJM
https://microsimulation.org/ijm/
MIT License

A new issue and updates to some existing files ready to be converted #112

Open BlueReZZ opened 3 years ago

BlueReZZ commented 3 years ago

A number of files have been updated in the S3 bucket and require conversion to JSON etc.

Specifically there is a new issue 13-2 Summer 2020 and corrections made to articles in previous issues.

@gnott are you able to convert these and raise a pull request for us please?

Tasks

gnott commented 3 years ago

I'm up to the XML to JSON conversion stage, and a small character encoding error is reported for article 00208. Viewing the XML in Google Chrome also reports an error:

error on line 225 at column 20: Encoding error

That line holds the <funding-statement>, which contains (2014�2020); the unrecognised character looks to be the problem. I think it should be an en dash, or more safely the character reference &#x2013;, which is used elsewhere in the XML for an en dash.

Rather than altering the XML, I'll wait to hear back whether this can be altered and resupplied from the vendor.
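To help pin down this kind of error before going back to the vendor, here is a minimal sketch that reports where a file stops being valid UTF-8. The function name and the sample bytes are illustrative only, and the guess that the stray byte is a Windows-1252 en dash (0x96) is an assumption, not something confirmed for article 00208.

```python
def find_invalid_utf8(data: bytes):
    """Return (line, column, byte) for each byte that breaks UTF-8 decoding.

    Line and column are 1-indexed and counted in bytes, so the column may
    differ from what a browser reports.
    """
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as err:
            bad = pos + err.start
            line = data.count(b"\n", 0, bad) + 1
            column = bad - (data.rfind(b"\n", 0, bad) + 1) + 1
            problems.append((line, column, data[bad:bad + 1]))
            pos = bad + 1  # skip the offending byte and keep scanning
    return problems


# Hypothetical sample: an en dash saved as Windows-1252 (assumption).
sample = b"<funding-statement>(2014\x962020)</funding-statement>\n"
for line, column, byte in find_invalid_utf8(sample):
    print(f"line {line}, column {column}: byte {byte!r}")
```

If the byte does turn out to be 0x96, decoding it as cp1252 gives the en dash (U+2013), which matches the &#x2013; reference used elsewhere in the XML.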

gnott commented 3 years ago

Additionally, in the XML for article 00208, there appears to be a tag mismatch in lines 714-715:

<p>The EU-SILC data were made available by Eurostat under contract 175/2015-EU-SILC-ECHP-LFS.<P>
<P>The HHoT is now available as an integral part of EUROMOD.</p>

The uppercase <P> tags are the cause: XML is case-sensitive, so a <P> tag does not close a <p> tag. The JATS parser we have is lenient enough that it does not check for this well-formedness error.
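Since the JATS parser does not catch this, a strict check could be run first. The following is a minimal sketch using Python's standard library parser; the function name and the shortened sample strings are illustrative, not taken from the conversion tooling.

```python
import xml.etree.ElementTree as ET


def wellformedness_error(xml_text):
    """Return the parser's error message, or None if the XML is well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None


# Shortened stand-ins for lines 714-715 (illustrative content only).
bad = "<body><p>First paragraph.<P>\n<P>Second paragraph.</p></body>"
good = "<body><p>First paragraph.</p>\n<p>Second paragraph.</p></body>"

print(wellformedness_error(bad))   # a message naming the line and column
print(wellformedness_error(good))  # None
```

A strict parser rejects the first string because the uppercase and lowercase tags are treated as different element names.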

gnott commented 3 years ago

A validation error with the JSON generated from the 00217 XML, which was also identified in the December 18th, 2020 issues, is that <sec id="s13"> has no content, only a title.

Loading the page on the website for 00217 results in an exception, so it's a validation error we cannot ignore.

If you could please either add content to this section or restructure it so that no section in the XML is left without content, and then resupply the zip file to the bucket, I can convert it again.

This is blocking the PR for all issue 13-2 content.
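A quick pre-check for this problem could look like the sketch below, which lists every <sec> that has a title but nothing else. It assumes the only requirement is "at least one child other than <title>"; the real JSON schema validation may be stricter, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def empty_sections(xml_text):
    """Return the ids of <sec> elements whose only children are <title>."""
    root = ET.fromstring(xml_text)
    empty = []
    for sec in root.iter("sec"):
        content = [child for child in sec if child.tag != "title"]
        if not content:
            empty.append(sec.get("id"))
    return empty


# Illustrative fragment: s13 has only a title, s14 has a paragraph.
sample = (
    "<body>"
    '<sec id="s13"><title>C. Data processing</title></sec>'
    '<sec id="s14"><title>Next section</title><p>Text.</p></sec>'
    "</body>"
)
print(empty_sections(sample))
```

A section that contains nested <sec> elements counts as having content under this check.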

gnott commented 3 years ago

In the zip for article 00214, the PDF file name ijm.00214.pdf is incorrect, and will not overwrite the existing PDF named ijm-00214.pdf. Please arrange for this file to be renamed and the article zip to be resupplied to the S3 bucket.
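File names like this could be checked before upload. The sketch below assumes a naming pattern inferred only from the names mentioned in this thread (ijm-00214.pdf, ijm-00218-code1.zip, ijm-00218-inline001.jpg): the prefix "ijm-", a five-digit article number, an optional hyphenated suffix, and an extension. The pattern is a guess, not a documented rule.

```python
import re

# Assumed pattern: ijm-<5 digits>[-<suffix>].<extension>
NAME_PATTERN = re.compile(r"ijm-\d{5}(-[a-z0-9]+)?\.[a-z0-9]+")


def is_expected_name(file_name):
    """Return True if the file name matches the assumed IJM naming pattern."""
    return NAME_PATTERN.fullmatch(file_name) is not None


for name in ["ijm.00214.pdf", "ijm-00214.pdf", "ijm-00218-code1.zip"]:
    status = "ok" if is_expected_name(name) else "rename needed"
    print(f"{name}: {status}")
```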

gnott commented 3 years ago

Thanks for creating the issue @BlueReZZ.

I created PR https://github.com/microsimulation/ijm/pull/113 so far.

I added some tasks to reach completion in the original comment. If you could please arrange for the "Resupply" and "Please confirm" tasks to be reviewed and implemented on your side, then I can generate and check the remaining content and create the next PRs.

BlueReZZ commented 3 years ago

Thanks @gnott that's really helpful, I'll pass on the resupply and confirmation tasks to @pbronka and he can liaise with Exeter.

pbronka commented 3 years ago

Thank you for checking and listing these issues @gnott and @BlueReZZ , I have asked Bala to fix them and resupply the articles. The only exception is the last point, i.e. the publication date for articles 00215-00220: it seems to be set to 31/08/2020 in the files I can see in the S3 AWS bucket, which I confirm is correct.

gnott commented 3 years ago

Thanks @pbronka, it was my error about the new article dates, they are correct. I must not have followed the complete procedure to copy them into the ijm project and I was looking at the old JSON files. Since article 00217 will not load, I was holding them back until the entire issue's worth of articles is ready, and I must not have checked them too carefully. Good catch!

eLife-Exeter commented 3 years ago

@gnott I have corrected the articles 00208 and 00214 and uploaded to s3 bucket.

For 00217, we have used the tags as shown below and uploaded the file to the s3 bucket.

<sec id="s13">
<title>C. Data processing and method for creation of new migrants and 12 year olds for dynamic microsimulation</title>
<sec id="s14">
...
</sec>
<sec id="s15">
...
</sec>
<sec id="s16">
...
</sec>
<sec id="s17">
...
</sec>
<sec id="s18">
...
</sec>
gnott commented 3 years ago

Thanks @eLife-Exeter for your message.

Regarding article 00217 sections, yes, the example you provided above, where the <sec> tags are nested inside the outer <sec> tag, is valid according to the schema, provided you include the final </sec> close tag at the end to close <sec id="s13">:

<sec id="s13">
<title>C. Data processing and method for creation of new migrants and 12 year olds for dynamic microsimulation</title>
<sec id="s14">
...
</sec>
<sec id="s15">
...
</sec>
<sec id="s16">
...
</sec>
<sec id="s17">
...
</sec>
<sec id="s18">
...
</sec>
</sec>
gnott commented 3 years ago

I noticed one final thing in the new articles, if you could please make one more change @eLife-Exeter?

The file name ijm-218-code1.zip I think would be better named as ijm-00218-code1.zip.

All the rest, as I can see, look great!

eLife-Exeter commented 3 years ago

@gnott I will make this change and let you know.

eLife-Exeter commented 3 years ago

@gnott ijm-00218 - I have updated this article to s3 bucket. Please check and confirm if this is okay.

gnott commented 3 years ago

Thanks @eLife-Exeter, the file name looks good to me.

There's an inline figure which didn't appear for me when testing, ijm-00218-inline001.jpg, and I don't think that is related to file names or the XML. It may only be due to my test configuration. Over to @BlueReZZ whether there is full support for inline images on the prod environment site.

gnott commented 3 years ago

The PRs are ending up with failed checks in CI, is that normal @BlueReZZ? Over to you, thanks!

pbronka commented 3 years ago

Thank you for converting the files @gnott, I can see the new issue on the website. Looking at the data and code availability (called data availability on the website) it seems that only the last <p> . </p> from that section in the XML is displayed, resulting in cut statements. Is that a conversion issue that you would be able to fix?

Also, for article ijm-00218, there are two links in the additional files section - the first one doesn't work and seems to be some local version.

I have also noticed that the competing interest section from the PDF/XML is not displayed on the website at all - is that due to the conversion or would some changes have to be made to the website to have these statements visible (with the acknowledgements, funding, etc.)?

gnott commented 3 years ago

Thanks for mentioning these points @pbronka, and I've tried to track down some details.

  1. Data availability paragraphs

The datasets content as structured for eLife articles is very specialised, and it splits the content into used, generated and availability content. I found one IJM example which has more than one paragraph, and they'd all be considered availability type. What I found is what looks like a typo in the JATS XML parser code, so that only one paragraph ends up in the JSON output. It will be interesting to see if eLife articles are affected too, if there are any articles with multiple paragraphs.

I will be fixing that typo in the parser, and then I can regenerate the JSON for articles.

  2. Links in additional files section of article 00218

The only example I can find is 00218 which has a file of this type, which is in the JSON file named "additionalFiles". It is possible the journal software is not configured to display this section optimally. I defer to @BlueReZZ for more details https://microsimulation.pub/articles/00218/figures#files

The other files with a http://web:8082/ prefix in the JSON files are for the article PDF file, and the journal must rewrite those into a correct URL.

Article 00218 is also the only article with an inline file, in the notes footer of Table 1 (https://microsimulation.pub/articles/00218/figures#table1), and I think it is not loading correctly either.

  3. Competing Interest section

This may be the XML content in the <sec sec-type="additional-information"... section in the <back>. I think the parser used in the conversion I am using will only take particular sections in the back matter to populate content in the JSON output, and it does not look at the "additional-information" one. I cannot suggest exactly how you might resolve it; if it is not following eLife-style XML then it may be omitted from the JSON output when using eLife software for conversion at this time.
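To see which back matter sections an article actually contains, a listing like the sketch below may help when deciding how to handle the competing interest statements. It only reads the XML; the function name and the sample fragment are illustrative, and it says nothing about which sec-type values the eLife parser picks up.

```python
import xml.etree.ElementTree as ET


def back_section_types(xml_text):
    """Return (sec-type, title) for every <sec> found inside <back>."""
    root = ET.fromstring(xml_text)
    back = root.find("back")
    if back is None:
        return []
    return [(sec.get("sec-type"), sec.findtext("title")) for sec in back.iter("sec")]


# Illustrative fragment, not copied from an IJM article.
sample = (
    "<article><back>"
    '<sec sec-type="additional-information">'
    "<title>Additional information</title>"
    "<p>No competing interests reported.</p>"
    "</sec>"
    "</back></article>"
)
print(back_section_types(sample))
```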

In summary, I will attempt to fix number 1, number 2 probably requires some PHP coding in the journal code, and number 3 may have a few possible solutions, depending on whether you would consider changing the XML or if there's a way to modify the XML to JSON conversion process I'm following.
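For number 1, the symptom reported (only the last paragraph surviving) is typical of a loop that overwrites a value instead of collecting it. The sketch below illustrates that class of bug only; it is not the actual parser code, and the function names and the sec-type value in the sample are assumptions.

```python
import xml.etree.ElementTree as ET


def last_paragraph_only(sec):
    """Buggy pattern: each paragraph replaces the previous one."""
    text = None
    for p in sec.findall("p"):
        text = "".join(p.itertext())  # overwrites on every iteration
    return [text] if text is not None else []


def all_paragraphs(sec):
    """Intended pattern: keep every paragraph in document order."""
    return ["".join(p.itertext()) for p in sec.findall("p")]


# Illustrative section with two availability paragraphs.
sec = ET.fromstring(
    '<sec sec-type="data-availability">'
    "<title>Data and code availability</title>"
    "<p>The data are available on request.</p>"
    "<p>The code is in the supplementary zip.</p>"
    "</sec>"
)
print(last_paragraph_only(sec))
print(all_paragraphs(sec))
```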

gnott commented 3 years ago

For data availability content, I corrected older articles in PR https://github.com/microsimulation/ijm/pull/113 to avoid any merge conflicts with those updated article JSON files.

A new PR https://github.com/microsimulation/ijm/pull/115 contains fixed articles from issue 13-2.