microsimulation / ijm

A central place for general issues, documents, scripts and resources for the IJM
https://microsimulation.org/ijm/
MIT License

Convert IJM Issue 15(1) #148

Closed pbronka closed 2 years ago

pbronka commented 2 years ago

Hi @gnott ,

Would you be able to convert IJM Issue 15(1) for us please?

It's the following files from the S3 bucket:

| Name | Type | Last modified | Size | Storage class |
| --- | --- | --- | --- | --- |
| ijm-00244.zip | zip | May 26, 2022, 15:50:20 (UTC+01:00) | 63.8 KB | Standard |
| ijm-00245.zip | zip | May 26, 2022, 15:50:20 (UTC+01:00) | 87.3 KB | Standard |
| ijm-00246.zip | zip | May 26, 2022, 15:50:20 (UTC+01:00) | 217.6 KB | Standard |
| ijm-00247.zip | zip | May 26, 2022, 15:50:20 (UTC+01:00) | 1.3 MB | Standard |
| ijm-00248.zip | zip | May 26, 2022, 15:50:18 (UTC+01:00) | 409.3 KB | Standard |
| ijm-00249.zip | zip | May 26, 2022, 15:50:17 (UTC+01:00) | 683.7 KB | Standard |
| ijm-00250.zip | zip | May 26, 2022, 15:50:17 (UTC+01:00) | 315.8 KB | Standard |
| ijm-00251.zip | zip | May 26, 2022, 15:50:18 (UTC+01:00) | 263.9 KB | Standard |
| ijm-00252.zip | zip | May 26, 2022, 15:50:11 (UTC+01:00) | 194.3 KB | Standard |
| ijm-00253.zip | zip | May 26, 2022, 15:50:12 (UTC+01:00) | 44.9 MB | Standard |
| ijm-00254.zip | zip | May 26, 2022, 15:50:18 (UTC+01:00) | 123.6 KB | Standard |
| ijm-00255.zip | zip | May 26, 2022, 15:50:19 (UTC+01:00) | 297.7 KB | Standard |
| ijm-00256.zip | zip | May 26, 2022, 15:50:19 (UTC+01:00) | 339.3 KB | Standard |
| ijm-00257.zip | zip | May 26, 2022, 15:50:19 (UTC+01:00) | 558.1 KB | Standard |
| ijm-00258.zip | zip | May 26, 2022, 15:50:20 (UTC+01:00) | 5.1 MB | Standard |
| ijm-00259.zip | zip | May 26, 2022, 15:50:20 (UTC+01:00) | 185.3 KB | Standard |

Thank you!

gnott commented 2 years ago

Today is a good day to convert content, I did not encounter any validation issues and I created this PR with the content for review https://github.com/microsimulation/ijm/pull/149.

The only message I noticed in the XML to JSON conversion was where the code automatically formatted a URL for article 00256:

WARNING - 2022-05-26 17:04:52,441 - broken url: 'www.sesim.org' has become 'http://www.sesim.org' -- {"stack_info": null}
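For context, a fix-up of this kind can be sketched roughly as below. This is only an illustration of the idea, not the converter's actual code: the function name `ensure_scheme`, the logger name and the `http` default are all assumptions made for the example.

```python
import logging
from urllib.parse import urlparse

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("url_fix_sketch")


def ensure_scheme(url: str, default_scheme: str = "http") -> str:
    """Return the URL unchanged if it has a scheme, else prepend one."""
    cleaned = url.strip()
    # urlparse reports an empty scheme for bare host names like 'www.sesim.org'.
    if not cleaned or urlparse(cleaned).scheme:
        return cleaned
    fixed = f"{default_scheme}://{cleaned}"
    logger.warning("broken url: %r has become %r", url, fixed)
    return fixed


print(ensure_scheme("www.sesim.org"))
print(ensure_scheme("https://microsimulation.pub/articles/00245"))
```

A URL that already carries a scheme passes through untouched, so the warning would only appear for references like the one in article 00256.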

On the front-end of the journal I noticed something in the interface, where on the left side navigation the Figures and data link appears on some articles, but it results in an error page when you click on it, for example for article 00245 and others. It could be if the type is Research article then the link is shown? I don't know how the production environment of the site responds if the link is clicked.

pbronka commented 2 years ago

> Today is a good day to convert content, I did not encounter any validation issues and I created this PR with the content for review #149.
>
> The only message I noticed in the XML to JSON conversion was where the code automatically formatted a URL for article 00256:
>
> WARNING - 2022-05-26 17:04:52,441 - broken url: 'www.sesim.org' has become 'http://www.sesim.org' -- {"stack_info": null}
>
> On the front-end of the journal I noticed something in the interface, where on the left side navigation the Figures and data link appears on some articles, but it results in an error page when you click on it, for example for article 00245 and others. It could be if the type is Research article then the link is shown? I don't know how the production environment of the site responds if the link is clicked.

Thank you for spotting this. It would appear that the link is present for all research articles, but if an article doesn't have any figures, the link results in an error. I suppose this is something that would require modifications to the frontend of the website to solve?

I also noticed that some references have a lot of extra characters, e.g. in https://microsimulation.pub/articles/00245

, , , Den Privata Konsumtionen 1950-1970, Industriens Utredningsinstitut, , Stockholm, .

These extra characters show up in the JSON file:

"details": "\n, \n, \n, Den Privata Konsumtionen 1950-1970, Industriens Utredningsinstitut, \n, Stockholm, \n",

but not in the XML. Is this something that could be resolved in the conversion process, or is there some problem with the XML?

gnott commented 2 years ago

I'm not exactly sure how the Figures and data link logic works. I would investigate the front-end code first, since I couldn't find a reason in the JSON for why it would appear.

For the extra newline \n sequences, it looks like they appear on citations of type unknown, and when those are generated the parser is keeping the newline whitespace characters from the XML. The unknown type is generated when the incomplete data will cause a validation error. I might be able to pinpoint the reason and try to mitigate it by removing or omitting the extra newlines, what do you think?

pbronka commented 2 years ago

> I might be able to pinpoint the reason and try to mitigate it by removing or omitting the extra newlines, what do you think?

Thank you, that would be great.

> The unknown type is generated when the incomplete data will cause a validation error.

It sounds like a problem with the XML then? If you could let me know which elements are missing and causing the validation error I'll try to make sure they are there the next time.

gnott commented 2 years ago

The "type": "unknown" references may be hard to resolve, as there are many of them. The type exists so that older eLife XML could fit into the stricter schema without changing the older XML. The reasons vary, but I can provide more details if you want to look into it.

~For example, article 00245 citation bib7 is type book in the XML. According to the parser logic I would guess it does not have a book title in the XML, and is therefore formatted as an unknown type reference since it would fail to validate as a book reference.~

Clarification to the above: it does not have publisher data, which would prevent it from being a valid book reference.

gnott commented 2 years ago

Clarified the above example: it is missing a publisher (specifically a <publisher-name> XML tag, I think). It shows how difficult it is to diagnose each citation.
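To check an article before conversion, something along these lines could list book citations that have no publisher. It is a rough sketch under assumptions: it presumes JATS-style `<ref>` and `<element-citation publication-type="book">` elements, the function name `books_missing_publisher` and the sample XML are made up for illustration, and the real validation may well require more than a publisher.

```python
import xml.etree.ElementTree as ET
from typing import List


def books_missing_publisher(xml_text: str) -> List[str]:
    """Return ids of <ref> elements whose book citation lacks a publisher name."""
    root = ET.fromstring(xml_text)
    missing = []
    for ref in root.iter("ref"):
        for citation in ref.iter("element-citation"):
            if citation.get("publication-type") != "book":
                continue
            name = citation.find("publisher-name")
            # Treat an absent or empty <publisher-name> as missing.
            if name is None or not (name.text or "").strip():
                missing.append(ref.get("id", "(no id)"))
    return missing


# Hypothetical sample, not taken from any IJM article.
SAMPLE = """<ref-list>
  <ref id="bib1"><element-citation publication-type="book">
    <source>A Book</source><publisher-name>A Publisher</publisher-name>
  </element-citation></ref>
  <ref id="bib2"><element-citation publication-type="book">
    <source>Another Book</source>
  </element-citation></ref>
  <ref id="bib3"><element-citation publication-type="journal">
    <source>A Journal</source>
  </element-citation></ref>
</ref-list>"""

print(books_missing_publisher(SAMPLE))
```

Running it over each article XML would give a list of reference ids to look at by hand, which may be quicker than diagnosing each citation from the JSON output.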

gnott commented 2 years ago

The \n character replacement attempt worked well; there is a new PR with corrected JSON output: https://github.com/microsimulation/ijm/pull/150.
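For anyone reading later, a clean-up of this kind can be sketched as follows. It is a minimal illustration of the idea, not necessarily what the PR does: the function name `clean_details` is hypothetical, and it assumes the details string is a comma-separated list in which empty fields can simply be dropped.

```python
def clean_details(details: str) -> str:
    """Collapse whitespace in each comma-separated field and drop empty fields."""
    # str.split() with no argument splits on any whitespace, including newlines.
    parts = (" ".join(part.split()) for part in details.split(","))
    return ", ".join(part for part in parts if part)


# The details value quoted earlier in this thread for article 00245.
raw = (
    "\n, \n, \n, Den Privata Konsumtionen 1950-1970, "
    "Industriens Utredningsinstitut, \n, Stockholm, \n"
)
print(clean_details(raw))
```

Note that this also removes the leftover separators between empty fields, so the result is just the non-empty parts joined by commas.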