Closed (pbronka closed 2 years ago)
Today is a good day to convert content. I did not encounter any validation issues, and I created this PR with the content for review: https://github.com/microsimulation/ijm/pull/149.
The only message I noticed in the XML to JSON conversion was one where the code automatically reformatted a URL for article 00256:
WARNING - 2022-05-26 17:04:52,441 - broken url: 'www.sesim.org' has become 'http://www.sesim.org' -- {"stack_info": null}
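For readers unfamiliar with this kind of warning, here is a minimal sketch of how a scheme-less URL could be repaired and logged. It is not the converter's actual code; the function name `ensure_scheme`, the logger name, and the default scheme are assumptions made for illustration.

```python
import logging
from urllib.parse import urlparse

# Hypothetical logger name, not the one used by the real converter.
logger = logging.getLogger("url_check")


def ensure_scheme(url: str, default_scheme: str = "http") -> str:
    """Return the URL unchanged if it has a scheme, else prepend one and warn."""
    if urlparse(url).scheme:
        return url
    fixed = f"{default_scheme}://{url}"
    logger.warning("broken url: %r has become %r", url, fixed)
    return fixed


print(ensure_scheme("www.sesim.org"))
```

A URL that already carries a scheme passes through untouched, so only entries like `www.sesim.org` would trigger the message.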
On the front end of the journal I noticed an issue in the interface: in the left-side navigation, the Figures and data link appears on some articles, but clicking it results in an error page, for example for article 00245 and others. Could it be that the link is shown whenever the type is Research article? I don't know how the production environment of the site responds if the link is clicked.
Thank you for spotting this. It would appear that the link is present for all research articles, but if an article doesn't have any figures, clicking the link results in an error. I suppose this is something that would require modifications to the front end of the website to solve?
I also noticed that some references have a lot of extra characters, e.g. in https://microsimulation.pub/articles/00245:
, , , Den Privata Konsumtionen 1950-1970, Industriens Utredningsinstitut, , Stockholm, .
These extra characters show up in the json file:
"details": "\n, \n, \n, Den Privata Konsumtionen 1950-1970, Industriens Utredningsinstitut, \n, Stockholm, \n",
but not in the XML. Is this something that could be resolved in the conversion process, or is there some problem with the XML?
I'm not exactly sure how the Figures and data link logic works. I would investigate the front-end code first; I couldn't find a reason in the JSON for why it would appear.
For the extra newline \n sequences, it looks like they appear on citations of type unknown, and when those are generated the parser keeps the newline whitespace characters from the XML. The unknown type is generated when incomplete data would cause a validation error. I might be able to pinpoint the reason and try to mitigate it by removing or omitting the extra newlines, what do you think?
> I might be able to pinpoint the reason and try to mitigate it by removing or omitting the extra newlines, what do you think?
Thank you, that would be great.
> The unknown type is generated when the incomplete data will cause a validation error.
It sounds like a problem with the XML then? If you could let me know which elements are missing and causing the validation error I'll try to make sure they are there the next time.
The "type": "unknown" references may be hard to resolve; there are many of them. The type was introduced to allow older eLife XML to fit into the stricter schema without changing the older XML. The reasons vary, but I can provide more details if you want to look into it.
~For example, article 00245 citation bib7 is type book in the XML. According to the parser logic I would guess it does not have a book title in the XML, and is therefore formatted as an unknown type reference since it would fail to validate as a book reference.~
Clarification to the above: it does not have publisher data, which would cause it not to be a valid book reference.
Clarified the above example: it is missing a publisher (specifically a <publisher-name> XML tag, I think; it shows how difficult it is to diagnose each citation).
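To make this kind of diagnosis less manual, a rough sketch like the following could list book citations that lack a publisher before conversion. It assumes a JATS-style layout in which each `<ref>` wraps an `<element-citation publication-type="book">`; the function name and the sample XML are made up for illustration, and real article files may need extra handling (DOCTYPE declarations, other citation wrappers).

```python
import xml.etree.ElementTree as ET


def book_refs_missing_publisher(xml_text: str) -> list[str]:
    """Return ids of book references with no non-empty <publisher-name>."""
    root = ET.fromstring(xml_text)
    missing = []
    for ref in root.iter("ref"):
        citation = ref.find("element-citation")  # assumed JATS wrapper element
        if citation is None or citation.get("publication-type") != "book":
            continue
        name = citation.find("publisher-name")
        if name is None or not (name.text or "").strip():
            missing.append(ref.get("id", "?"))
    return missing


# Made-up sample data, not taken from any IJM article.
sample = """<article><back><ref-list>
<ref id="bib1"><element-citation publication-type="book"><source>A</source><publisher-name>P</publisher-name></element-citation></ref>
<ref id="bib2"><element-citation publication-type="book"><source>B</source></element-citation></ref>
<ref id="bib3"><element-citation publication-type="journal"><source>C</source></element-citation></ref>
</ref-list></back></article>"""

print(book_refs_missing_publisher(sample))  # ['bib2']
```

This only covers the missing-publisher case discussed here; other reasons for an unknown type would each need their own check.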
The \n character replacement attempt worked well; a new PR with the corrected JSON output is at https://github.com/microsimulation/ijm/pull/150.
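For illustration only, one way such a cleanup could work is sketched below. This is not the code from the PR; `clean_details` is a hypothetical helper that assumes the details value is a comma-separated string whose empty segments hold only whitespace.

```python
def clean_details(details: str) -> str:
    """Drop whitespace-only segments from a comma-separated details string."""
    parts = [part.strip() for part in details.split(",")]
    return ", ".join(part for part in parts if part)


# The details value quoted earlier in this thread.
raw = "\n, \n, \n, Den Privata Konsumtionen 1950-1970, Industriens Utredningsinstitut, \n, Stockholm, \n"
print(clean_details(raw))
# Den Privata Konsumtionen 1950-1970, Industriens Utredningsinstitut, Stockholm
```

Splitting on commas also normalizes spacing around commas inside the remaining text, which may or may not be desirable for every citation.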
Hi @gnott ,
Would you be able to convert IJM Issue 15(1) for us please?
It's the following files from the S3 bucket:
| Name | Type | Last modified | Size | Storage class |
| --- | --- | --- | --- | --- |
| ijm-00244.zip | zip | May 26, 2022, 15:50:20 (UTC+01:00) | 63.8 KB | Standard |
| ijm-00245.zip | zip | May 26, 2022, 15:50:20 (UTC+01:00) | 87.3 KB | Standard |
| ijm-00246.zip | zip | May 26, 2022, 15:50:20 (UTC+01:00) | 217.6 KB | Standard |
| ijm-00247.zip | zip | May 26, 2022, 15:50:20 (UTC+01:00) | 1.3 MB | Standard |
| ijm-00248.zip | zip | May 26, 2022, 15:50:18 (UTC+01:00) | 409.3 KB | Standard |
| ijm-00249.zip | zip | May 26, 2022, 15:50:17 (UTC+01:00) | 683.7 KB | Standard |
| ijm-00250.zip | zip | May 26, 2022, 15:50:17 (UTC+01:00) | 315.8 KB | Standard |
| ijm-00251.zip | zip | May 26, 2022, 15:50:18 (UTC+01:00) | 263.9 KB | Standard |
| ijm-00252.zip | zip | May 26, 2022, 15:50:11 (UTC+01:00) | 194.3 KB | Standard |
| ijm-00253.zip | zip | May 26, 2022, 15:50:12 (UTC+01:00) | 44.9 MB | Standard |
| ijm-00254.zip | zip | May 26, 2022, 15:50:18 (UTC+01:00) | 123.6 KB | Standard |
| ijm-00255.zip | zip | May 26, 2022, 15:50:19 (UTC+01:00) | 297.7 KB | Standard |
| ijm-00256.zip | zip | May 26, 2022, 15:50:19 (UTC+01:00) | 339.3 KB | Standard |
| ijm-00257.zip | zip | May 26, 2022, 15:50:19 (UTC+01:00) | 558.1 KB | Standard |
| ijm-00258.zip | zip | May 26, 2022, 15:50:20 (UTC+01:00) | 5.1 MB | Standard |
| ijm-00259.zip | zip | May 26, 2022, 15:50:20 (UTC+01:00) | 185.3 KB | Standard |
Thank you!