Closed cclauss closed 2 months ago
@cclauss number_of_pages
should be a number / integer: https://github.com/internetarchive/openlibrary-client/blob/master/olclient/schemata/edition.schema.json#L91
Those strings with roman numerals look like they would be valid in the pagination
field, which is a string https://www.w3.org/2001/sw/wiki/Library_terminology_informally_explained#pagination
Thanks for the pointers Charles. We think that we know which bot is creating these records so we will try to implement a fix.
@cclauss There are two fields on the form, one is for page count ('How many pages?') and that is strictly integers. The other is for pagination and this field is unrestricted to allow for roman numerals, volume count, maps, etc.
@seabelis To be fair, we've never been entirely consistent about this (please pay no attention to the small minds' bugaboos). Should a pagination of "xxii, 550" be accompanied by an integer page count of 550, or more correctly 572?
@LeadSongDog Probably, yes. But in all honesty I pay more attention to the string field as I find it more helpful. I don't always complete the integer field.
The pagination field makes no sense to a computer, so you'll never be able to sort by length, etc without the number_of_pages
field. I would expect "XXII, 516" to result in a number of pages value of either 538 or perhaps a slightly higher number equivalent to 2 x # of leaves.
As an integer field, it should never be equal to "". I would expect that an unknown value would be encoded by the field not being present (or perhaps have a value of None
).
Note that anything that has been scanned has an exact page count that can be picked up in from the scanning data, so no need for human entry.
@guyjeangilles Can you please try to discover which of our import bots is bringing in roman numbers into an integer field?
The problem looks like it could be coming from this code, if this script is active for ImportBot: https://github.com/internetarchive/openlibrary-bots/blob/master/BWBImportBot/import-ol.py#L42
Although I don't know how that code is supposed to work - I don't think pagination
is ever set for the input to that script. It looks wrong, but may never actually trigger. It's perhaps more likely that pagination non-integer data has been entered in the wrong field on some source records.
@mekarpeles I see there has been a change to the bots repo code from what I wrote originally in the parse-biblio.py
script:
Original:
Current:
pagination
is the correct OL destination field to use when copying that data. The committed version in openlibrary-bots looks like a WIP straddling python 2 and 3, so I'm not sure if that code is running anywhere, but if the same source has been copied and is running elsewhere, it is populating the wrong field.
https://github.com/internetarchive/openlibrary-bots/blob/master/BWBImportBot/import-ol.py#L41-L42 also looks like a work around attempt to use pagination
for number_of_pages
, which is also incorrect.
Source data description of that field confirming roman numerals are allowed: "Total Number of pages in format: Roman page count (if present), Arabic page count (i.e., xxiii, 531)"
The original script is Python3 and committed (non-public) on my Archive gitlab. The first commit in openlibrary-bots includes modifications I didn't make.
@hornc Given your understanding of this code, could you please create a pull request that fixes this issue?
Question
Can we parse this data into valid
number_of_pages
(int) for the following books?The data for
number_of_pages
for the first book is coming through asXXII, 516
. Python does not like that... ValueError infogami.infobase.common in parse_data errorinvalid literal for int() with base 10: ''Does this book have 516 pages? Is the same true for the other three books below? Should we add a bit of code to get rid of the roman numerals and set the
number_of_pages
to 516? Or perhaps the book has 22 pages (XXII)!! Should we have a bot that seeks out this kinda data and cleans it up?For extra credit... Sometimes
number_of_pages
is the empty string. Should this code just returnnumber_of_pages=0
?Additional context
Stakeholders