internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0
5.18k stars 1.35k forks source link

Roman numbers are not valid for number_of_pages #4204

Closed cclauss closed 2 months ago

cclauss commented 3 years ago

Question

Can we parse this data into valid number_of_pages (int) for the following books?

The data for number_of_pages for the first book is coming through as XXII, 516. Python does not like that... ValueError infogami.infobase.common in parse_data errorinvalid literal for int() with base 10: ''

python3 -c "int('XXII, 516')"
[ ... ]
ValueError: invalid literal for int() with base 10: 'XXII, 516'

Does this book have 516 pages? Is the same true for the other three books below? Should we add a bit of code to get rid of the roman numerals and set the number_of_pages to 516? Or perhaps the book has 22 pages (XXII)!! Should we have a bot that seeks out this kinda data and cleans it up?

For extra credit... Sometimes number_of_pages is the empty string. Should this code just return number_of_pages=0?

Additional context

Stakeholders

Screen Shot 2020-12-01 at 21 11 41 Screen Shot 2020-12-01 at 21 12 48 Screen Shot 2020-12-01 at 21 14 06 Screen Shot 2020-12-01 at 21 13 25

hornc commented 3 years ago

@cclauss number_of_pages should be a number / integer: https://github.com/internetarchive/openlibrary-client/blob/master/olclient/schemata/edition.schema.json#L91

Those strings with roman numerals look like they would be valid in the pagination field, which is a string https://www.w3.org/2001/sw/wiki/Library_terminology_informally_explained#pagination

cclauss commented 3 years ago

Thanks for the pointers Charles. We think that we know which bot is creating these records so we will try to implement a fix.

seabelis commented 3 years ago

@cclauss There are two fields on the form, one is for page count ('How many pages?') and that is strictly integers. The other is for pagination and this field is unrestricted to allow for roman numerals, volume count, maps, etc.

LeadSongDog commented 3 years ago

@seabelis To be fair, we've never been entirely consistent about this (please pay no attention to the small minds' bugaboos). Should a pagination of "xxii, 550" be accompanied by an integer page count of 550, or more correctly 572?

seabelis commented 3 years ago

@LeadSongDog Probably, yes. But in all honesty I pay more attention to the string field as I find it more helpful. I don't always complete the integer field.

tfmorris commented 3 years ago

The pagination field makes no sense to a computer, so you'll never be able to sort by length, etc without the number_of_pages field. I would expect "XXII, 516" to result in a number of pages value of either 538 or perhaps a slightly higher number equivalent to 2 x # of leaves.

As an integer field, it should never be equal to "". I would expect that an unknown value would be encoded by the field not being present (or perhaps have a value of None).

Note that anything that has been scanned has an exact page count that can be picked up in from the scanning data, so no need for human entry.

cclauss commented 3 years ago

@guyjeangilles Can you please try to discover which of our import bots is bringing in roman numbers into an integer field?

hornc commented 3 years ago

The problem looks like it could be coming from this code, if this script is active for ImportBot: https://github.com/internetarchive/openlibrary-bots/blob/master/BWBImportBot/import-ol.py#L42

Although I don't know how that code is supposed to work - I don't think pagination is ever set for the input to that script. It looks wrong, but may never actually trigger. It's perhaps more likely that pagination non-integer data has been entered in the wrong field on some source records.

hornc commented 3 years ago

@mekarpeles I see there has been a change to the bots repo code from what I wrote originally in the parse-biblio.py script:

Original: image

Current:

https://github.com/internetarchive/openlibrary-bots/blob/de10f621fd5ef49e9e85fcf6abe96e5bb9101984/BWBImportBot/parse-biblio.py#L75-L79

pagination is the correct OL destination field to use when copying that data. The committed version in openlibrary-bots looks like a WIP straddling python 2 and 3, so I'm not sure if that code is running anywhere, but if the same source has been copied and is running elsewhere, it is populating the wrong field.

https://github.com/internetarchive/openlibrary-bots/blob/master/BWBImportBot/import-ol.py#L41-L42 also looks like a work around attempt to use pagination for number_of_pages, which is also incorrect.

Source data description of that field confirming roman numerals are allowed: "Total Number of pages in format: Roman page count (if present), Arabic page count (i.e., xxiii, 531)"

The original script is Python3 and committed (non-public) on my Archive gitlab. The first commit in openlibrary-bots includes modifications I didn't make.

cclauss commented 3 years ago

@hornc Given your understanding of this code, could you please create a pull request that fixes this issue?