Closed scottbarnes closed 3 days ago
"prepack" is another term to potentially block.
In the source data which is being imported there is some code to check the format:
and
Is there any way to trace these records back to the source data and find the primary_format
used? There may be some other codes that should be added to that non-book list.
I think the solution may be to extend the quality checks at https://github.com/internetarchive/openlibrary/blob/master/scripts/partner_batch_imports.py#L237 for the partner imports, which run monthly around the 15th.
You should be able to find the data via archive.org items referenced in olsystem etl and ol-home0:/1/
From looking at one data file, there were 264 titles containing "Dumpbin". Most of them were marked as 'Trade Paper', and 3 were marked as e-books, so that kills my idea of having a format accept-list rather than the current NONBOOK reject list.
The data is not accurately annotated to be clear about what kind of item is being imported, which is disappointing.
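A tally like the one described above could be reproduced with a few lines of Python. This is only a sketch: it assumes the source records have already been parsed into (title, primary_format) pairs, and the helper name `format_counts_for_term` is hypothetical, not something in the Open Library code base.

```python
from collections import Counter
from typing import Iterable, Tuple


def format_counts_for_term(records: Iterable[Tuple[str, str]], term: str) -> Counter:
    """Count the format values of records whose title contains `term`.

    `records` is assumed to be an iterable of (title, primary_format) pairs;
    how those pairs are extracted from the partner data is out of scope here.
    """
    needle = term.casefold()
    # Case-insensitive substring match on the title, tallied by format.
    return Counter(fmt for title, fmt in records if needle in title.casefold())
```

If non-book titles show up under ordinary book formats, as reported above, the counts would make that visible per format code.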
Problem
Sometimes records with the requisite fields (title, authors, publishers, publish_date, and source_records) are nonetheless not books, and the title will contain strong evidence of this. Consider the following recent imports:
All meet the minimum criteria for import, yet none are books. "Bin", "dumpbin", "x copy", and "poster" are all terms for non-book display items.
Reproducing the bug
No response
Context
No response
Notes from this Issue's Lead
Proposal & constraints
It might be possible to use a regex or otherwise match the end of the title field to see if it ends in " dumpbin", " bin", or " poster" (note the leading space), but we'd want to ensure we don't block legitimate books (false positives).
Perhaps as a test it would be possible to parse the Works dump, available at https://openlibrary.org/developers/dumps, and do a basic analysis to see what would be blocked from import by this basic title search.
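As a rough illustration of the title-suffix match proposed above, here is a minimal sketch. The pattern name `NON_BOOK_TITLE_SUFFIX` and the function `looks_like_non_book` are hypothetical, and the term list only covers the three suffixes named in this issue.

```python
import re

# Hypothetical pattern: matches titles ending in " dumpbin", " bin", or " poster".
# The leading \s requires whitespace before the term, so words that merely
# end in "bin" (for example "Robin" or "Cabin") are not matched.
NON_BOOK_TITLE_SUFFIX = re.compile(r"\s(?:dumpbin|bin|poster)\s*$", re.IGNORECASE)


def looks_like_non_book(title: str) -> bool:
    """Return True if the title ends with one of the non-book terms."""
    return bool(NON_BOOK_TITLE_SUFFIX.search(title))
```

Note that a genuine book whose title happens to end in " Poster" or " Bin" would also match, which is exactly the false-positive risk that needs checking against real data first.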
Directions for importing books locally can be found at https://github.com/internetarchive/openlibrary/wiki/Developer's-Guide-to-Data-Importing, but as a threshold matter it likely makes sense to test out the proposed solution using the data dump before implementing any solution in the Open Library code base.
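A dump-based dry run could look something like the sketch below. It assumes the Works dump is a gzip-compressed, tab-separated file with five columns (type, key, revision, last_modified, JSON) and that each JSON record carries a `title` field; the function names and the regex are illustrative and not part of the Open Library code base.

```python
import gzip
import json
import re
from typing import Iterable, Iterator, List, Tuple

# Same hypothetical suffix pattern as proposed in this issue.
SUFFIX = re.compile(r"\s(?:dumpbin|bin|poster)\s*$", re.IGNORECASE)


def flagged_titles(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (key, title) for dump lines whose title matches SUFFIX."""
    for line in lines:
        # Assumed column layout: type, key, revision, last_modified, JSON.
        parts = line.rstrip("\n").split("\t", 4)
        if len(parts) != 5:
            continue  # skip lines that do not fit the assumed layout
        try:
            record = json.loads(parts[4])
        except json.JSONDecodeError:
            continue  # skip rows whose JSON column cannot be parsed
        title = record.get("title") if isinstance(record, dict) else None
        if isinstance(title, str) and SUFFIX.search(title):
            yield parts[1], title


def scan_dump(path: str) -> List[Tuple[str, str]]:
    """Collect every flagged (key, title) pair from a gzip-compressed dump."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return list(flagged_titles(handle))
```

Reviewing the flagged list by hand would show how many legitimate works the title search would wrongly block before anything is changed in the import pipeline.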
Related files
Stakeholders
@seabelis
Instructions for Contributors