internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Non-book retail display items are being imported #9768

Closed scottbarnes closed 3 days ago

scottbarnes commented 3 weeks ago

Problem

Sometimes records with the requisite fields (title, authors, publishers, publish_date, and source_records) are nonetheless not books, and the title contains strong evidence of this. Consider the following recent imports:

All meet the minimum criteria for import, and all are not books. "Bin", "dumpbin", "x copy", and "poster" are all terms for non-book display items.

Reproducing the bug

No response

Context

No response

Notes from this Issue's Lead

Proposal & constraints

It might be possible to use a regex, or some other match against the end of the title field, to see if it ends in " dumpbin", " bin", or " poster" (note the leading space), but we'd want to make sure we don't block legitimate books (false positives).
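A minimal sketch of that title check, assuming a case-insensitive match and the three terms named above. The names `NONBOOK_SUFFIXES`, `NONBOOK_TITLE_RE`, and `looks_like_nonbook` are hypothetical and do not exist in the Open Library code base.

```python
import re

# Assumed term list, taken from the proposal above; extend as needed.
NONBOOK_SUFFIXES = ("dumpbin", "bin", "poster")

# Require whitespace before the term so that titles such as "Robin" or
# "The Cabin" do not match " bin". Trailing whitespace is tolerated.
NONBOOK_TITLE_RE = re.compile(
    r"\s(?:" + "|".join(map(re.escape, NONBOOK_SUFFIXES)) + r")\s*$",
    re.IGNORECASE,
)


def looks_like_nonbook(title: str) -> bool:
    """Return True if the title ends in one of the non-book display terms."""
    return bool(NONBOOK_TITLE_RE.search(title))
```

Note that a real book whose title happens to end in one of these words (for example "The Recycling Bin") would also be flagged, which is exactly the false-positive risk to measure before adopting this.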

Perhaps as a test it would be possible to parse the Works dump, available at https://openlibrary.org/developers/dumps, and do a basic analysis to see what would be blocked from import by this basic title search.
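One way to run that analysis is sketched below. It assumes the dump is a gzipped, tab-separated file whose fifth column is the record's JSON and whose second column is its key; check the dumps page for the current layout. The function names and the file path passed to `summarize` are hypothetical.

```python
import gzip
import json
import re
from collections import Counter
from typing import Iterable, Iterator, Tuple

# Same assumed terms as the proposal: a space, then the term, at the end of the title.
TITLE_RE = re.compile(r"\s(dumpbin|bin|poster)\s*$", re.IGNORECASE)


def flagged_titles(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (key, title, matched_term) for each dump line the title check would block."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 5:
            continue  # not a complete dump row
        try:
            record = json.loads(parts[4])
        except json.JSONDecodeError:
            continue  # skip rows whose JSON column does not parse
        if not isinstance(record, dict):
            continue
        title = record.get("title")
        if not isinstance(title, str):
            continue
        match = TITLE_RE.search(title)
        if match:
            yield parts[1], title, match.group(1).lower()


def summarize(path: str) -> Counter:
    """Count flagged titles per matched term in a gzipped dump file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return Counter(term for _, _, term in flagged_titles(fh))
```

Printing the flagged titles for manual review, rather than only counting them, would show how many legitimate books the check would have blocked.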

Directions for importing books locally can be found at https://github.com/internetarchive/openlibrary/wiki/Developer's-Guide-to-Data-Importing, but as a threshold matter it likely makes sense to test out the proposed solution using the data dump before implementing any solution in the Open Library code base.

Related files

Stakeholders

@seabelis



seabelis commented 3 weeks ago

"prepack" is another term to potentially block.

hornc commented 3 weeks ago

In the source data which is being imported there is some code to check the format:

https://github.com/internetarchive/openlibrary/blob/0293aca7ff55134807572a7e256987de537292be/scripts/partner_batch_imports.py#L112-L116

and

https://github.com/internetarchive/openlibrary/blob/0293aca7ff55134807572a7e256987de537292be/scripts/partner_batch_imports.py#L162-L164

Is there any way these records can be traced back to the source data to find the primary_format used? There may be some other codes that should be added to that non-book list.

mekarpeles commented 3 weeks ago

I think the solution may be extending the https://github.com/internetarchive/openlibrary/blob/master/scripts/partner_batch_imports.py#L237 quality checks for the partner imports, which run monthly around the 15th.

You should be able to find the data via archive.org items referenced in olsystem etl and ol-home0:/1/

hornc commented 1 week ago

From looking at one data file, there were 264 titles containing "Dumpbin". Most of them were marked as 'Trade Paper', and 3 were marked as e-books, so that kills my idea of having a format accept-list rather than the current NONBOOK reject list.

The data is not accurately annotated to be clear about what kind of item is being imported, which is disappointing.
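Since the format field alone can't be trusted, one option is to keep the format reject list and add a title check next to it. A rough sketch, assuming each record exposes a format value and a title; the codes in `NONBOOK_FORMATS` are placeholders for illustration, not the real NONBOOK list, and `should_skip` is a hypothetical name.

```python
import re

# Placeholder codes for illustration only; the real reject list lives in
# scripts/partner_batch_imports.py and is not reproduced here.
NONBOOK_FORMATS = {"XX", "YY"}

# Assumed title terms from this thread, matched at the end of the title.
NONBOOK_TITLE_RE = re.compile(r"\s(?:dumpbin|bin|poster|prepack)\s*$", re.IGNORECASE)


def should_skip(primary_format: str, title: str) -> bool:
    """Skip a record if either its format or its title marks it as a non-book."""
    if primary_format in NONBOOK_FORMATS:
        return True
    # A record labelled 'Trade Paper' can still be a display item,
    # so the title is checked regardless of the format.
    return bool(NONBOOK_TITLE_RE.search(title))
```

With this shape, a dumpbin labelled 'Trade Paper' is rejected by the title check, while an ordinary 'Trade Paper' title still passes.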