internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Clean up records with future publication dates #4568

Open SaravgiYash opened 3 years ago

SaravgiYash commented 3 years ago

Evidence / Screenshot (if possible)

Many works have the wrong year of publication (e.g. 9999, 2049, 2040).

See: https://openlibrary.org/search?q=publish_year%3A%5B2025+TO+*%5D

Relevant url?

https://openlibrary.org/search?q=mark&mode=everything&sort=new
https://openlibrary.org/works/OL21132031W/Classical_Music_Picture_Book?edition=
https://openlibrary.org/works/OL21486637W/Making_Sense_of_Politics?edition=

Proposal

Use first_publish_year:[2025 TO *] in Solr, e.g. https://openlibrary.org/search.json?q=first_publish_year%3A%5B2025+TO+*%5D, to find future dates.
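
For anyone scripting this, here is a minimal sketch of building that search URL in Python. The function name `future_dates_search_url` and the `start_year` parameter are made up for illustration; only the endpoint and the `first_publish_year` range query come from the proposal above.

```python
from urllib.parse import urlencode

SEARCH_JSON = "https://openlibrary.org/search.json"


def future_dates_search_url(start_year=2025):
    """Build a search.json URL for works first published in start_year or later.

    Hypothetical helper: the name and parameter are illustrative only.
    """
    # Solr range syntax: [X TO *] matches every value from X upward, inclusive.
    params = {"q": f"first_publish_year:[{start_year} TO *]"}
    # safe="*" keeps the wildcard unescaped, matching the URL in the proposal.
    return f"{SEARCH_JSON}?{urlencode(params, safe='*')}"


print(future_dates_search_url())
```

With the default argument this reproduces the example URL from the proposal; fetching and paging through the results is left out.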

Bhavna777 commented 3 years ago

I would like to work on this issue.

Bhavna777 commented 3 years ago

Can I start?

SaravgiYash commented 3 years ago

> I would like to work on this issue.

Okay, but it would be better to take one issue at a time; once you have filed a PR for your current issue, you can start working on this one.

Bhavna777 commented 3 years ago

OK, thank you.

Bhavna777 commented 3 years ago

@Yashs911 Can you please help me solve this issue? I'm actually new to Open Library, which is why I could not find the file I should change 😁

SaravgiYash commented 3 years ago

@Bhavna777 Actually, I don't know the root cause, so I don't know where we should start. Based on https://github.com/internetarchive/openlibrary-librarians/issues/1 and some other issues linked to this one, I suggest that we hide publication years >= 2021 for the time being.

seabelis commented 3 years ago

Added to librarians repo for manual correction. https://github.com/internetarchive/openlibrary-librarians/issues/53

SaravgiYash commented 3 years ago

@seabelis Actually, this issue is not limited to https://openlibrary.org/search?q=mark&mode=everything&sort=new; many books on OL have the wrong publication year, so I was wondering whether it would be possible to hide publication years > 2021.

Bhavna777 commented 3 years ago

> @seabelis Actually, this issue is not limited to https://openlibrary.org/search?q=mark&mode=everything&sort=new; many books on OL have the wrong publication year, so I was wondering whether it would be possible to hide publication years > 2021.

But that will create a problem in the coming years.

SaravgiYash commented 3 years ago

> But that will create a problem in the coming years.

By 2021 I meant that we can use a current-year function.
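
A rough sketch of what such a current-year check could look like in Python. Both function names are hypothetical and not taken from the Open Library codebase; the `today` parameter exists only so the behaviour can be checked against a fixed date.

```python
from datetime import date


def is_future_year(publish_year, today=None):
    """Return True when publish_year is later than the current calendar year."""
    today = today or date.today()
    return publish_year > today.year


def future_year_query(today=None):
    """Solr range query for first_publish_year values after the current year."""
    today = today or date.today()
    # [X TO *] is inclusive, so start from next year to exclude the current one.
    return f"first_publish_year:[{today.year + 1} TO *]"


print(is_future_year(9999), future_year_query())
```

Because the threshold is derived from the date at call time, the cutoff moves forward each year instead of staying fixed at 2021.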

seabelis commented 3 years ago

I'm not the person to decide, but I'd prefer to delete the incorrect data rather than hide it.

mekarpeles commented 11 months ago

@scottbarnes can you confirm whether this can be closed now re: 9999?

mekarpeles commented 11 months ago

I'm re-purposing this issue to clean up works that have future dates.

https://openlibrary.org/query.json?type=/type/edition&publish_date~=9999*&limit=1000

or first_publish_year:[2025 TO *] in Solr, e.g.:

https://openlibrary.org/search?q=first_publish_year%3A%5B2025+TO+*%5D&mode=everything&sort=new

scottbarnes commented 11 months ago

It may be helpful to keep a record of the items we modify this way, in case we later want to go back and, for example, reimport them or modify them further. That would make it easy to identify the ones from which we've removed publish_date.

cdrini commented 11 months ago

@hornc notes that he is planning to remove all the 9999 dates in a bulk process. I believe this would tackle the bulk of the problem; let us see...

cdrini commented 11 months ago

There are about 5,868 editions with publish year 9999, and another 15,707 with publish years after 2025 but not 9999. Flipping through them, it's unclear why exactly they have these weird dates and whether they should be deleted :confused: I think fixing the 9999 set is a good first stab. Would you be able to keep a list of the editions your script edits, and upload it to the issue? We might want to do further investigation on these editions later, and having a way to find them would be useful!
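
One possible shape for that record keeping, as a sketch only. It assumes editions are plain dicts with `key` and `publish_date` fields; the helper names and the CSV layout are invented here and do not describe the actual cleanup script.

```python
import csv
import io


def strip_9999_date(edition, log):
    """Return a copy of `edition` without a 9999 publish_date, logging the removal.

    Hypothetical helper: editions whose publish_date does not start with
    "9999" are returned unchanged and nothing is logged.
    """
    publish_date = edition.get("publish_date", "")
    if not publish_date.startswith("9999"):
        return edition
    cleaned = {k: v for k, v in edition.items() if k != "publish_date"}
    # Keep the key and the removed value so the edition can be found again later.
    log.append({"key": edition["key"], "removed_publish_date": publish_date})
    return cleaned


def log_as_csv(log):
    """Serialise the log to CSV text that could be attached to the issue."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["key", "removed_publish_date"])
    writer.writeheader()
    writer.writerows(log)
    return buffer.getvalue()


log = []
strip_9999_date({"key": "/books/OL45340001M", "publish_date": "9999"}, log)
print(log_as_csv(log))
```

Storing the removed value alongside the key means a later reimport or investigation does not have to dig through each edition's history.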

hornc commented 11 months ago

One cause of the 9999 problem relates to MARC imports and the existing issue #2711. I started cleanup and noticed that a number of 9999 dates originate from Harvard MARC records where the 9999 is in the 008 field, but there is (often) a correct publication date in 260$c:

https://openlibrary.org/books/OL45340001M/%CA%BBAlimi_aman_jo_Islami_manshur?m=history

and

https://openlibrary.org/show-records/harvard_bibliographic_metadata/ab.bib.12.20150123.full.mrc:583443956:436

I'll see if there is a way to easily add the correct dates as I go, and look at patching the MARC import hole.

See PR: #8448
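
To illustrate the fallback described above, here is a minimal sketch that prefers the 008 date unless it is the 9999 placeholder, and otherwise looks for a year in 260$c. It is not taken from PR #8448 or the Open Library importer: it works on plain strings, the function names are hypothetical, and the sample field values are invented.

```python
import re

PLACEHOLDER_YEARS = {"9999"}


def year_from_008(field_008):
    """Date 1 occupies character positions 07-10 of the MARC 008 field."""
    return field_008[7:11]


def year_from_260c(subfield_c):
    """Return the first four-digit run in a 260$c string such as 'c1987.'."""
    match = re.search(r"\d{4}", subfield_c or "")
    return match.group(0) if match else None


def pick_publish_year(field_008, subfield_260c):
    """Use the 008 year unless it is a placeholder or non-numeric."""
    year = year_from_008(field_008)
    if year in PLACEHOLDER_YEARS or not year.isdigit():
        return year_from_260c(subfield_260c)
    return year


# Invented sample: 008 carries 9999, while 260$c carries a usable year.
sample_008 = "150123s9999    pk ".ljust(40)
print(pick_publish_year(sample_008, "1987."))
```

When neither field yields a usable year the function returns `None`, so a caller could leave publish_date empty instead of storing 9999.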

hornc commented 10 months ago

@mekarpeles I believe all the 9999 dates have been removed from Open Library.

hornc commented 10 months ago

A lot of the remaining future dates are simply spam: e.g. https://openlibrary.org/search?q=first_publish_year%3A%5B2025+TO+*%5D+Customer+Service+number&mode=everything&sort=new

and

https://openlibrary.org/search?q=first_publish_year%3A%5B2025+TO+*%5D+bitcoin&mode=everything&sort=new

And there are other variations.