internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Add Open Library record for every work/edition in Internet Archive 2020 Wishlist #869

Closed mekarpeles closed 5 years ago

mekarpeles commented 6 years ago

There are 2.6M items on the Internet Archive Openlibraries Wishlist. We want to make sure each of these books has a corresponding catalog entry in openlibrary.org.

Steps

See: https://archive.org/details/open_libraries_wish_list https://archive.org/download/open_libraries_wish_list/wish_list_isbn13_ver_1.csv.zip

@thisismattmiller -- can you get a master CSV that collates isbn10, isbn13, and oclc into one row?

We want to add all of these records into Open Library so we can query by any of these identifier fields and retrieve the metadata for the book.

Also, do we have metadata on these books?
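For anyone wanting to inspect the published list, here is a minimal sketch of reading the ISBN-13 CSV straight out of the downloaded zip. The column layout of the real file is an assumption here (the sketch just keeps anything that looks like an ISBN-13 in the first column), and `ZIP_PATH` is a hypothetical local filename:

```python
import csv
import io
import zipfile

# Hypothetical local copy; the real file is published at
# https://archive.org/download/open_libraries_wish_list/wish_list_isbn13_ver_1.csv.zip
ZIP_PATH = "wish_list_isbn13_ver_1.csv.zip"

def iter_isbns(zip_path):
    """Yield ISBN-13 strings from the first CSV inside the zip.

    Assumes the first column of each row holds the ISBN-13; adjust
    if the actual column layout differs. Header rows and malformed
    values are skipped by the 13-digit check.
    """
    with zipfile.ZipFile(zip_path) as zf:
        csv_name = zf.namelist()[0]
        with zf.open(csv_name) as raw:
            reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-8"))
            for row in reader:
                if row and row[0].isdigit() and len(row[0]) == 13:
                    yield row[0]
```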

mekarpeles commented 6 years ago

We may want to add all of these to a subject a la https://openlibrary.org/subjects/internet_archive_wishlist

thisismattmiller commented 6 years ago

All data for the wish list is found in the SQLite database in my home directory (/home/mattmiller/sqlite_database). There are example scripts in there for serialization. Field information is found in this Google Sheet: https://docs.google.com/spreadsheets/d/1GDATWbgncmQzDaTJVdJU1kVcRhJIuMs0zHUnoCITED0/edit?usp=sharing

The basic SQL query to get everything that will appear on the wish list is:

    SELECT * FROM data
    WHERE flagged_author == 0 AND flagged_publisher == 0
      AND no_author == 0 AND no_publisher == 0
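As a sketch, that filter can be run from Python with the standard-library `sqlite3` module. The exact `.db` filename inside /home/mattmiller/sqlite_database is an assumption here:

```python
import sqlite3

# The directory is /home/mattmiller/sqlite_database; the exact
# database filename inside it is an assumption.
DB_PATH = "/home/mattmiller/sqlite_database/wish_list.db"

# The filter described above: keep only rows that passed the
# author/publisher sanity checks.
QUERY = """
    SELECT * FROM data
    WHERE flagged_author = 0
      AND flagged_publisher = 0
      AND no_author = 0
      AND no_publisher = 0
"""

def fetch_wish_list(db_path=DB_PATH):
    """Return all rows that would appear on the wish list."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # access columns by name
    try:
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()
```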

One thing to know is that the wish list tried to be as inclusive as possible, so it used the field classify_related_print_editions to add additional ISBNs (often of non-English editions) to the final list. These ISBNs do not have their own row in the DB but will appear in the wish list.

Basic JSON serialization of the wish list can also be found in /home/mattmiller/wish_list

tfmorris commented 6 years ago

> We may want to add all of these to a subject a la https://openlibrary.org/subjects/internet_archive_wishlist

Noooo. No more fake subjects! They're worse than fake news. "In my little sparkly library" is not a subject. A subject is something that a book is about like "French politics" or "Himalayan mountains."

tfmorris commented 6 years ago

What is the basis for these 2.6 million wishes? I.e., where did they come from?

mekarpeles commented 6 years ago

@tfmorris -- got it, no subject :)

We've been using a subject instead of a list because our lists are unscalable/broken/challenging to fix: a new db entry is added every time a seed is added to a list, causing huge db bloat. We've been experimenting w/ a new lists db, not backed by infogami, which is being used to power our "Want to Read", "Already Read", etc. Reading Log feature.

The list comes from here: https://blog.archive.org/2018/03/14/lets-build-a-great-digital-library-together-starting-with-a-wishlist/

mekarpeles commented 6 years ago

> One thing to know is that the wish list tried to be as inclusive as possible, so it used the field classify_related_print_editions to add additional ISBNs (often of non-English editions) to the final list. These ISBNs do not have their own row in the DB but will appear in the wish list.

@thisismattmiller, is this also the case (i.e. isbn synonyms are included) in the wish_list_march_2018.ndjson?

If so, would you be able to generate a copy of the JSON that does not have synonyms? In order to import into Open Library, we'll need the exact book metadata, ISBN, etc.

thisismattmiller commented 6 years ago

Sorry I missed this notification.

Yes, it's possible to exclude the synonyms. In /home/mattmiller/sqlite_database/scripts/serialize_basic.py, starting on line 200, there is:

            # see if we have other print versions available to add to this record
            if row['classify_related_print_editions'] is not None:
                row['classify_related_print_editions'] = json.loads(row['classify_related_print_editions'])
                for e in row['classify_related_print_editions']:
                    for isbn in e['isbn']:
                        if isbn not in added_isbn:
                            if int(isbn) in have_lookup:
                                skiped_isbns+=1
                            else:
                                # overwrite the obj and add this new one in
                                obj['isbn13'] = isbn
                                obj['isbn10'] = to_isbn10(isbn)
                                obj['oclc'] = e['oclc']
                                obj['language'] = e['language']
                                added_isbn[isbn] = True
                                # write it out
                                out_json.write(json.dumps(obj)+'\n')
                                out_csv_writer.writerow([obj['isbn13'],obj['isbn10'],obj['oclc'],obj['language'],obj['title'],obj['date']," | ".join(obj['author'])])
                        else:
                            # print('already added',isbn)
                            pass

        else:

            # this was not a print version, but there might be related print versions we have collected
            # see if we have other print versions available to add to this record
            if row['classify_related_print_editions'] is not None:
                row['classify_related_print_editions'] = json.loads(row['classify_related_print_editions'])
                for e in row['classify_related_print_editions']:
                    for isbn in e['isbn']:
                        if isbn not in added_isbn:

                            if int(isbn) in have_lookup:
                                skiped_isbns+=1
                            else:
                                # overwrite the obj and add this new one in
                                obj['isbn13'] = isbn
                                obj['isbn10'] = to_isbn10(isbn)
                                obj['oclc'] = e['oclc']
                                obj['language'] = e['language']
                                added_isbn[isbn] = True
                                # write it out
                                out_json.write(json.dumps(obj)+'\n')
                                out_csv_writer.writerow([obj['isbn13'],obj['isbn10'],obj['oclc'],obj['language'],obj['title'],obj['date']," | ".join(obj['author'])])

                        else:
                            # print('already added',isbn)
                            pass

Just comment that block out and re-run the script, and it will not populate the classify-related titles (aka ISBN synonyms). -Matt
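The quoted script also calls a `to_isbn10` helper that isn't shown in the excerpt. This is not necessarily Matt's implementation, but a conversion like this is typically done by dropping the 978 prefix and recomputing the ISBN-10 check digit:

```python
def to_isbn10(isbn13):
    """Convert a 978-prefixed ISBN-13 string to its ISBN-10 form.

    Returns None for anything that is not a 13-digit, 978-prefixed
    ISBN (979-prefixed ISBNs have no ISBN-10 equivalent).
    """
    isbn13 = str(isbn13)
    if len(isbn13) != 13 or not isbn13.startswith("978"):
        return None
    body = isbn13[3:12]  # the nine significant digits
    # ISBN-10 check digit: weighted sum of the nine digits mod 11,
    # with 10 written as 'X'.
    check = sum((i + 1) * int(d) for i, d in enumerate(body)) % 11
    return body + ("X" if check == 10 else str(check))
```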

sbshah97 commented 6 years ago

@thisismattmiller what would be an easy way to separate out Editions from Works on the Wishlist, so that we include just a single Work from multiple Editions?

thisismattmiller commented 6 years ago

In the SQLite DB you would need to limit it to rows where has_classify = 1, and then collapse together rows that have the same classify_work_id value.

The classify_work_id is a work identifier, so if two editions have the same id it is the same work.
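A minimal sketch of that advice, assuming the same `data` table described earlier (SQLite lets a bare `GROUP BY` pick an arbitrary representative row per group, which is all we need here):

```python
import sqlite3

def one_edition_per_work(db_path):
    """Return one representative edition per classify_work_id.

    Assumes a `data` table with `has_classify` and `classify_work_id`
    columns, as described in the thread. SQLite's bare GROUP BY picks
    an arbitrary row from each group of editions.
    """
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    try:
        return conn.execute(
            """
            SELECT * FROM data
            WHERE has_classify = 1
            GROUP BY classify_work_id
            """
        ).fetchall()
    finally:
        conn.close()
```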

mekarpeles commented 6 years ago

Keeping this open until we've had a chance to process the remaining Wishlist works (and add them to OL).

tfmorris commented 6 years ago

The more I look at this list, the more suspicious of it I become. I was going to say it was English and North American biased, but I haven't found any evidence that it includes anything except English. At Open Library we've been pushing for increased diversity, so it'd be sad to see Internet Archive's influence reverse that. Diversity is important.

There are a bunch of references to a private database which apparently contains more information than the public CSVs. Is there a reason for this lack of transparency? Can we get a dump of all the metadata available?

In looking at https://archive.org/download/open_libraries_wish_list/wish_list_isbn13_ver_3_provenance.tsv.zip, which seems to contain the most information available, the ISBNs from Library Link seem to be completely disjoint from all the other sources, which is exceedingly odd. Is it accurate?

I have many more questions and concerns, but with access to the metadata I could probably answer the questions myself (and figure out if the concerns are justified).

mekarpeles commented 6 years ago

> At OpenLibrary we've been pushing for increased diversity

Agreed -- p.s. one experiment I'm trying to push is on-the-fly per-page translation of each of our books using a translation API.

We have several different projects in the mix which are not mutually exclusive w/ this Open Libraries Wishlist. This wishlist is for a very specific program (openlibraries.online).

We are independently funding large e.g. LGBT collections, Indian collections, etc. These just happen to be separate initiatives which are independently funded.

I think we're unlikely to see a huge multilingual emphasis in the openlibraries-specific push. I think IA is focusing this initiative on books a lot of libraries have, as a way to potentially help them migrate online. It would be great if we could build a defensible library system any library could contribute to, a la the Open Content Alliance back in 2008, circa OL's inception: http://www.infotoday.com/searcher/jan08/Ashmore_Grogg.shtml

tfmorris commented 5 years ago

OK, so it's US English only and that's not going to change. Can we at least get access to the metadata so we can judge its quality?

tfmorris commented 5 years ago

I propose that Open Library defer this task until Internet Archive is more transparent about this list. I heard a (hyperlocal) podcast just the other day touting how Open Library could benefit low-income, marginalized people in Africa, Asia, etc. An American community library "wish" list isn't at all relevant to them.

hornc commented 5 years ago

Some stats. Out of 2000 randomly sampled English books on the Wishlist:

- 239 NOT FOUND (12%) -- not found on OL or elsewhere
- 203 CREATED (10%) -- record created from new metadata
- 1558 FOUND (78%) -- OL already had a record for the ISBN

Out of 2000 randomly sampled non-English ISBNs on the Wishlist:

- 822 NOT FOUND (40%) -- not found on OL or elsewhere
- 289 CREATED (15%) -- record created from new metadata
- 888 FOUND (45%) -- OL already had a record for the ISBN

75% of the ~1.5M ISBNs are English; 12% are German, which seems to be the next biggest category. I'm planning on creating a tool to analyse ISBN lists by allocated agency (which in most cases equates to country), so we can get a better indication of book sources from any large list of ISBNs. These are some other counts, sampling various countries: fra=38516, ger=184027, jap=6922, former-ussr=15114, ind=6725, ita=26585, mex=3352.
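The agency analysis described above can be sketched by matching the ISBN-13 registration-group prefix (the digits after 978). The mapping below is a tiny illustrative subset using the labels from the counts in this comment; a real tool would use the ISBN International Agency's full ranges file:

```python
# A small, illustrative subset of ISBN registration-group prefixes
# (the digits following 978). Longer prefixes must win over shorter
# ones, e.g. "88" (Italy) over a hypothetical "8".
GROUPS = [
    ("0", "eng"), ("1", "eng"),
    ("2", "fra"), ("3", "ger"),
    ("4", "jap"), ("5", "former-ussr"),
    ("88", "ita"),
    ("81", "ind"), ("93", "ind"),
    ("968", "mex"), ("970", "mex"), ("607", "mex"),
]

def isbn_group(isbn13):
    """Best-effort registration-group label for a 978-prefixed ISBN-13."""
    body = str(isbn13)[3:]
    # Try the longest prefixes first so "88" matches before "8" would.
    for prefix, label in sorted(GROUPS, key=lambda g: -len(g[0])):
        if body.startswith(prefix):
            return label
    return "other"
```

With a full prefix table, tallying `isbn_group` over a wish-list CSV would give per-country counts like those quoted above.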

Importing the Non-English ISBNs into OL seems to give a slightly higher new item rate. The most obvious source of non-English items on the wishlist is probably the international Wikipedia citation lists that were used (items with more than one citation). The bulk of the wishlist is already on OL, so this task is about filling in the gaps, and the biggest gap is in the non-English books.

If there are good quality sources of diverse bulk ISBNs (or full metadata!) we can import, point me to them and I can run the imports in parallel.

mekarpeles commented 5 years ago

I think @hornc finished this!