internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Import Wikisource trusted book provider data #9671

Open pidgezero-one opened 4 months ago

pidgezero-one commented 4 months ago

Problem

Follow-up to https://github.com/internetarchive/openlibrary/issues/8545. Currently there are 60 books in OL that have Wikisource IDs. IDs are formatted as langcode:title (e.g. en:George_Bernard_Shaw). Import Wikisource works into Open Library.

https://github.com/internetarchive/openlibrary/wiki/Developer's-Guide-to-Data-Importing

Proposal & Constraints

Hit English Wikisource's API and paginate through result sets of hits that fall under the "Validated texts" category: https://en.wikisource.org/w/api.php?action=query&generator=categorymembers&gcmtitle=Category:Validated_texts&gcmlimit=500&prop=categories|info|revisions&rvprop=content&rvslots=main&format=json&cllimit=max
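A minimal sketch of that pagination loop, assuming the standard MediaWiki behaviour where each response carries a `continue` object whose keys are merged into the next request. The names `build_url`, `iter_pages`, and `fetch_json` are hypothetical; `fetch_json` is injected so the loop can be exercised without network access. With several `prop` values, the same page may reappear across batches with additional data, and this sketch does not merge those.

```python
from typing import Callable, Dict, Iterator, Optional
from urllib.parse import urlencode

API_URL = "https://en.wikisource.org/w/api.php"

# Same parameters as the URL in the proposal above.
BASE_PARAMS = {
    "action": "query",
    "generator": "categorymembers",
    "gcmtitle": "Category:Validated_texts",
    "gcmlimit": "500",
    "prop": "categories|info|revisions",
    "rvprop": "content",
    "rvslots": "main",
    "format": "json",
    "cllimit": "max",
}


def build_url(continuation: Optional[Dict[str, str]] = None) -> str:
    """Build the request URL, merging in continuation keys if present."""
    params = {**BASE_PARAMS, **(continuation or {})}
    return f"{API_URL}?{urlencode(params)}"


def iter_pages(fetch_json: Callable[[str], dict]) -> Iterator[dict]:
    """Yield page dicts from every batch until no `continue` is returned."""
    continuation = None
    while True:
        data = fetch_json(build_url(continuation))
        yield from data.get("query", {}).get("pages", {}).values()
        continuation = data.get("continue")
        if not continuation:
            break
```

In a real script, `fetch_json` could be something like `lambda url: requests.get(url).json()`; that choice is an assumption, not part of the proposal.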

The response includes documents that aren't books, and books are not flagged with a distinct category. We may also have to browse Wikisource's API to manually draft a list of categories whose members we should ignore, such as Subpages (individual chapters of books), Posters, Songs, etc.
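The ignore list could be applied roughly like this. It is a sketch only: the category titles in `IGNORED_CATEGORIES` are placeholders based on the examples above and would need to be checked against the real category names on Wikisource, and `is_candidate_book` is a hypothetical helper. It relies on `prop=categories` returning a `categories` list of objects with a `title` key for each page.

```python
from typing import FrozenSet, Set

# Placeholder titles; the real list would be drafted by hand.
IGNORED_CATEGORIES = frozenset(
    {"Category:Subpages", "Category:Posters", "Category:Songs"}
)


def page_categories(page: dict) -> Set[str]:
    """Collect the category titles attached to one page of the API response."""
    return {cat["title"] for cat in page.get("categories", [])}


def is_candidate_book(
    page: dict, ignored: FrozenSet[str] = IGNORED_CATEGORIES
) -> bool:
    """Keep a page only if it belongs to none of the ignored categories."""
    return page_categories(page).isdisjoint(ignored)
```

Keeping the ignore list as data rather than hard-coding it in the filter should make it easier to maintain separate lists per language later.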

The response includes most of the work's metadata (author, year, etc) as wiki markup of the page's infobox. Consider using a library like wptools to parse it.
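For illustration, here is a deliberately naive standard-library sketch of pulling fields out of that markup, in place of wptools. It assumes the wikitext starts a `{{header` template on its own line with one `| key = value` pair per line, and that the revision text sits under `revisions[0].slots.main["*"]` in the JSON response. Both helper names are hypothetical, and nested templates or multi-line values are not handled, which is a good reason to prefer a real parser in the actual import script.

```python
from typing import Dict


def extract_wikitext(page: dict) -> str:
    """Return the main-slot wikitext of the first revision of a page."""
    return page["revisions"][0]["slots"]["main"]["*"]


def parse_header_template(wikitext: str, template: str = "header") -> Dict[str, str]:
    """Read `| key = value` lines from the first `{{header` block (naive)."""
    fields: Dict[str, str] = {}
    inside = False
    for raw_line in wikitext.splitlines():
        line = raw_line.strip()
        if not inside:
            # Look for the line that opens the template.
            inside = line.lower().startswith("{{" + template)
            continue
        if line.startswith("}}"):
            break  # end of the template
        if line.startswith("|") and "=" in line:
            key, _, value = line[1:].partition("=")
            fields[key.strip()] = value.strip()
    return fields
```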

In the future, we will want to expand this import to support other languages besides en.wikisource.org, and perhaps expand beyond the Validated texts category, so the solution to this should be extensible.

A potential downside to how we derive Wikisource IDs is that we use the page title, which is modifiable, instead of the page's canonical ID. That leaves us at the mercy of Wikisource's works being moved or renamed. This will likely be rare, if it happens at all, but if we ever decide we want to use canonical IDs instead, any Wikisource page can be fetched by its curid. Example: https://en.wikisource.org/?curid=4496925 and https://en.wikisource.org/wiki/%22Red%22_Fed._Memoirs are the same page. (This is also an example of a page whose title may need to be URL-encoded in outbound links.)
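A small sketch of both link styles, assuming IDs keep the langcode:title format and that each language's wiki lives at `<lang>.wikisource.org`. The function names are hypothetical; `urllib.parse.quote` does the URL-encoding.

```python
from typing import Tuple
from urllib.parse import quote


def parse_wikisource_id(ws_id: str) -> Tuple[str, str]:
    """Split 'en:Some_Title' into ('en', 'Some_Title') at the first colon."""
    lang, sep, title = ws_id.partition(":")
    if not (sep and lang and title):
        raise ValueError(f"not a langcode:title identifier: {ws_id!r}")
    return lang, title


def title_url(ws_id: str) -> str:
    """Outbound link built from the (modifiable) page title."""
    lang, title = parse_wikisource_id(ws_id)
    return f"https://{lang}.wikisource.org/wiki/{quote(title.replace(' ', '_'))}"


def curid_url(lang: str, page_id: int) -> str:
    """Outbound link built from the numeric page ID."""
    return f"https://{lang}.wikisource.org/?curid={page_id}"
```

For the example above, `title_url('en:"Red"_Fed._Memoirs')` and `curid_url("en", 4496925)` produce the two URLs quoted in the paragraph.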

Leads

Stakeholders

@cdrini @pidgezero-one


hornc commented 4 days ago

I have a few questions about this feature,

Knowing whether the main value of this feature is to:

would possibly help focus effort.

Some Wikisource texts appear to come from Project Gutenberg texts, and that makes me worry about the lack-of-provenance issues such PD texts might have. I'm not 100% sure how we handle Project Gutenberg texts on OL: are they their own editions, and do they change over time? That's probably a different topic though.

pidgezero-one commented 4 days ago
  • it's not completely clear to me whether the en:George_Bernard_Shaw id is really a portable identifier or really a URL equivalent (wiki + page title, which can change), or how it can be used to compare with other data sources that might list a 'Wikisource identifier'. The numeric ids look more like identifiers, but they are also specific to each language's Wikisource, so there really isn't a single 'Wikisource identifier': 112842 is the 'George Bernard Shaw' book on en-wikisource, but it's something completely different on Ukrainian Wikisource.

I don't love the lang:title identifier format, personally. In the script in my open PR, I originally tried to use the numeric ID like the one you've identified. I stuck with lang:title here for two reasons. The minor one is that I couldn't get the numeric identifier to resolve to the outbound links in the download options section for Wikisource books. The major one is that it's already the identifier format that the small selection of existing Wikisource books in OL use (same example).

Determining what is a 'book' on Wikisource does seem complicated, and it's not stated clearly. Pages on Wikisource appear to represent 'Works', but are generally expected to have a source published Edition -- I don't know if the edition can be changed in principle? I think that means Wikisource is not a publisher, so Wikisource will not be the only source for these books.

Wikisource/Wikidata not explicitly differentiating what counts as a "book" has been a real thorn in my side. For example, I don't know if this counts as a book for the purposes of OL, but its Wikidata properties don't really distinguish it from what would be considered a book.

get more books into Open Library that OL does not have

This was my understanding of the main purpose here when Drini and I were first discussing the project. I've been writing the import record script with the understanding that we'd like to import items from more Wikisource language bases than just English in the future.

tfmorris commented 3 days ago

I share @hornc's concerns and would like to see this much more tightly specified.

"Import Wikisource works into Open Library." is a pretty terse description of a request that could take a variety of different forms.

Wikisource is mostly made up of transcriptions of specific editions (not works), although, as @hornc points out, PG editions are a bit of a wild card because they are editions without any provenance information which are intentionally unassociated with existing editions.

Is the intention to create new digital editions for the transcriptions which are derived from the original edition? Or is the intention just to make the transcription some type of digital proxy for the original edition? Wikisource, as with most things wiki*, seems a bit ambiguous, but seems to lean towards the latter model (i.e. they include a link to the Wikidata entity for the transcribed edition, but don't model the transcription separately).

Complicating this is the fact that Wikidata is generally poor at modeling book metadata. It's not a huge deal because it doesn't have much in it, but some of the logical conflicts you'll see include:

Using or linking to one of these conflated entities extends the mess because the new connection usually requires (or implies) either an edition or a work, but not both.

I would suggest that Wikisource transcriptions should actually be modeled independently from the editions that they transcribe, but that would require the buy-in/support of both the Wikisource and Wikidata communities. Certainly, if OL considers exact digital facsimiles from CreateSpace, etc., to be separate editions, then a transcription would definitely be considered a separate edition (but Open Library's data model isn't rich enough to connect the two derived editions together, as far as I know).

Has anyone looked at how many of the transcribed editions are NOT already in Open Library? My assumption is that the vast majority of them are, so perhaps focusing on @hornc's suggestion of closing the loop on IA/OL editions would be a good place to start.

For example, I don't know if this counts as a book for the purposes of OL, but its Wikidata properties don't really distinguish it apart from what would be considered a book.

I would consider it a transcribed derivative of OL23268596M / ia:addresstomaryade00scot which was authored by Q16944048 (no associated OL ID in Wikidata, but appears to be OL6627737A). Given that OL & IA each have (separate) catalog records with the metadata and IA has scanned page images as well as OCR'd text, what is expected to be derived from Wikisource? Just a link or an alternative text version or some set of metadata or ... ? It might be tempting to infer equivalence of author IDs, but that seems risky absent evidence other than co-occurrence.