openzim / cms

ZIM file Publishing Platform
https://cms.openzim.org
GNU General Public License v3.0

Add book-title reconciliation mechanism #25

Closed rgaudin closed 2 years ago

rgaudin commented 2 years ago

Required for #24, a mechanism to find the appropriate Title for a book's metadata, if there is such a title in the database.

rgaudin commented 2 years ago

First of all, book-title reconciliation should happen during book creation. If it fails (not likely), we'll end up with dangling books that will need to be assigned Titles manually.

Titles are identified by an ident property which we want to mimic Zimfarm's recipes.

Should we receive a valid (see format below) recipe property in the endpoint, we should use it. I will check with @kelson42 whether we want Zimfarm to submit its recipe name to the book-add endpoint.

The title/Zimfarm-recipe format looks like: {project}_{lang}[_{classifier}]
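To make the format concrete, here is a minimal sketch of a parser for it. The names `IDENT_RE`, `Ident` and `parse_ident` are hypothetical (not part of the CMS), and the sketch assumes that the project and lang parts never contain an underscore, so that anything after the second underscore is treated as the classifier.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern for {project}_{lang}[_{classifier}].
# Assumption: project and lang contain no underscore; the classifier,
# when present, is everything after the second underscore.
IDENT_RE = re.compile(
    r"(?P<project>[^_]+)_(?P<lang>[^_]+)(?:_(?P<classifier>.+))?"
)


class Ident(NamedTuple):
    project: str
    lang: str
    classifier: Optional[str]


def parse_ident(value: str) -> Optional[Ident]:
    """Split an ident into its parts, or return None if it does not fit."""
    match = IDENT_RE.fullmatch(value)
    if match is None:
        return None
    return Ident(match["project"], match["lang"], match["classifier"])
```

With this sketch, `parse_ident("wikipedia_fr_football")` yields project `wikipedia`, lang `fr` and classifier `football`, while a bare `wikipedia` is rejected.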

Sample existing, valid recipe names:


This format is close to the Name metadata (which is mandatory) found in Books. Reconciliation will thus be from Name metadata to Title.ident.

I don't think we want to fall back to using filenames but will check. Those usually look like {Name}_{period-year}-{period-month}.zim.
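Should the filename fallback turn out to be wanted after all, extracting the Name could look like the sketch below. `FILENAME_RE` and `name_from_filename` are hypothetical names, and the sketch assumes a four-digit year and a two-digit month, which the thread does not specify.

```python
import re
from typing import Optional

# Hypothetical pattern for {Name}_{period-year}-{period-month}.zim.
# Assumption: the period is a 4-digit year and a 2-digit month.
FILENAME_RE = re.compile(r"(?P<name>.+)_(?P<year>\d{4})-(?P<month>\d{2})\.zim")


def name_from_filename(filename: str) -> Optional[str]:
    """Return the Name part of a ZIM filename, or None if it does not fit."""
    match = FILENAME_RE.fullmatch(filename)
    return match["name"] if match else None
```

For an illustrative filename such as `wikipedia_fr_football_2021-11.zim`, this returns `wikipedia_fr_football`.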

When creating Titles (when we're confident there is none for this book), we'll probably reuse that name, but we need to ensure it conforms to the format above.

rgaudin commented 2 years ago

I think a good exercise for this tool would be to try and reconcile all the existing Zimfarm recipes' names (JSON here) with the ZIM catalog's books names (XML here):

curl https://api.farm.openzim.org/v1/schedules/backup/ | jq '.[].name'
curl -L http://download.kiwix.org/library/library_zim.xml | python3 -c 'import sys; from xml.dom.minidom import parseString; dom = parseString(sys.stdin.read()); [print(book.getAttribute("name")) for book in dom.getElementsByTagName("book")]'
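
As a rough sketch of that exercise, the two outputs can be compared as sets. The function names below are hypothetical. The input shapes are assumed from the commands above: a JSON array of objects with a `name` key, and an XML document containing `book` elements with a `name` attribute.

```python
import json
from typing import Dict, List, Set
from xml.dom.minidom import parseString


def recipe_names(schedules_json: str) -> Set[str]:
    """Names from the schedules JSON (assumed: a list of {"name": ...})."""
    return {entry["name"] for entry in json.loads(schedules_json)}


def book_names(library_xml: str) -> Set[str]:
    """Names from the library XML (<book name="..."/> elements)."""
    dom = parseString(library_xml)
    return {book.getAttribute("name") for book in dom.getElementsByTagName("book")}


def compare(recipes: Set[str], books: Set[str]) -> Dict[str, List[str]]:
    """Exact-name comparison only; no fuzzy matching is attempted here."""
    return {
        "matched": sorted(recipes & books),
        "recipes_only": sorted(recipes - books),
        "books_only": sorted(books - recipes),
    }
```

The strings returned by the two `curl` calls can be passed straight to `recipe_names` and `book_names`.
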
anshulxyz commented 2 years ago

@rgaudin I have some confusion.

1.

find the appropriate Title for a book's metadata

I assume this is different from the "Title" key in the Metadata, because that looks like "English Wikipedia", while what I see in your comments looks like the "Name" key in the Metadata, because that looks like wikipedia_fr_football. That's what the Zimfarm recipe looks like.

Titles are identified by an ident property which we want to mimic Zimfarm's recipes.

Again when I look at the Zimfarm recipes I see that it has "Name"

Can you tell me if I have understood correctly: when you say Title here, do you actually mean Name in Zimfarm's recipes?

Should we receive a valid

I know I can use the find_language_names to see if a valid language code was passed, except for mul.

Not sure how I'll test for valid project name and classifier, unless I have a list of allowed keywords/names.

rgaudin commented 2 years ago

find the appropriate Title for a book's metadata

I assume this is different from the "Title" key in the Metadata, because that looks like "English Wikipedia", while what I see in your comments looks like the "Name" key in the Metadata, because that looks like wikipedia_fr_football. That's what the Zimfarm recipe looks like.

Yes, the reconciliation is between a ZIM's Name metadata and a Title ident.

Titles are identified by an ident property which we want to mimic Zimfarm's recipes.

Again when I look at the Zimfarm recipes I see that it has "Name"

Can you tell me if I have understood correctly: when you say Title here, do you actually mean Name in Zimfarm's recipes?

It's important to understand that Title has two meanings. As noted in the Wiki, the Title model references an Entry in the CMS/Catalog/Library. Following the Library analogy, Barack Obama's A Promised Land is a Title for which different books exist: hardcover, paperback, audiobook, e-book. Each of these books has an Author, an Editor and a title.

In the CMS, we'll have Title instances represented by an ident like wikipedia_fr_all which would expose different Book instances (ZIM files) identified by a UUID.
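
The relationship can be pictured with the following sketch. These dataclasses are purely illustrative and are not the CMS's actual models; the field names other than `ident` are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List
from uuid import UUID, uuid4


@dataclass
class Book:
    """One ZIM file, identified by a UUID (illustrative model only)."""
    id: UUID
    metadata: Dict[str, str]  # ZIM metadata, e.g. the "Name" and "Title" keys


@dataclass
class Title:
    """A catalog entry identified by its ident, exposing several Books."""
    ident: str
    books: List[Book] = field(default_factory=list)


# One Title, to which a Book (ZIM file) is attached.
title = Title(ident="wikipedia_fr_all")
title.books.append(Book(id=uuid4(), metadata={"Name": "wikipedia_fr_all"}))
```

The point of the sketch is only that the ident lives on the Title, while each Book carries its own metadata, including the Name used for reconciliation.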

The Title ident that we need to build should resemble Zimfarm recipe (internally called schedule) names. Those are visible on the recipes page.


Should we receive a valid

I know I can use the find_language_names to see if a valid language code was passed, except for mul.

Correct

Not sure how I'll test for valid project name and classifier, unless I have a list of allowed keywords/names.

Those are flexible so there is no allowed list. We'll just use a slug-like regex like [a-z0-9_\-\.]+.
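
A minimal sketch of such a check, combining the slug-like regex with a language test. `is_valid_ident` and `KNOWN_LANGS` are hypothetical; the small language set is a placeholder standing in for whatever `find_language_names` provides, and `mul` is special-cased as discussed above.

```python
import re

# The slug-like regex suggested above, applied to the whole ident.
SLUG_RE = re.compile(r"[a-z0-9_\-\.]+")

# Placeholder only: stands in for a real language lookup.
KNOWN_LANGS = {"en", "fr", "nb"}


def is_valid_ident(ident: str, known_langs=KNOWN_LANGS) -> bool:
    """Check {project}_{lang}[_{classifier}] with a slug regex and a lang test."""
    if not SLUG_RE.fullmatch(ident):
        return False
    parts = ident.split("_", 2)  # project, lang, optional classifier
    if len(parts) < 2 or not all(parts):
        return False
    lang = parts[1]
    # "mul" (multiple languages) is accepted without a lookup.
    return lang == "mul" or lang in known_langs
```

Project and classifier are deliberately left unconstrained beyond the regex, since there is no allowed list for them.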

anshulxyz commented 2 years ago

@rgaudin

These are the patterns I have noticed

  1. Wikipedia/Wikimedia sources
     1.1 First we can check for an exact match: project-name_lang_classifier
     1.2 If no match, then we can check for project-name_lang or project-name_lang_all
     1.3 If no match, then we can check for project-name

  2. Khan Academy
     2.1 There was one recipe that didn't follow the hyphen-case name: khanacademy-fr-kolibri
     2.2 Khan Academy book names follow this pattern: khan-academy-videos_{lang}_playlist-<playlist-code>
         These book names should be matched with the recipe pattern khan-videos-{lang}_playlists

  3. Others

| book name | recipe name |
| --- | --- |
| ted_{lang}_playlist-{hyphenated-playlist-name} | ted_, ted_all_playlists |
| wikivoyage_{lang}_all | wikivoyage_{lang} |
| kiwix.{lang}.stackoverflow.com | {lang}.stackoverflow.com_{lang} |
| kiwix.{topic}.stackexchange.com | {topic}.stackexchange.com_{lang} |

anshulxyz commented 2 years ago

Some books didn't have a relevant recipe. For example, for wikipedia_nb_50000, the recipe wikipedia_nb was not in the list.
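
The fallback order noticed for Wikipedia/Wikimedia sources (exact match, then project-name_lang or project-name_lang_all, then project-name) could be sketched as below. `match_recipe` is a hypothetical name, this is not the actual matching script, and the Khan Academy and Stack Exchange special cases are not handled.

```python
from typing import Iterable, Optional


def match_recipe(book_name: str, recipes: Iterable[str]) -> Optional[str]:
    """Return the first recipe name matching the book Name, or None."""
    known = set(recipes)
    parts = book_name.split("_")
    project = parts[0]

    candidates = [book_name]  # 1.1 exact match
    if len(parts) >= 2:
        base = f"{project}_{parts[1]}"
        candidates += [base, f"{base}_all"]  # 1.2 project_lang, project_lang_all
    candidates.append(project)  # 1.3 project only

    for candidate in candidates:
        if candidate in known:
            return candidate
    return None
```

A book such as wikipedia_nb_50000 returns `None` when neither wikipedia_nb, wikipedia_nb_all nor wikipedia is among the recipes, which is the unmatched case described above.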

rgaudin commented 2 years ago

These are the patterns I have noticed

Thank you; could you share your matching script and a CSV of Name,recipeName for all ZIMs? It doesn't have to be clean or anything. It's just so I can run it and check them all myself.

We should probably also print the list of recipeNames with no match.
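
One possible shape for that output, as a sketch only: `write_report` and `unmatched_recipes` are hypothetical helpers, and `pairs` is assumed to map each book Name to its matched recipe name, or to `None` when nothing matched.

```python
import csv
from typing import Dict, Iterable, List, Optional, TextIO


def write_report(pairs: Dict[str, Optional[str]], out: TextIO) -> None:
    """Write Name,recipeName rows; unmatched books get an empty recipeName."""
    writer = csv.writer(out)
    writer.writerow(["Name", "recipeName"])
    for name, recipe in sorted(pairs.items()):
        writer.writerow([name, recipe or ""])


def unmatched_recipes(
    pairs: Dict[str, Optional[str]], recipes: Iterable[str]
) -> List[str]:
    """Recipe names that no book was matched to."""
    used = {recipe for recipe in pairs.values() if recipe}
    return sorted(set(recipes) - used)
```

Passing `sys.stdout` as `out` prints the CSV directly, and the list from `unmatched_recipes` can be printed after it.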

1. Wikipedia/Wikimedia sources
   1.1 First we can check for exact match `project-name_lang_classifier`
   1.2 If no match, then we can check for `project-name_lang` or `project-name_lang_all`
   1.3 If no match, then we can check for `project-name`

It seems we keep coming back to this so I'll try to be as clear as possible:

2. Khan Academy
   2.1 There was one recipe that didn't follow the hyphen case name `khanacademy-fr-kolibri`

Good to know

   2.2 Khan Academy book name follows this pattern `khan-academy-videos_{lang}_playlist-<playlist-code>`

Yes, we should request a custom, intelligible Name for those.

   This book name should be matched with recipe pattern of `khan-videos-{lang}_playlists`

Same

3. Others

book name recipe name ted_{lang}_playlist-{hyphenated-playlist-name} ted_, ted_all_playlists

ted_all_playlists is a special recipe. We won't have identical Zimfarm recipe and Title ident in this case. We shall request better Names for the books though.

wikivoyage_{lang}_all wikivoyage_{lang}

We shall harmonize on Zimfarm

kiwix.{lang}.stackoverflow.com {lang}.stackoverflow.com_{lang}
kiwix.{topic}.stackexchange.com {topic}.stackexchange.com_{lang}

Those are super old ZIMs. Newer ones have fixed Names.

Thanks again; please share the full list as explained above.