Closed rgaudin closed 2 years ago
First of all, book-title reconciliation should happen during book creation. If it fails (not likely), then we'll end up with dangling books that will need to be manually assigned to Titles.
Titles are identified by an `ident` property which we want to mimic Zimfarm's recipes.
Should we receive a valid (see format below) `recipe` property in the endpoint, we should use it. I will check with @kelson42 whether we want Zimfarm to submit its recipe name to the book-add endpoint.
The title/Zimfarm-recipe format looks like: `{project}_{lang}[_{classifier}]`
- `project` is either an identified project name like `wikipedia`, `wikibooks`, or `gutenberg` or a domain-name (`stackoverflow.com`), depending on the source.
- `lang` is the ISO 639-1 (2-letters long) language code of the content. ⚠️ this is different to the 3-letters ISO 639-3 used in the `Language` metadata. zimscraperlib can convert it though. Multi-language ZIMs may either use the most important language or `mul` (which is not ISO 639-1 valid).
- `classifier` is optional and can represent different things:
  - `selection`. For instance, on Wikipedia ZIMs, we have the `all` selection meaning all articles, but also `top` or `football` or `movies`, etc.
  - `flavour`. This is a tailored version of a ZIM for a specific use or audience. We often use `maxi` or `nopic` or `mini` but this is flexible. Flavour may be found in `Flavour` metadata.
  - `selection` and `flavour` can be combined.

Sample existing, valid recipe names:
- `wiktionary_af`
- `stackoverflow.com_bard`
- `wikihow_ar_maxi`
- `les-fondamentaux_fr_all`
- `wikipedia_en_all_maxi`
- `french.stackexchange.com_mul`
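
To make the format concrete, here is a minimal sketch of splitting such a name into its parts. `Ident` and `parse_ident` are hypothetical names, not part of the CMS, and the sketch assumes the project part never contains an underscore (which holds for the samples above).

```python
from typing import NamedTuple, Optional


class Ident(NamedTuple):
    project: str
    lang: str
    classifier: Optional[str]  # selection and/or flavour, e.g. "all_maxi"


def parse_ident(ident: str) -> Ident:
    """Split `{project}_{lang}[_{classifier}]` into its three parts.

    Assumption: the project part contains no underscore.
    """
    # At most 3 parts, so a combined classifier keeps its own underscores
    parts = ident.split("_", 2)
    if len(parts) < 2 or not all(parts):
        raise ValueError(f"not a {{project}}_{{lang}}[_{{classifier}}] ident: {ident!r}")
    classifier = parts[2] if len(parts) == 3 else None
    return Ident(parts[0], parts[1], classifier)
```

With this, `wikipedia_en_all_maxi` splits into project `wikipedia`, lang `en` and classifier `all_maxi`, while `wiktionary_af` has no classifier. The sketch does not check that the language code is valid.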
This format is close to the `Name` metadata (which is mandatory) found in Books. Reconciliation will thus be from `Name` metadata to `Title.ident`.
I don't think we want to fall back to using filenames but will check. Those usually look like `{Name}_{period-year}-{period-month}.zim`.
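
If filenames were ever used as a fallback (the comment above leans against it), recovering the `Name` could look like the sketch below. `name_from_filename` is a hypothetical helper, and the 4-digit year and 2-digit month are assumptions about the period suffix.

```python
import re
from typing import Optional

# `{Name}_{period-year}-{period-month}.zim`
# Assumption: the period is a 4-digit year and a 2-digit month.
FILENAME_RE = re.compile(r"^(?P<name>.+)_(?P<year>\d{4})-(?P<month>\d{2})\.zim$")


def name_from_filename(filename: str) -> Optional[str]:
    """Return the `Name` part of a ZIM filename, or None if it does not fit the pattern."""
    match = FILENAME_RE.match(filename)
    return match.group("name") if match else None
```

For a made-up filename such as `wikipedia_en_all_maxi_2023-01.zim`, this returns `wikipedia_en_all_maxi`.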
When creating Titles (when we're confident there is none for this book), we'll probably reuse that name but we need to ensure the format is valid with what's above.
I think a good exercise for this tool would be to try and reconcile all the existing Zimfarm recipes' names (JSON here) with the ZIM catalog's book names (XML here):
curl https://api.farm.openzim.org/v1/schedules/backup/ | jq '.[].name'
curl -L http://download.kiwix.org/library/library_zim.xml | python3 -c 'import sys; import json; from xml.dom.minidom import parseString; dom = parseString(sys.stdin.read()); [print(book.getAttribute("name")) for book in dom.getElementsByTagName("book")]'
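
A first pass at that exercise could be a plain set comparison on the output of the two commands above. This is only a sketch: the function names are hypothetical, and it assumes the schedules backup is a JSON list of objects with a `name` key (as the `jq` filter suggests) and that books carry a `name` attribute.

```python
import json
from xml.dom.minidom import parseString


def recipe_names(schedules_json: str) -> set:
    """Names from the schedules backup (assumed: a JSON list of objects with a `name` key)."""
    return {schedule["name"] for schedule in json.loads(schedules_json)}


def book_names(library_xml: str) -> set:
    """`name` attribute of every <book> element in the catalog XML."""
    dom = parseString(library_xml)
    return {book.getAttribute("name") for book in dom.getElementsByTagName("book")}


def reconcile(recipes: set, books: set) -> dict:
    """Exact-name comparison only; no fuzzy matching."""
    return {
        "matched": sorted(recipes & books),
        "recipes_without_book": sorted(recipes - books),
        "books_without_recipe": sorted(books - recipes),
    }
```

Saving both downloads to files and passing their contents to `recipe_names` and `book_names` gives the two sets; whatever ends up outside `matched` is what needs a closer look.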
@rgaudin I have some confusion.

> find the appropriate Title for a book's metadata

I assume this is different from the "Title" key in the Metadata, because that looks like "English Wikipedia", while what I see in your comments looks like the "Name" key in the Metadata, because that looks like `wikipedia_fr_football`. That's what the Zimfarm recipe looks like.

> Titles are identified by an `ident` property which we want to mimic Zimfarm's recipes.

Again, when I look at the Zimfarm recipes I see that they have a "Name".
Can you tell me if I have understood correctly: when you say Title here, you actually mean Name in Zimfarm's recipes?

> Should we receive a valid

I know I can use `find_language_names` to see if a valid language code was passed, except for `mul`.
Not sure how I'll test for a valid `project` name and `classifier`, unless I have a list of allowed keywords/names.

> > find the appropriate Title for a book's metadata
>
> I assume this is different from the "Title" key in the Metadata. Because that looks like "English Wikipedia". While what I see in your comments look like "Name" key in the Metadata. Because that looks like `wikipedia_fr_football`. That's what the Zimfarm-recipe looks like.

Yes, the reconciliation is between a ZIM's `Name` metadata and a Title `ident`.

> > Titles are identified by an `ident` property which we want to mimic Zimfarm's recipes.
>
> Again when I look at the Zimfarm recipes I see that it has "Name".
> Can you tell me what if I have understood correctly, that when you mean Title here, you actually mean Name in Zimfarm's recipes.

It's important to understand that Title has two meanings. As noted in the Wiki, the `Title` model references an Entry in the CMS/Catalog/Library. Following the Library analogy, Barack Obama's A Promised Land is a Title for which different books exist: hardcover, paperback, audiobook, e-book. Each of these books has an Author, an Editor and a title.
In the CMS, we'll have `Title` instances represented by an ident like `wikipedia_fr_all` which would expose different `Book` instances (ZIM files) identified by a UUID.
The `Title` ident that we need to build should resemble the names of Zimfarm recipes (internally called `schedules`). Those are visible on the recipes page.

> > Should we receive a valid
>
> I know I can use `find_language_names` to see if a valid language code was passed, except for `mul`.

Correct

> Not sure how I'll test for valid `project` name and `classifier`, unless I have a list of allowed keywords/names.

Those are flexible so there is no allowed list. We'll just use a slug-like regex like `[a-z0-9_\-\.]+`.
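
As a sketch of what that check could look like (assuming the pattern has to cover the whole value, and with `is_slug` being a hypothetical helper name):

```python
import re

# The slug-like pattern suggested above. Note that it also accepts `_`,
# so it can be applied to a single part or to a full ident.
SLUG_RE = re.compile(r"[a-z0-9_\-\.]+")


def is_slug(value: str) -> bool:
    """True when the whole value is made of lowercase letters, digits, `_`, `-` or `.`."""
    return SLUG_RE.fullmatch(value) is not None
```

Names like `les-fondamentaux_fr_all` or `french.stackexchange.com_mul` pass; anything with uppercase letters or spaces does not.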
@rgaudin
These are the patterns I have noticed:
1. Wikipedia/Wikimedia sources
   1.1 First we can check for exact match `project-name_lang_classifier`
   1.2 If no match, then we can check for `project-name_lang` or `project-name_lang_all`
   1.3 If no match, then we can check for `project-name`
2. Khan Academy
   2.1 There was one recipe that didn't follow the hyphen case name: `khanacademy-fr-kolibri`
   2.2 Khan Academy book names follow this pattern: `khan-academy-videos_{lang}_playlist-<playlist-code>`. This book name should be matched with the recipe pattern `khan-videos-{lang}_playlists`
3. Others

book name | recipe name |
---|---|
ted_{lang}_playlist-{hyphenated-playlist-name} | ted_ |
wikivoyage_{lang}_all | wikivoyage_{lang} |
kiwix.{lang}.stackoverflow.com | {lang}.stackoverflow.com_{lang} |
kiwix.{topic}.stackexchange.com | {topic}.stackexchange.com_{lang} |

Some book names had no relevant recipe. For example, for `wikipedia_nb_50000`, the recipe `wikipedia_nb` was not in the list.
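
The fallback order in 1.1 to 1.3 can be sketched roughly as below. This is not the actual matching script; `candidate_recipes` and `match_recipe` are hypothetical names, and it assumes project names contain no underscore.

```python
from typing import Iterable, List, Optional


def candidate_recipes(book_name: str) -> List[str]:
    """Recipe names to try for a book Name, most specific first (steps 1.1 to 1.3)."""
    parts = book_name.split("_", 2)
    project = parts[0]
    candidates = [book_name]  # 1.1 exact match
    if len(parts) >= 2:
        base = f"{project}_{parts[1]}"
        candidates += [base, f"{base}_all"]  # 1.2 project_lang, then project_lang_all
    candidates.append(project)  # 1.3 project only
    # Drop duplicates while keeping the order
    return list(dict.fromkeys(candidates))


def match_recipe(book_name: str, recipes: Iterable[str]) -> Optional[str]:
    """First candidate that exists among the known recipe names, or None."""
    known = set(recipes)
    return next((c for c in candidate_recipes(book_name) if c in known), None)
```

For `wikivoyage_fr_all` this tries `wikivoyage_fr_all`, `wikivoyage_fr`, then `wikivoyage`, which covers the wikivoyage row of the table. A book with no candidate in the list yields `None`.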

> These are the patterns I have noticed

Thank you; could you share your matching script and a CSV of `Name,recipeName` for all ZIMs? It doesn't have to be clean or anything. It's just so I can run it and check them all myself.
We should probably also print the list of recipeNames with no match.

> 1. Wikipedia/Wikimedia sources
>    1.1 First we can check for exact match `project-name_lang_classifier`
>    1.2 If no match, then we can check for `project-name_lang` or `project-name_lang_all`
>    1.3 If no match, then we can check for `project-name`

It seems we keep coming back to this so I'll try to be as clear as possible:
- `Name` are mostly OK and not subject to much change. We can't change existing ones but if there is an issue with the `Name`, we can fix that in the creator and re-produce the ZIM file. This is our base: that's what we get from the API payload.
- `Title.ident` property. Knowing which ones doesn't comply with our expected `ident` rule is important for me (to harmonize tools) but not a blocker for you.
- `ident` from the `Name` metadata it received.
- `{project}_{lang}_{classifier}`
- `Name` variations so we can clearly state what the `classifier` part will be.
- `Name` metadata (in the sense of non-respecting this convention format)
- `Name` from an `ident`.

> 2. Khan Academy
>    2.1 There was one recipe that didn't follow the hyphen case name `khanacademy-fr-kolibri`

Good to know

> 2.2 Khan Academy book name follows this pattern `khan-academy-videos_{lang}_playlist-<playlist-code>`

Yes, we should request custom, intelligible `Name` for those.

> This book name should be matches with recipe pattern of `khan-videos-{lang}_playlists`

Same

> 3. Others
>
> | book name | recipe name |
> |---|---|
> | ted_{lang}_playlist-{hyphenated-playlist-name} | ted, ted_all_playlists |

`ted_all_playlists` is a special recipe. We won't have identical Zimfarm recipe and Title ident in this case. We shall request better `Name` for the books though.

> | wikivoyage_{lang}_all | wikivoyage_{lang} |

We shall harmonize on Zimfarm

> | kiwix.{lang}.stackoverflow.com | {lang}.stackoverflow.com_{lang} |
> | kiwix.{topic}.stackexchange.com | {topic}.stackexchange.com_{lang} |

Those are super old ZIMs. Newer ones have fixed Names.
Thanks again; please share the full list as explained above.
Required for #24, a mechanism to find the appropriate Title for a book's metadata, if there is such a title in the database.