openzim / wp1

Wikipedia 1.0 engine & selection tools
https://wp1.openzim.org
GNU General Public License v2.0

Allow exclusion of specific articles in WikiProject Selection to avoid projects failing due to a few missing articles #737

Open Jaifroid opened 4 months ago

Jaifroid commented 4 months ago

Summary: WikiProjects are often not up to date and may list several articles that are no longer in Wikipedia (or whose titles have changed). This appears to cause mwOffliner to abort (I know that's an old chestnut, but it is what it is). So we need a way to provide a list of articles to exclude when editing the WikiProject Selection. Currently there is only a way to exclude whole (sub)projects (unless the exclusion list is more flexible than it appears to be).

Detail: I tested this great tool by trying to build a Wikipedia ZIM of literary topics, with a few projects like Literature, Poetry, etc. This all went smoothly, except that the ZIM failed to build due to a bunch of missing articles that are no doubt referenced in out-of-date projects. You can see the rather large list of failures I got here.

I was able to use that JSON output to make a simple list of the articles I need to exclude, only to find that there is no way to exclude individual articles (as opposed to individual projects) in the WikiProject Selection / editing tool. While it can probably be done in SPARQL, realistically users are not going to spend the time learning that syntax, which looks positively evil 😈...

The more this process could be automated, the better (I know, easier said than done).

audiodude commented 4 months ago

Thanks for the report/write up!

A repro case would be great for this, if you have the list of WikiProjects and a few of the articles that were missing. That way I can check if we have "deleted/missing" information in our own database for them and can filter them out at that step.

I honestly would rather implement this as #728, where you specify a Simple Selection to "set difference" (subtract), rather than a bespoke implementation just for WikiProject Selection.

Jaifroid commented 4 months ago

@audiodude Info below for easy repro (using the WikiProject tool). I've also attached the .tsv produced by this list of projects at the very bottom (renamed to .txt, because GitHub doesn't support .tsv attachments).

I suppose I could manually subtract the failing articles from the .tsv and then use Simple Selection to upload a (very long) list of articles, but that would be pretty tedious to do manually... Hence it would be ideal to be able to specify "scrape these projects, but not these articles in the projects". Quite similar to the idea in #728, but it would be great if that could work on Project level too.
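For illustration, the manual subtraction described above could be scripted roughly as follows. This is a minimal sketch and not part of WP1. It assumes the article title sits in the first column of the exported .tsv (the actual column layout is not confirmed here) and that underscores and spaces in titles are interchangeable. The helper names `normalize` and `subtract_titles`, and the sample title `Hamlet`, are made up for the example.

```python
import csv
import io


def normalize(title: str) -> str:
    """Treat underscores and spaces as equivalent in a title."""
    return title.strip().replace("_", " ")


def subtract_titles(tsv_text: str, excluded: list[str], title_column: int = 0) -> list[str]:
    """Return the titles found in the TSV that are not excluded, in file order.

    Assumption: the title lives in `title_column` (default: first column).
    """
    excluded_set = {normalize(t) for t in excluded if t.strip()}
    kept = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= title_column:
            continue  # skip blank or short rows
        title = normalize(row[title_column])
        if title and title not in excluded_set:
            kept.append(title)
    return kept


if __name__ == "__main__":
    # Two titles from the not-found list plus one made-up title.
    sample_tsv = "Bertsolaris\nTyrfing_cycle\nHamlet\n"
    print("\n".join(subtract_titles(sample_tsv, ["Bertsolaris", "Tyrfing cycle"])))
```

The resulting list could then be pasted into a Simple Selection, which is essentially the "set difference" proposed in #728 done by hand.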

In sum, I think there's a lot of value in being able to select a set of related WikiProjects rather than forcing the user to list all the articles in the projects. The underlying problem seems to be that a lot of projects are not well maintained, so it's quite likely other users will run into this issue unless their needs are pretty basic (like a handful of articles they want to archive).

List of WikiProjects selected:

Literature
Theatre
Australian literature
Children and young adult literature
Electronic literature
Indian literature
Philosophical literature
Poetry
Women writers
Mythology

List of not-found articles (or assets) I extracted from the task failure's JSON with a simple search-replace regex on the contents of the error key:

Performing_arts_venues_in_Australia
Theatre_in_Norway
Works_set_in_South_Carolina
Fiction_set_in_1761
Bertsolaris
Fictional_trains
1900_poetry_books
Moroccan_women_writers_by_century
15th-century_Portuguese_poets
Greek_dragons
Theat-stub
Lizzie_Zipmouth_cover.jpg
Writers_from_Paramaribo
Cultural_depictions_of_Edgar_Allan_Poe
Start-Class_Poetry_articles
Plays_by_Barbara_Garson
Hansel-and-gretel-rackham.jpg
Prudent-Louis_Leray_-_Poster_for_the_première_of_Georges_Bizet's_Carmen.jpg
Heroes_in_Norse_myths_and_legends
Writers_from_Missouri
Male_actors_from_Queens,_New_York
American_LGBT_dramatists_and_playwrights
Children_of_Eos
Women_business_writers
Novels_by_Tamora_Pierce
15th-century_Italian_poets
Sargent_-_Robert_Louis_Stevenson_and_His_Wife.jpg
Literary_festivals_in_Ireland
Pre-Islamic_Arabian_poets
Fiction_set_in_the_1960s
1771_plays
Swedish_poetry_collections
Poetry_festivals_in_South_America
Fiction_set_in_the_2110s
Theatre_companies_in_the_Netherlands
Poet-stub
Philosophy_radio_programs
1969_radio_dramas
1778_plays
Malaysian_expatriate_actors_in_India
Steer_-_Pirateology_-_A_Pirate_Hunter's_Companion_Coverart.png
Theatres_completed_in_2016
Africa-myth-book-stub
Theat-director-stub
1920s_poems
Literature/Adil_archive/November/15
Theatres_completed_in_1880
Fiction_about_filicide
1970s-child-hist-novel-stub
1810s_poetry_books
16th-century_Italian_novelists
DramaDesk_PlayDirection_1975–2000
Tyrfing_cycle
Nigerian_expatriate_actors_in_the_United_States
Soundofcolors.jpg
Serbian_speculative_fiction
Riitta_Nelimarkka
The_Bread_Winner.jpg
16th-century_poets_by_nationality
16th-century_writers
Nigerian_expatriate_male_actors_in_the_United_States
Mexican_poets
Dominican_Republic_women_poets
The_Age_of_Innocence_book.jpg
1725_plays
VerdigrisDeep.jpg
Literature/Selected_work/23
Fictional_characters_introduced_in_the_13th_century
1730_in_theatre
11th-century_Icelandic_poets
Fiction_set_in_11th-century_Abbasid_Caliphate
Relatively-speaking.jpg
Theatres_in_the_Netherlands
Indonesian_women_dramatists_and_playwrights
The_Book_of_Mormon_(musical)_character_redirects_to_lists
Fictional_centaurs
Christian_writers_about_eschatology
Love_stargirl_book.jpg
1642_in_theatre
Tajikistani_poets

LiteraryTopics.tsv (rename txt file back to tsv)

audiodude commented 4 months ago

Thanks for the detailed repro information!

I think before we build new features, we should focus on the first part of your bug report, where "the more this process could be automated, the better". The theory that the WikiProjects are somehow out of date makes sense, but it's not true. WP1 builds its knowledge of what articles are in what project based on the categories the article is placed in on its talk page. So it's WP1 itself that's out of date.

This is actually a legitimate bug, #738. We don't query the wiki replica database to see whether an article has been deleted; we just check whether it's been moved, mark it as quality=NotAClass/importance=NotAClass, and clean it up later. The bug describes the rest.

What's happening is that many of those articles which are showing up in the Selection are Category-Class, so they never got deleted.

Jaifroid commented 4 months ago

> I think before we build new features, we should focus on the first part of your bug report, where "the more this process could be automated, the better".

I agree absolutely. The only reason I could have for excluding specific articles or assets from a Project compilation would be that they are causing mwOffliner to fail. I don't know why they cause it to fail, as mwOffliner should ignore 404s if I've understood correctly: maybe there's a limit to the number of 404s it will ignore before bailing. In any case, if the logic for detecting entries for exclusion can be made more robust, that would be a good solution.