openzim / wp1_selection_tools

Create selections with the best articles of a WM project
https://download.kiwix.org/wp1/
GNU General Public License v3.0

Handling redirects with local anchor - Duplicated content in wikipedia_en_medicine_nodet_2019-08. #27

Closed (ghost closed 4 years ago)

ghost commented 4 years ago

@mgautierfr commented on Aug 11, 2019, 5:24 PM UTC:

Article http://library.kiwix.org/wikipedia_en_medicine_nodet_2019-08/A/Retina and http://library.kiwix.org/wikipedia_en_medicine_nodet_2019-08/A/Lipemia_retinalis are very close.

They have the same title and the same content. There are some differences in the HTML (especially a <link href="../-/s/css_modules/mediawiki.action.view.redirectPage.css" rel="stylesheet" type="text/css" class=""> in A/Lipemia_retinalis).

This issue was moved by kelson42 from openzim/mwoffliner#938.

ghost commented 4 years ago

@kelson42 commented on Aug 11, 2019, 5:39 PM UTC:

@ISNIT0 "Lipemia retinalis" should be a redirect to "Retina"

ghost commented 4 years ago

@kelson42 commented on Aug 24, 2019, 5:31 PM UTC:

@ISNIT0 The problem here is that we don't have a simple redirect from "Lipemia retinalis" to "Retina"; we have a redirect to a specific section, "Retina#Diseases and disorders". This is not possible with the built-in ZIM redirect system. It should be done with a normal HTML page redirect.
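
For illustration, here is a minimal sketch of what such an HTML page redirect could look like, assuming the redirect is expressed as a meta refresh tag and that the target lives next to the redirecting page (so a relative URL is enough). The function name build_redirect_page and the underscore normalisation are assumptions made for this example, not mwoffliner code.

```python
from html import escape
from urllib.parse import quote


def build_redirect_page(target_title: str, section: str = "") -> str:
    """Return a small HTML page that redirects to target_title.

    Hypothetical helper: if `section` is given, it is appended as a
    URL fragment so the reader lands on that section of the target.
    """
    # Assumption: spaces become underscores in both titles and anchors,
    # as in MediaWiki-style URLs.
    url = quote(target_title.replace(" ", "_"))
    if section:
        url += "#" + quote(section.replace(" ", "_"))
    safe_url = escape(url, quote=True)
    safe_title = escape(target_title)
    return (
        "<!DOCTYPE html>\n"
        "<html><head>\n"
        '<meta charset="utf-8">\n'
        # A zero-second meta refresh sends the browser to the target URL.
        f'<meta http-equiv="refresh" content="0;url={safe_url}">\n'
        f"<title>{safe_title}</title>\n"
        "</head><body>\n"
        # Fallback link for readers that ignore meta refresh.
        f'<a href="{safe_url}">{safe_title}</a>\n'
        "</body></html>\n"
    )


if __name__ == "__main__":
    print(build_redirect_page("Retina", "Diseases and disorders"))
```

The point of the sketch is only that the fragment travels with the URL inside the page, which a plain title-to-title redirect entry cannot carry.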

ghost commented 4 years ago

@ISNIT0 commented on Aug 27, 2019, 10:22 AM UTC:

@kelson42 Do you mean the problem is that the articleList contains redirects, and MWO is not resolving them?

ghost commented 4 years ago

@kelson42 commented on Aug 27, 2019, 10:38 AM UTC:

@ISNIT0 Yes... I confirm it is in https://ftp.nluug.nl/pub/kiwix/wp1/enwiki_2019-08/customs/medicine. You mean it might be a duplicate of #889?

ghost commented 4 years ago

@ISNIT0 commented on Aug 27, 2019, 10:38 AM UTC:

I think it is, yes

ghost commented 4 years ago

@kelson42 commented on Aug 27, 2019, 10:39 AM UTC:

@ISNIT0 It is, but the problem is a bit more complex here because of the hash at the end of the URL... But I would agree to handle them together in 2.0 if that is what you prefer. This is indeed really similar.

ghost commented 4 years ago

@ISNIT0 commented on Aug 27, 2019, 10:40 AM UTC:

Surely that's fine? The reader serves the content from the redirect, and the hash is preserved? I have a fix nearly ready

ghost commented 4 years ago

@mgautierfr commented on Sep 2, 2019, 1:53 PM UTC:

For information, zimwriterfs parses the content of the HTML to detect a redirect and creates a redirect article. (See https://github.com/openzim/zimwriterfs/blob/master/src/article.cpp#L143-L159)
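
As a rough illustration of that kind of detection (this is not the zimwriterfs code, which is C++), the sketch below scans an HTML document for a meta refresh tag and returns its target. It assumes the redirect is written as content="0;url=..."; the names _MetaRefreshFinder and find_redirect_target are made up for this example.

```python
from html.parser import HTMLParser
from typing import Optional


class _MetaRefreshFinder(HTMLParser):
    """Remember the URL of the first <meta http-equiv="refresh"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.target: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # Assumed shape of the content attribute: "<delay>;url=<target>".
        _, _, rest = attr.get("content", "").partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url" and value.strip():
            self.target = value.strip()


def find_redirect_target(html_text: str) -> Optional[str]:
    """Return the meta refresh target of html_text, or None if there is none."""
    finder = _MetaRefreshFinder()
    finder.feed(html_text)
    return finder.target
```

Note that the returned target still contains any "#fragment" part; whether that fragment survives once the page is turned into a redirect article is exactly the question raised in this ticket.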

ghost commented 4 years ago

@ISNIT0 commented on Sep 2, 2019, 3:30 PM UTC:

This issue should have been closed by the PR above. I will close it for now; we can re-open it if it's not fixed :)

ghost commented 4 years ago

@kelson42 commented on Sep 2, 2019, 3:48 PM UTC:

@ISNIT0 Thanks for the fix. I have just reopened it because I want to verify this myself.

ghost commented 4 years ago

stale[bot] commented on Nov 1, 2019, 3:54 PM UTC:

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

ghost commented 4 years ago

@kelson42 commented on Mar 17, 2020, 6:13 PM UTC:

The problem here is how we create the list of articles for the medicine selection: the list is full of redirects.
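
One possible way to clean such a list, sketched under the assumption that a title-to-target redirect mapping is already available (how that mapping is obtained is left open here). The function name resolve_article_list and the input format are hypothetical and are not taken from wp1_selection_tools.

```python
from typing import Dict, Iterable, List


def resolve_article_list(titles: Iterable[str], redirects: Dict[str, str]) -> List[str]:
    """Replace redirect titles by their targets and drop duplicates.

    `redirects` maps a redirect title to its target, which may carry a
    "#section" suffix; the suffix is ignored because the selection list
    only needs page titles. The original order is preserved.
    """
    resolved: List[str] = []
    seen = set()
    for title in titles:
        target = title
        visited = set()
        # Follow redirect chains, guarding against redirect loops.
        while target in redirects and target not in visited:
            visited.add(target)
            target = redirects[target].split("#", 1)[0]
        if target not in seen:
            seen.add(target)
            resolved.append(target)
    return resolved


if __name__ == "__main__":
    titles = ["Retina", "Lipemia retinalis", "Aspirin"]
    redirects = {"Lipemia retinalis": "Retina#Diseases and disorders"}
    print(resolve_article_list(titles, redirects))
```

With the example from this ticket, "Lipemia retinalis" collapses into "Retina", so the article would be included only once.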

kelson42 commented 4 years ago

/move openzim/mwoffliner

kelson42 commented 4 years ago

openzim/mwoffliner was the right place for that ticket, I believe.