openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

Some gutenberg books have not been updated for a while and have bad names #841

Open B-root74 opened 4 months ago

B-root74 commented 4 months ago

Some gutenberg books have not been updated for a while (since sometime in 2022) and have bad names (with a period at the end).

gutenberg_ale_all_2022-08
gutenberg_ang_all_2022-08
gutenberg_bgs_all_2022-08
gutenberg_brx_all_2022-07
gutenberg_csb_all_2022-07
gutenberg_grc_all_2022-07
gutenberg_kha_all_2022-05
gutenberg_kld_all_2022-08
gutenberg_ko_all_2022-08
gutenberg_nai_all_2022-08
gutenberg_nav_all_2022-05

It looks like these languages are no longer available at Project Gutenberg, so I think these files can safely be deleted.

eshellman commented 4 months ago

These are the single-book languages in PG. They're still there. Just checking two of these: https://www.gutenberg.org/browse/languages/kha https://www.gutenberg.org/browse/languages/ko

B-root74 commented 4 months ago

Oh great, thank you!

Then it is just a matter of changing the ZIM metadata? Why are these ZIMs not updated by openZIM while all the others are?

eshellman commented 4 months ago

maybe a zero-index bug?

benoit74 commented 4 months ago

@Popolechien why do you want to remove these files? Did we make a decision in the past to not publish them anymore?

If @eshellman is right (and I know he probably is), I see no reason not to publish these ZIMs; we "just" have to fix the scraper.

Is it correct to say that all these books are referenced as having multiple languages? I'm quite inclined to believe this might be the issue; I don't think Gutenberg is capable of supporting multiple languages per book nicely.

Popolechien commented 4 months ago

@benoit74 I just saw the word "delete" and mindlessly hit the 'assign' button 😁

More seriously, the question stands as to why these are not being updated: the last run occurred 11 hours ago and the latest ko / kha files are still from May 2022.

benoit74 commented 4 months ago

If every time a random contributor (no offense @B-root74) states that something has to be deleted we delete it, we might run into trouble 🤪

Anyway, we are all aligned, there is probably an issue in the scraper. https://github.com/openzim/gutenberg/issues/218

benoit74 commented 4 months ago

Oh, but the Remove label only means that someone is requesting to delete a file, not that we have accepted to delete it? I find it a bit misleading (at least I was misled); I would rename it "Removal request".

Popolechien commented 4 months ago

Never mind, it's on me. Should teach me not to check new issues on a Sunday ^^