openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0

Remove « .html » extension #166

Open kelson42 opened 1 year ago

kelson42 commented 1 year ago

These « .html » extensions, for example here https://library.kiwix.org/content/gutenberg_fr_all/A/Les%20Fleurs%20du%20Mal_cover.6099.html, were necessary at the time we were using zimwriterfs. Zimwriterfs needed this to identify HTML content which should be indexed. This is not necessary anymore. Therefore the extension should be removed, for cleaner URLs and a smaller ZIM size.

benoit74 commented 1 year ago

It is not that simple, unless I am missing something: this extension is necessary to distinguish between the various file formats in the archive.

For instance for book ID 18812 we have these three files now:

Douze ans de séjour dans la Haute-Éthiopie.18812.epub
Douze ans de séjour dans la Haute-Éthiopie.18812.html
Douze ans de séjour dans la Haute-Éthiopie_cover.18812.html

kelson42 commented 1 year ago

@benoit74 Removing « .html » for books in HTML should not create a conflict. This topic will anyway disappear IMO if we implement #95.

benoit74 commented 1 year ago

OK, I didn't get this: all files would have an extension except for the HTML version. Makes sense to me.

rgaudin commented 1 year ago

« This topic will anyway disappear IMO if we implement #95. »

No, we'd still need the cover page so it won't be affected.

@benoit74 besides the chrome URLs (Home.html), the most important one is the cover and, yes, the HTML format version when it's included.

To avoid conflicts yet keep decent-looking URLs I'd propose the following:

/18812/Douze ans de séjour dans la Haute-Éthiopie  # Cover page
/18812/Douze ans de séjour dans la Haute-Éthiopie.epub
/18812/Douze ans de séjour dans la Haute-Éthiopie.pdf
/18812/Douze ans de séjour dans la Haute-Éthiopie.html

I am fine with the HTML format being named .html because it's a formatted book and a single file that can be saved as well; and I like consistency.
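As a rough illustration of the pattern above (a sketch only, not code from the scraper), the paths could be built as follows. The function name `entry_path` and its parameters are hypothetical and chosen for this example; passing no format stands for the cover page.

```python
from typing import Optional


def entry_path(book_id: int, title: str, fmt: Optional[str] = None) -> str:
    """Build a path following the proposed pattern (hypothetical helper).

    fmt=None yields the cover page; any other value is appended as an extension.
    """
    base = f"/{book_id}/{title}"
    return base if fmt is None else f"{base}.{fmt}"


title = "Douze ans de séjour dans la Haute-Éthiopie"
paths = [entry_path(18812, title, fmt) for fmt in (None, "epub", "pdf", "html")]

# The cover page and every format get distinct paths, so nothing collides.
assert len(set(paths)) == len(paths)
```

Because only the cover page is extension-less, the HTML book and its cover can live side by side under the same book ID prefix.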

@kelson42 if you don't like it, please suggest another pattern ; keeping in mind:

kelson42 commented 1 year ago

@rgaudin Agree with your proposal.

eshellman commented 1 year ago

If it helps, I have code that will make a safe, title-based, GitHub-safe filename slug for any book in PG.
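For readers wondering what such a slug might look like, here is a minimal sketch using only the Python standard library. It is not the code referred to above: the function name `title_slug`, the `max_len` cut-off, and the choice to append the book ID are all assumptions made for this example.

```python
import re
import unicodedata


def title_slug(title: str, book_id: int, max_len: int = 60) -> str:
    """Return an ASCII-only, filename-safe slug (hypothetical helper)."""
    # Decompose accented characters, then drop the non-ASCII combining marks.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", ascii_title).strip("-")
    # Truncate long titles, then append the book ID (an assumption) for uniqueness.
    return f"{slug[:max_len].rstrip('-')}_{book_id}"


print(title_slug("Douze ans de séjour dans la Haute-Éthiopie", 18812))
# prints "Douze-ans-de-sejour-dans-la-Haute-Ethiopie_18812"
```

Appending the ID keeps two books with the same title from producing the same slug, at the cost of a slightly longer name.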

prathamkumarjha commented 1 year ago

Hi, may I help by removing the « .html » extensions?

rgaudin commented 1 year ago

@prathamkumarjha yes, you can submit a PR, as per my comment above.