gutenbergtools / ebookconverter

code that orchestrates ebook conversion for project gutenberg
GNU General Public License v3.0

Create -h.zip after adding headers/footers #38

Closed gbnewby closed 9 months ago

gbnewby commented 1 year ago

I'm not sure if this is for ebookconverter or elsewhere in the processing chain.

For HTML and plain text, ebookmaker adds the header+metadata and footer to posted books.

For HTML, I would like to stop pushing -h.zip and instead have it created after the header+metadata and footer are added. The -h.zip can then go in cache/epub/xxx rather than in 1/2/3/...

The *-h.zip is a very useful format since it allows download of the HTML file plus all assets. But I'd like it to have the correct up-to-date metadata from the catalog whenever the other generated formats are built.

eshellman commented 1 year ago

If I understand this correctly, you would like an additional output file which is the zipped generated html and linked files. This is not too hard.

You also want the submitted zip to be deleted after it has been pushed and unzipped.

You are correct that this belongs in ebookconverter.

gbnewby commented 1 year ago

Yes, that's right: An additional output file which is the zipped generated HTML and linked files.

Thanks.

eshellman commented 1 year ago

I scoped out the work involved. There is unused (AFAICT) machinery for zip packaging in ebookmaker which could be removed - it doesn't do what we want.

gbnewby commented 1 year ago

Ok, thanks.

eshellman commented 1 year ago

I've coded a Writer b5076ac4c3d31027b715737c55d27aa392aaefcc that writes the html and linked files into a zip, named pg9999.zip. SO-

  1. is pg9999.zip the right name?
  2. Should it be entered into the files database? The database expects filename.[ext].zip - do we want to invent an ext that indicates a file bundle, or create a filetype representing the bundle as zip? We should do this if we want it to appear in the web representation; or we can leave it like the log files for now and add it later.

gbnewby commented 1 year ago

Thanks for this.

The existing convention is 9999-h.zip which unzips to 9999-h/9999-h.htm, 9999-h/images/... etc.

The folder you are zipping will be something like .../9/9/9/9999/9999-h/ (but presumably you'll drop everything but that last directory).

So, I think it would make the most sense to keep the same name: 9999-h.zip. Alternatively, pg9999-h.zip. I'd put pg9999.zip as the last choice, since there's nothing in the filename to indicate it's HTML. I am assuming this new .zip will be deposited alongside the other files in cache/epub/9999/.

eshellman commented 1 year ago

I'm not zipping the source directory, I'm zipping the generated html files along with any linked files.

Currently, the -h.zip files are saved in the db without the filetype, only the compression type. I will change the file naming so that the ../cache/epub/9999/ zips mimic the ../9/9/9/9999/9999-h/ zips.

eshellman commented 1 year ago

one more thing: all the generated files are named pg9999 something so for consistency I recommend pg9999-h.zip as the zip file name - this will help distinguish the generated zips from the 'source' zips.

gbnewby commented 1 year ago

  1. For the filename: pg9999-h.zip is fine.

  2. Zipping generated files instead of as-submitted files is fine.

  3. I question the lack of a directory structure. It looks like you are flattening the images and other assets and copying them to cache/epub.

That seems like a mistake for anyone getting the *-h.zip since it unzips flat and there can be hundreds of files. This also clutters the cache/epub needlessly.

Instead, I think you should replicate the directory structure of as-submitted HTML.

E.g., when moving 1/2/1/2/12122/12122-h/12122-h.htm to cache/epub/12122-h/pg12122, also move images/ and other directories and their content.

This will also eliminate the need to rewrite the URLs in the generated HTML.

I realize this is a bigger change than just making new *-h.zip files, but when people unzip those files it's much friendlier to unzip assets into their own directories.
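For reference, the two layouts mentioned in this thread can be sketched in Python. This is only an illustration: the helper names are made up, and the rule for the digit directories is inferred from the two examples given here (9/9/9/9999 and 1/2/1/2/12122), so single-digit book numbers are deliberately left out.

```python
from pathlib import PurePosixPath


def source_dir(ebook_no: int) -> PurePosixPath:
    """Directory of the as-submitted files, e.g. 9999 -> 9/9/9/9999.

    Inferred from the examples in this thread: every digit except the last
    becomes one directory level, followed by the full number.
    """
    if ebook_no < 10:
        # Not covered by the examples in the thread, so not guessed here.
        raise ValueError("single-digit book numbers are not covered by this sketch")
    digits = str(ebook_no)
    return PurePosixPath(*digits[:-1], digits)


def cache_dir(ebook_no: int) -> PurePosixPath:
    """Directory of the generated files, e.g. 9999 -> cache/epub/9999."""
    return PurePosixPath("cache", "epub", str(ebook_no))


print(source_dir(12122) / "12122-h" / "12122-h.htm")
print(cache_dir(12122) / "pg12122-h.zip")
```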

eshellman commented 1 year ago

directory structure is preserved inside the zip so that links still work.
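A minimal sketch of that kind of packaging with Python's standard zipfile module. It is not the actual Writer from the commits referenced in this thread: the function name, the way the list of linked files is passed in, and the use of the pg9999-h.zip naming pattern are assumptions for illustration.

```python
import zipfile
from pathlib import Path
from typing import List


def write_html_zip(base_dir: Path, html_name: str, linked: List[str], ebook_no: int) -> Path:
    """Zip the generated HTML file plus its linked files.

    Paths in `linked` are relative to `base_dir` (for example "images/cover.jpg").
    They are stored under the same relative names, so links inside the HTML
    keep working after unzipping.
    """
    zip_path = base_dir / f"pg{ebook_no}-h.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for rel in [html_name, *linked]:
            # arcname keeps the relative path instead of the absolute one.
            zf.write(base_dir / rel, arcname=rel)
    return zip_path
```

Storing each member under its relative name is what keeps an images/ subdirectory intact when the archive is unpacked.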

gbnewby commented 1 year ago

directory structure is preserved inside the zip so that links still work.

That's good, but why flatten everything in cache/epub?

For people who don't download the .zip but instead "Save as / HTML complete" (or similar) from their browser, everything is flattened needlessly. I don't understand why the directory structure is flattened.

eshellman commented 1 year ago

Not sure what you mean by "everything is flattened". what is flattened?

gbnewby commented 1 year ago

Never mind - I chose some bad books to look at, ones where there is no images/ subdirectory.

I looked further, and see that in fact the relative folder structure is maintained. The folder structure is not flattened.

Sorry for my false alarm. I don't see any other desired changes to make the *-h.zip.

Thanks.

eshellman commented 1 year ago

super

gbnewby commented 1 year ago

I'm checking back on this. Is it coded and queued up for the next release of ebookconverter?

eshellman commented 1 year ago

yes b5076ac4c3d31027b715737c55d27aa392aaefcc and e1b47ca91920489e1aba4ea6e9e96a965c69f584

gbnewby commented 11 months ago

Reopening since this isn't quite right yet.

The HTML file in the .zip retains the .utf8 extension, which will prevent it opening correctly in web browsers until the file is renamed.

We do some .htaccess magic in Apache so that the .utf8 extension is handled correctly, but I never thought this was necessary and still don't.

Minimally, the HTML in the .zip needs to end in .html or .htm. But I think it would be perfectly safe for all generated HTML files to end in .htm like in the 1/2/3 filesystem (without .utf8). I realize that .htm versus .html is a preference, and since I presume it's easy to have .htm emitted from the generated code but hard to change all the .htm in 1/2/3 to .html, I prefer changing the generated code.

Minimally, though, the .zip needs to have a file that browsers will recognize as HTML. Thanks for this.

On Fri, Sep 22, 2023 at 10:09 AM Eric Hellman wrote:

Closed #38 https://github.com/gutenbergtools/ebookconverter/issues/38 as completed.

gbnewby commented 11 months ago

Filename extension in .zip needs to be fixed. Generally, I am interested in (1) ditching .utf8 and (2) having generated HTML end in .htm.

eshellman commented 11 months ago

I would like to just remove the .utf8 from the name of the generated file. But I prefer .html, because it helps distinguish the generated file from the submitted .htm files; also, changing to .htm would require a number of code changes to avoid breaking links. Unfortunately, changing the generated file name might not be simple either; just changing the file name to .html inside the zip would be simpler. I'll look to see whether the rename is trivial.
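The simpler option mentioned here, changing only the name stored inside the zip, might look roughly like this. It is a sketch under assumptions: both helpers are hypothetical, the example file name is made up, and the only library feature relied on is the `arcname` argument of the standard `zipfile.ZipFile.write`.

```python
import zipfile
from pathlib import Path


def zip_member_name(filename: str) -> str:
    """Drop a trailing '.utf8', so 'pg9999.html.utf8' is stored as 'pg9999.html'."""
    suffix = ".utf8"
    if filename.endswith(suffix):
        return filename[: -len(suffix)]
    return filename


def add_to_zip(zf: zipfile.ZipFile, path: Path) -> None:
    """Store `path` in the archive under a name browsers recognize as HTML."""
    zf.write(path, arcname=zip_member_name(path.name))
```

The file on disk keeps its name; only the entry in the archive changes, so nothing that links to the generated file needs to be touched.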

gbnewby commented 11 months ago

Ok, that's fine. I'm not that fussy about .html versus .htm, but it's very user-unfriendly to have .utf8 in the packaged .zip.

eshellman commented 11 months ago

I've verified that the magic you have to remove the '.utf8' will serve the '.html' file if it's there, so changing the name of the generated file will be a one-liner.

after a month, we can remove the magic and save on one round trip of redirect and perhaps a disk seek.

I'd appreciate if you could look at the config magic to double check this.

gbnewby commented 11 months ago

Sorry I'm missing something: where should I look?

I thought maybe this was in the .htaccess equivalent, but didn't see it there. If you could give me a little guidance, I will dig into it. Thanks.

eshellman commented 11 months ago

I thought it was in .htaccess

gbnewby commented 11 months ago

If that is what you were thinking, my guess is that a file named .html rather than .html.utf8 will just work. Give it a try. For example, by renaming a file in cache/epub.

There is a rule in .htaccess to redirect .utf8 text files, but nothing for .htm or .html:

RewriteRule ^ebooks/([0-9]+).txt.utf-8$ %{HTTP:X-Forwarded-Proto}://%{HTTP_HOST}/cache/epub/$1/pg$1.txt [L,R]

However, if someone followed a link to .txt, like to .htm or .html, it should "just work." Apache certainly knows how to serve up text and HTML files.

I seem to think there used to be rules for .html.utf8. It's possible there is something in the Apache configuration about this (not .htaccess, but the system configuration). Easy enough to just test and confirm.

Since .htm files in 1/2/3/ are served correctly, I am betting that .html files in cache/epub will also be served correctly. Without needing to add .utf8 to the end of the filename.

Let me know if I might be missing something about this.
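To see what the quoted rule matches, the pattern can be tried with Python's `re` module. This only exercises the regular expression exactly as it appears in this comment; it does not model Apache's mod_rewrite, and the target path is built by hand from the rule's substitution string.

```python
import re
from typing import Optional

# Pattern copied as quoted in the comment above.
PATTERN = re.compile(r"^ebooks/([0-9]+).txt.utf-8$")


def rewrite(path: str) -> Optional[str]:
    """Return the redirect target path, or None if the rule does not apply."""
    match = PATTERN.match(path)
    if match is None:
        return None
    number = match.group(1)
    return f"/cache/epub/{number}/pg{number}.txt"


print(rewrite("ebooks/111.txt.utf-8"))  # /cache/epub/111/pg111.txt
print(rewrite("ebooks/111.html.utf8"))  # None: this pattern does not cover .html.utf8
```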

eshellman commented 11 months ago

yes, that's what I did, it works.

I just want to confirm that the rule does what we think it does.

I'd like to know why it redirects twice

gbnewby commented 11 months ago

Great!

Could you please quote the rule you are referring to?

eshellman commented 11 months ago

for example, a request to /cache/epub/111/pg111-images.html gets redirected to /cache/epub/111/pg111-images.html (no change) and I'm assuming this is some interaction between apache and a load balancer

eshellman commented 11 months ago

I haven't found the rule that does this

eshellman commented 11 months ago

It seems that AddCharset UTF-8 .utf8 captures the extension and sets the encoding in the Content-Type header. But this is the first rule triggered:

RewriteRule ^ebooks/([0-9]+)\.html\.images$ %{HTTP:X-Forwarded-Proto}://%{HTTP_HOST}/cache/epub/$1/pg$1-images.html [L,R]

So the question is whether we need to add a config to set utf-8 and reproduce the existing behavior.
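One way to check whether a response still declares UTF-8 is to look at the charset parameter of its Content-Type header. The sketch below only parses a header value with Python's standard library; the helper name is hypothetical and the sample header strings are illustrative, not captured from the server.

```python
from email.message import Message
from typing import Optional


def charset_of(content_type: str) -> Optional[str]:
    """Return the lower-cased charset parameter of a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


print(charset_of("text/html; charset=UTF-8"))  # utf-8
print(charset_of("text/html"))                 # None
```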

gbnewby commented 11 months ago

Just to confirm: What we're talking about is having generated HTML files end in .html rather than the current ending of .html.utf8. Right?

I am confident we don't need any new Apache rule for this. It is bread-and-butter behavior for the web server, out of the box.

The rule you cite does, indeed, rewrite the URL for .html.images to .html. I don't think that's related to files ending in .utf8.

From your example, I did not see a redirect just now:

$ wget https://www.gutenberg.org/cache/epub/111/pg111-images.html
--2023-09-29 16:57:50-- https://www.gutenberg.org/cache/epub/111/pg111-images.html
Resolving www.gutenberg.org (www.gutenberg.org)... 2610:28:3090:3000:0:bad:cafe:47, 152.19.134.47
Connecting to www.gutenberg.org (www.gutenberg.org)|2610:28:3090:3000:0:bad:cafe:47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 551065 (538K) [text/html]
Saving to: ‘pg111-images.html’

pg111-images.html 100%[===================>] 538.15K 768KB/s in 0.7s

2023-09-29 16:57:51 (768 KB/s) - ‘pg111-images.html’ saved [551065/551065]
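The same check can be scripted. The sketch below uses only Python's standard `http.client`, which never follows redirects on its own, and reports the status and any Location header for a single request. The function names are hypothetical, it sends HEAD rather than the GET that wget uses, and the example call is left commented out because it needs network access.

```python
import http.client
from typing import Optional
from urllib.parse import urlsplit


def summarize(status: int, location: Optional[str]) -> str:
    """Describe one HTTP response: a redirect if it is 3xx with a Location header."""
    if 300 <= status < 400 and location:
        return f"{status} redirect to {location}"
    return f"{status} no redirect"


def probe(url: str) -> str:
    """Send one HEAD request over HTTPS and summarize the response."""
    parts = urlsplit(url)
    conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    try:
        conn.request("HEAD", parts.path or "/")
        resp = conn.getresponse()
        return summarize(resp.status, resp.getheader("Location"))
    finally:
        conn.close()


# print(probe("https://www.gutenberg.org/cache/epub/111/pg111-images.html"))
```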

eshellman commented 9 months ago

everything looks correct