Closed gbnewby closed 9 months ago
If I understand this correctly, you would like an additional output file which is the zipped generated html and linked files. This is not too hard.
You also want the submitted zip to be deleted after it has been pushed and unzipped.
You are correct that this belongs in ebookconverter.
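A rough sketch of the "unzip, then delete the submitted zip" step described above, using only the Python standard library. The function name and arguments are hypothetical and are not taken from ebookconverter; the real push workflow may do this differently.

```python
import zipfile
from pathlib import Path


def unpack_and_remove(zip_path, dest_dir):
    """Extract a submitted zip into dest_dir, then delete the zip.

    Hypothetical helper for illustration only. The zip is removed only
    if extraction finished without raising an exception.
    """
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
        zf.extractall(dest_dir)
    # Reached only after a successful extraction.
    zip_path.unlink()
    return names
```

Deleting only after the `with` block has closed means a failed extraction leaves the submitted zip in place for a retry.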
Yes, that's right: An additional output file which is the zipped generated HTML and linked files.
Thanks.
I scoped out the work involved. There is unused (AFAICT) machinery for zip packaging in ebookmaker which could be removed - it doesn't do what we want.
Ok, thanks.
I've coded a Writer (b5076ac, https://github.com/gutenbergtools/ebookconverter/commit/b5076ac4c3d31027b715737c55d27aa392aaefcc) that writes the html and linked files into a zip, named pg9999.zip. So:
- Is pg9999.zip the right name?
- Should it be entered into the files database? The database expects filename.[ext].zip - do we want to invent an ext that indicates a file bundle, or create a filetype representing the bundle as zip? We should do this if we want it to appear in the web representation; or we can leave it like the log files for now and add it later.
Thanks for this.
The existing convention is 9999-h.zip which unzips to 9999-h/9999-h.htm, 9999-h/images/... etc.
The folder you are zipping will be something like .../9/9/9/9999/9999-h/ (but presumably you'll drop everything but that last directory).
So, I think it would make the most sense to keep the same name: 9999-h.zip. Alternatively, pg9999-h.zip. I'd put pg9999.zip as the last choice, since there's nothing in the filename to indicate it's HTML. I am assuming this new .zip will be deposited alongside the other files in cache/epub/9999/.
I'm not zipping the source directory, I'm zipping the generated html files along with any linked files.
Currently, the -h.zip files are saved in the db without the filetype, only the compression type. I will change the file naming so that the ../cache/epub/9999/ zips mimic the ../9/9/9/9999/9999-h/ zips.
One more thing: all the generated files are named pg9999 something, so for consistency I recommend pg9999-h.zip as the zip file name - this will help distinguish the generated zips from the 'source' zips.
For the filename: pg9999-h.zip is fine.
Zipping generated files instead of as-submitted files is fine.
I question the lack of a directory structure. It looks like you are flattening the images and other assets and copying them to cache/epub
That seems like a mistake for anyone getting the *-h.zip since it unzips flat and there can be hundreds of files. This also clutters the cache/epub needlessly.
Instead, I think you should replicate the directory structure of as-submitted HTML.
E.g., when moving 1/2/1/2/12122/12122-h/12122-h.htm to cache/epub/12122-h/pg12122, also move images/ and other directories and their content.
This will also eliminate the need to rewrite the URLs in the generated HTML.
I realize this is a bigger change than just making new *-h.zip files, but when people unzip those files it's much friendlier to unzip assets into their own directories.
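For readers unfamiliar with the as-submitted layout, the two paths quoted in this thread (.../9/9/9/9999/9999-h/ and 1/2/1/2/12122/12122-h/) suggest the directory is built from every digit of the ebook number except the last. The sketch below encodes only that observation; the function name is hypothetical, and single-digit numbers are deliberately left out because the thread does not show how they are stored.

```python
def submitted_html_dir(ebook_id):
    """Return the as-submitted HTML directory for an ebook number.

    Inferred from the examples in this thread, not from ebookconverter:
    9999 -> 9/9/9/9999/9999-h and 12122 -> 1/2/1/2/12122/12122-h.
    """
    digits = str(int(ebook_id))
    if len(digits) < 2:
        # The thread gives no example for single-digit ebooks.
        raise ValueError("single-digit ebook numbers are not covered by this sketch")
    prefix = "/".join(digits[:-1])  # every digit except the last
    return f"{prefix}/{digits}/{digits}-h"
```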
directory structure is preserved inside the zip so that links still work.
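A minimal sketch of how relative paths can be kept inside a zip so that links such as images/cover.jpg keep working after unzipping. It is not the Writer from the commits referenced in this thread: for simplicity it zips everything under one directory rather than following links from the HTML, and the function and argument names are assumptions. Only the pg9999-h.zip naming comes from the discussion.

```python
import zipfile
from pathlib import Path


def zip_generated_html(build_dir, ebook_id, out_dir):
    """Zip every file under build_dir, keeping paths relative to it.

    Hypothetical helper. out_dir should not be inside build_dir,
    otherwise the zip would try to include itself.
    """
    build_dir = Path(build_dir)
    zip_path = Path(out_dir) / f"pg{ebook_id}-h.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(build_dir.rglob("*")):
            if path.is_file():
                # A relative arcname keeps images/cover.jpg under images/.
                zf.write(path, arcname=path.relative_to(build_dir).as_posix())
    return zip_path
```

Because the archive names are relative, the HTML's existing relative URLs need no rewriting.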
That's good, but why flatten everything in cache/epub?
For people who don't download the .zip but instead "Save as / HTML complete" (or similar) from their browser, everything is flattened needlessly. I don't understand why the directory structure is flattened.
Not sure what you mean by "everything is flattened". What is flattened?
Never mind - I chose some bad books to look at, ones where there is no images/ subdirectory.
I looked further, and see that in fact the relative folder structure is maintained. The folder structure is not flattened.
Sorry for my false alarm. I don't see any other desired changes to make the *-h.zip.
Thanks.
super
I'm checking back on this. Is it coded and queued up for the next release of ebookconverter?
yes b5076ac4c3d31027b715737c55d27aa392aaefcc and e1b47ca91920489e1aba4ea6e9e96a965c69f584
Reopening since this isn't quite right yet.
The HTML file in the .zip retains the .utf8 extension, which will prevent it opening correctly in web browsers until the file is renamed.
We do some .htaccess magic in Apache so that the .utf8 extension is handled correctly, but I never thought this was necessary and still don't.
Minimally, the HTML in the .zip needs to end in .html or .htm. But I think it would be perfectly safe for all generated HTML files to end in .htm like in the 1/2/3 filesystem (without .utf8). I realize that .htm versus .html is a preference, and since I presume it's easy to have .htm emitted from the generated code but hard to change all the .htm in 1/2/3 to .html, I prefer changing the generated code.
Minimally, though, the .zip needs to have a file that browsers will recognize as HTML. Thanks for this.
On Fri, Sep 22, 2023 at 10:09 AM Eric Hellman @.***> wrote:
Closed #38 https://github.com/gutenbergtools/ebookconverter/issues/38 as completed.
Filename extension in .zip needs to be fixed. Generally, I am interested in (1) ditching .utf8 and (2) having generated HTML end in .htm.
I would like to just remove the .utf8 from the name of the generated file. But...
I prefer .html because it helps distinguish it from the submitted .htm files, also, changing to .htm will require a number of code changes to avoid breaking links.
unfortunately, changing the generated file name might also not be simple; just changing the file name to .html inside the zip will be simpler. Although, I'll look to see if it's trivial.
Ok, that's fine. I'm not that fussy about .html versus .htm, but it's very user-unfriendly to have .utf8 in the packaged .zip.
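A sketch of the "just change the name inside the zip" option, assuming the generated file on disk still ends in .html.utf8. The helper names are hypothetical; `ZipFile.write` and its `arcname` argument are standard library.

```python
import zipfile


def browser_friendly_name(name):
    """Drop a trailing .utf8 from names ending in .html.utf8."""
    suffix = ".utf8"
    if name.endswith(".html" + suffix):
        return name[: -len(suffix)]
    return name


def add_to_zip(zf, source_path, name_in_zip):
    """Write source_path into an open ZipFile under a browser-friendly name."""
    zf.write(source_path, arcname=browser_friendly_name(name_in_zip))
```

The file on disk is untouched; only the name stored in the archive changes, so existing links to the .html.utf8 file are unaffected.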
I've verified that the magic you have to remove the '.utf8' will serve the '.html' file if it's there, so changing the name of the generated file will be a one-liner.
After a month, we can remove the magic and save one redirect round trip and perhaps a disk seek.
I'd appreciate if you could look at the config magic to double check this.
Sorry, I'm missing something: where should I look?
I thought maybe this was in the .htaccess equivalent, but didn't see it there. If you could give me a little guidance, I will dig into it. Thanks.
I thought it was in .htaccess
If that is what you were thinking, my guess is that a file named .html rather than .html.utf8 will just work. Give it a try. For example, by renaming a file in cache/epub.
There is a rule in .htaccess to redirect .utf8 text files, but nothing for .htm or .html:
RewriteRule ^ebooks/([0-9]+).txt.utf-8$ %{HTTP:X-Forwarded-Proto}://%{HTTP_HOST}/cache/epub/$1/pg$1.txt [L,R]
However, if someone follows a link directly to a .txt, .htm, or .html file, it should "just work." Apache certainly knows how to serve up text and HTML files.
I seem to think there used to be rules for .html.utf8. It's possible there is something in the Apache configuration about this (not .htaccess, but the system configuration). Easy enough to just test and confirm.
Since .htm files in 1/2/3/ are served correctly, I am betting that .html files in cache/epub will also be served correctly. Without needing to add .utf8 to the end of the filename.
Let me know if I might be missing something about this.
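To make the quoted RewriteRule easier to reason about, here is a small Python emulation of its path mapping. It is only an illustration: it ignores the scheme and host part of the substitution, and it escapes the dots in the pattern, which is an assumption since the rule as quoted in this thread shows them unescaped. The function name is hypothetical.

```python
import re

# Pattern from the rule quoted above, with the dots escaped (assumption).
# In .htaccess context the path is matched without its leading slash.
TXT_UTF8_RULE = re.compile(r"^ebooks/([0-9]+)\.txt\.utf-8$")


def rewrite_txt_utf8(path):
    """Return the redirect target path, or None if the rule does not match."""
    match = TXT_UTF8_RULE.match(path)
    if match is None:
        return None
    number = match.group(1)
    # Mirrors /cache/epub/$1/pg$1.txt from the rule's substitution.
    return f"/cache/epub/{number}/pg{number}.txt"
```

A path such as cache/epub/111/pg111-images.html does not match, which is consistent with the observation that this rule has nothing to do with .htm or .html files.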
yes, that's what I did, it works.
I just want to confirm that the rule does what we think it does.
I'd like to know why it redirects twice
Great!
Could you please quote the rule you are referring to?
For example, a request to /cache/epub/111/pg111-images.html gets redirected to /cache/epub/111/pg111-images.html (no change). I'm assuming this is some interaction between Apache and a load balancer.
I haven't found the rule that does this.
It seems that
AddCharset UTF-8 .utf8
captures the extension and sets the encoding in the Content-Type header.
But this is the first rule triggered:
RewriteRule ^ebooks/([0-9]+)\.html\.images$ %{HTTP:X-Forwarded-Proto}://%{HTTP_HOST}/cache/epub/$1/pg$1-images.html [L,R]
So the question is whether we need to add a config to set utf-8 and reproduce the existing behavior.
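A toy model of the behavior under discussion, in which each filename extension can contribute either a media type or a charset, roughly as Apache's mod_mime does with AddType and AddCharset. The tables and function name are made up for illustration, and this says nothing about what else the real server configuration sets (for example a default charset).

```python
import os

# Illustrative tables only; the real server configuration is not shown here.
MEDIA_TYPES = {".html": "text/html", ".htm": "text/html", ".txt": "text/plain"}
CHARSETS = {".utf8": "UTF-8"}  # stands in for: AddCharset UTF-8 .utf8


def content_type(filename):
    """Build a Content-Type value from all extensions of a filename."""
    media_type = None
    charset = None
    for part in os.path.basename(filename).split(".")[1:]:
        ext = "." + part
        media_type = MEDIA_TYPES.get(ext, media_type)
        charset = CHARSETS.get(ext, charset)
    if media_type is None:
        return None
    return f"{media_type}; charset={charset}" if charset else media_type
```

In this model, dropping the .utf8 extension keeps the media type but loses the explicit charset parameter, which is the gap the question about adding a config refers to.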
Just to confirm: What we're talking about is having generated HTML files end in .html rather than the current ending of .html.utf8. Right?
I am confident we don't need any new Apache rule for this. It is bread-and-butter behavior for the web server, out of the box.
The rule you cite does, indeed, rewrite the URL for .html.images to .html. I don't think that's related to files ending in .utf8.
From your example, I did not see a redirect just now:
$ wget https://www.gutenberg.org/cache/epub/111/pg111-images.html
--2023-09-29 16:57:50-- https://www.gutenberg.org/cache/epub/111/pg111-images.html
Resolving www.gutenberg.org (www.gutenberg.org)... 2610:28:3090:3000:0:bad:cafe:47, 152.19.134.47
Connecting to www.gutenberg.org (www.gutenberg.org)|2610:28:3090:3000:0:bad:cafe:47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 551065 (538K) [text/html]
Saving to: ‘pg111-images.html’

pg111-images.html 100%[===================>] 538.15K 768KB/s in 0.7s

2023-09-29 16:57:51 (768 KB/s) - ‘pg111-images.html’ saved [551065/551065]
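For anyone who wants to repeat this check from Python rather than wget, here is a sketch that follows redirects one hop at a time and records each status, which makes a double redirect visible. The function names are hypothetical; the `http.client` calls are standard library. Results will depend on the network path (for instance, whether a load balancer sits in front of Apache), so no expected output is shown.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def head(url):
    """Send one HEAD request; return (status, Location header or None)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        conn.request("HEAD", target)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def redirect_chain(url, fetch=head, max_hops=5):
    """Return [(url, status), ...] for each hop until a non-redirect."""
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in REDIRECT_CODES or not location:
            break
        url = urljoin(url, location)  # Location may be relative
    return hops
```

Calling `redirect_chain("https://www.gutenberg.org/cache/epub/111/pg111-images.html")` would list every hop; a single entry with status 200 would match the wget run above.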
everything looks correct
I'm not sure if this is for ebookconverter or elsewhere in the processing chain.
For HTML and plain text, ebookmaker adds the header+metadata and footer to posted books.
For HTML, I would like to stop pushing -h.zip and instead have that created after the header+metadata and footer are added. The -h.zip can then go in cache/epub/xxx rather than in 1/2/3/...
The *-h.zip is a very useful format since it allows download of the HTML file plus all assets. But I'd like it to have the correct up-to-date metadata from the catalog whenever the other generated formats are built.