Closed by julianorrish 4 months ago
Created https://farm.openzim.org/recipes/bookdash_en_all but the scraper failed. Opened ticket 142 at zimit/issues.
Closing this ticket here as the recipe has been created and will run again when the bug/issue has been fixed.
I just requested the recipe with Zimit2. It looks like the ZIM is working with Zimit1, but we forgot to review it and move the recipe back to prod.
In fact, only "online" (on the website) reading of the books works. Downloading a copy does not work because all content is on Dropbox and cannot be scraped easily.
Audio reading seems to be broken; at least, no books are displayed as containing an audio version. Maybe the ZIM is too old and audio did not exist at that time?
The Zimit2 ZIM is not yet in the dev library due to another bug in Zimit2 (probably solved soon).
Downloading a copy does not work because all content is on Dropbox and cannot be scraped easily.
We have a relationship with the founders. Do you think it is worth suggesting that they move their material to something more easily scrapable?
Looks a bit weird to me to suggest that. I think they have their reasons to host their content on Dropbox, and these reasons are probably good from an overall point of view. At least as an IT architect, I would probably not recommend that they move off Dropbox. We can still inform them of the problem this causes us, and see if they have an alternative to offer or already have a plan in mind. What could maybe work, and be useful from a usability perspective, is to link the PDF directly behind the "Read this book" button instead of linking the folder where the PDF is placed. But maybe this is done intentionally, because in many cases there are multiple files.
And one more general remark: are we sure we want the PDFs in addition to the "online" version? I mean, it will probably make the ZIM significantly bigger ... and it is already 12G. An alternative could be to simply hide (with CSS) all these "broken" links.
Ah hiding would definitely be best, yes
So hiding is not easy, because there is very little in the HTML that we can bind to in order to write a proper CSS selector for these buttons.
@Popolechien could we maybe ask the founders to add an id attribute to the buttons, so that it is easier for us to hide them?
Currently HTML code on a random book looks something like this:
<div class="grid_1-2-md">
<p>Khanyiswa is not the first born. She’s not the last born. She’s the one in the MIDDLE!</p>
<div class="display-js-b">
<a href="#read-book" class="btn btn--block btn--primary mb-3 jsExposeToggle" data-expose-target="#read-book"
data-expose-modal="true">Read this book</a>
</div>
<div class="txt-center mb-3">
<a href="https://www.dropbox.com/scl/fo/3rf5mw1pn228r7ivrnll4/h?rlkey=iqiuos9zicm5h13s9wv955jkw&dl=0"
class="btn btn--primary txt-md py-2">Download ebook</a>
</div>
<div class="txt-center mb-3">
<a href="https://www.dropbox.com/scl/fo/i7z8isn7qqim7nhby38b2/h?rlkey=yfuei5y7e41wv0zbrmb9r4d0z&dl=0">Download
this book's source files</a>
</div>
<div class="mt-2 mb-3 txt-center">
<ul class="list list--csv">
<li><a href="../../languages/eng">English</a></li>
</ul>
<p class="m-0 txt-sm">ISBN: 978-1-77623-126-3</p>
<p class="m-0 txt-sm">BISAC: JUV013000, JUV015000, JUV039000</p>
</div>
</div>
It would help a lot if they could add ids like this:
<div class="grid_1-2-md">
<p>Khanyiswa is not the first born. She’s not the last born. She’s the one in the MIDDLE!</p>
<div id="read_button" class="display-js-b">
<a href="#read-book" class="btn btn--block btn--primary mb-3 jsExposeToggle" data-expose-target="#read-book"
data-expose-modal="true">Read this book</a>
</div>
<div id="download_button" class="txt-center mb-3">
<a href="https://www.dropbox.com/scl/fo/3rf5mw1pn228r7ivrnll4/h?rlkey=iqiuos9zicm5h13s9wv955jkw&dl=0"
class="btn btn--primary txt-md py-2">Download ebook</a>
</div>
<div id="sources_button" class="txt-center mb-3">
<a href="https://www.dropbox.com/scl/fo/i7z8isn7qqim7nhby38b2/h?rlkey=yfuei5y7e41wv0zbrmb9r4d0z&dl=0">Download
this book's source files</a>
</div>
<div id="book_meta" class="mt-2 mb-3 txt-center">
<ul class="list list--csv">
<li><a href="../../languages/eng">English</a></li>
</ul>
<p class="m-0 txt-sm">ISBN: 978-1-77623-126-3</p>
<p class="m-0 txt-sm">BISAC: JUV013000, JUV015000, JUV039000</p>
</div>
</div>
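With ids like these in place, the hiding rule becomes short. The following is only a minimal sketch: it assumes the site actually adds the `download_button` and `sources_button` ids proposed above (they do not exist in the current HTML), and it leaves open how the custom stylesheet is supplied to the scraper.

```css
/* Assumption: the ids below are the ones proposed in this thread;
   they are not present on bookdash.org today. */
#download_button,
#sources_button {
  /* Hide the wrappers around the Dropbox links, which cannot work offline. */
  display: none;
}
```

The "Read this book" button (`#read_button`) is deliberately left untouched, since online reading is the part that works in the ZIM.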
Some audio is also missing; I've updated the recipe configuration to include these audio files as well.
I've also updated the URL to https://bookdash.org/books/ instead of https://bookdash.org/ so that it is clearer which content to expect in this ZIM.
So I've reached out and here is the answer I got:
We are moving all our books onto AWS, to host them there rather than Dropbox. (They are too large to host on the Wordpress site itself, and it's not a great tool for content management).
We're pretty far in a process, and all books are on AWS, but we're now figuring out how best to replace all the links on the website (currently pointing to the book's Dropbox location) to direct to the AWS files/folders instead.
We are also building a user interface on Wordpress site that allows users to browse the backend folder structure directly on AWS. This will mean that partner organisations who want to navigate or download many books at once could have a traditional folder-type view, rather than having to navigate around the website front end (which is designed primarily for easy reading, but not bulk downloads).
We're hitting some roadblocks in this process with generating zipped versions of some of the very large source illustration and design files, but we're making good progress and I imagine we'll action the updates on the website in the next few weeks.
So, my question is whether your challenge will still be the same even once the books are hosted on AWS? I know it has better integration potential than Dropbox, but it might be that in your case this doesn't change much.
If that's the case, we can definitely consider assigning a specific ID to our buttons (and specifically to the "Download ebook" buttons, or whichever ones you wanted hidden) because it's a custom field on Wordpress that our developer created. If it comes to that, I'll put the two of you in touch, and he can determine the cost.
@benoit74 Thoughts?
Let's see how this works once on AWS, but if I read the message properly, it is probably good news: it means we will be able to ZIM everything and will not have to hide the buttons, only the search box. Plus, it means we can probably develop a custom scraper should we need to.
Dev library has a new ZIM: https://dev.library.kiwix.org/viewer#booksdash_en_all_2024-06/
Shall we move this ZIM (and the recipe) to PROD? It is pretty good, uses Zimit2, and is just "as broken" as the file currently in PROD (the "Read this book" button works, but "Download ebook" and "Download this book's source files" do not).
LGTM.
I moved the ZIM to prod: https://download.kiwix.org/zim/zimit/booksdash_en_all_2024-06.zim (the library will update soon).
I also renamed the recipe to https://farm.openzim.org/recipes/booksdash_en_all to match the ZIM name.
Let's close this issue (the ZIM is now published and automatically updated) and create a new one to track fixing the broken links in the ZIM.