openzim / zimit

Make a ZIM file from any Web site and surf offline!
GNU General Public License v3.0

Issues with task 672c1 #190

Closed sgpopovich closed 1 year ago

sgpopovich commented 1 year ago

Dear IT Support Team,

I am writing to report several issues I have been experiencing while trying to access the att.digitallearn.org_5121d011.zim file (task 672c1) in Kiwix JS PWA 2.8.81 (On Windows 10). The file is approximately 731.14 MiB in size. Below are the problems I have encountered:

Inconsistent File Opening: I have noticed that the file behaves inconsistently. Sometimes, when I attempt to open it, I am able to access the content without any problems. However, on other occasions, I am presented with a blank white screen, preventing me from viewing any information.

Non-English Content: When I do manage to access the initial main page, I have encountered non-English content within the file. Specifically, I have come across sections in Spanish and French, which makes it difficult for me to understand and utilize the educational material effectively.

Unavailable Audio/Video Educational Content: Even when I successfully access the main page of the file, I have noticed that the educational content, particularly in the form of audio and video, is not available. This greatly diminishes the learning experience and hinders my ability to benefit from the intended multimedia content.

I have attempted troubleshooting steps such as reinstalling Kiwix and clearing the cache, but these measures did not resolve the aforementioned issues.

Could you please investigate these matters and provide a resolution to ensure smooth and uninterrupted access to the att.digitallearn.org_5121d011.zim file? Much appreciated.

If any additional information is required, or if there are alternative steps I can take to resolve these issues, please do not hesitate to let me know.

Thank you for your time and support.

Sincerely, Steven

rgaudin commented 1 year ago

Ping @jaifroid

Jaifroid commented 1 year ago

@sgpopovich Thank you for the report. The implementation in the Kiwix PWA currently is experimental. We are actually working right now on a more robust solution. Do you have a link to the problematic ZIM, so I can see if there's something obvious that can be patched in the meantime?

sgpopovich commented 1 year ago

Yes, I understand it is experimental - I was hoping luck would be on my side as I wanted to share content with schools in Tanzania (a project I am currently working on).

Here is the link to the ZIM file in question. https://www.dropbox.com/s/025rfeo6xjxv78h/ATT-Digital-Learn.zim?dl=0

Here is a link to the original website content

Thanks for any help you can provide.

sgpopovich commented 1 year ago

Original content link https://att.digitallearn.org

Jaifroid commented 1 year ago

Got it, thanks, I'm taking a look now.

Jaifroid commented 1 year ago

@sgpopovich I've corroborated some of your issues:

  1. Content in Spanish/French: yes, but this is a problem with the scraper, not with the reader. It appears to have scraped some of the content in Spanish. When reading the file via Kiwix Serve (the reference player here), I see the same content in Spanish or French. This means that it was scraped this way, and there is nothing the reader can do to change that. It may be that setting some options in the scraper could help. The affected content, however, is only parts of the UI, so although it's a bit annoying, I suppose it's not a critical bug.
  2. Playing multimedia: I confirm that the PWA reader is failing to load the media (presentations/videos). I'm looking into patching that.
  3. Inconsistent file opening: So far, I haven't been able to corroborate this. Could I check which browser / version you are using? On Firefox 113, Chrome/Edge 113 and Chromium 90, the archive appears to open consistently for me in pwa.kiwix.org.

Jaifroid commented 1 year ago

@sgpopovich It's been suggested that if you run a scrape specifying the language as "eng", in the advanced options, it might filter only English. Did you already try that? The problem appears to be that the home page has a link to switch the interface to Spanish, and that link has clearly been spidered. Unfortunately the link doesn't change the URL, so the spider has probably visited both the English and Spanish versions.

Jaifroid commented 1 year ago

Regarding inconsistent file opening, please be sure that you only have one instance of the Kiwix JS PWA open in your browser, e.g. in different tabs or windows of the same browser/PWA. Having an instance open in another browser is no problem. But there is currently some interference due to the global parameters and settings if you have more than one instance open, and this can result in difficulty opening the Landing page or other articles.

kelson42 commented 1 year ago

@sgpopovich Considering this is valuable free content, IMO we should consider building this ZIM file in our Zimfarm. @Popolechien What do you think?

Jaifroid commented 1 year ago

I've been testing this ZIM with Kiwix Serve (which should be able to play any Zimit ZIM), and it seems that some resources in the lessons have not been fully scraped, or else the spider did not wait long enough on each lesson (each of which runs with an autoplay) to record all the requests and responses. So it might be a timeout issue related to each exercise.

The symptoms are that if I shut off my Internet connection while playing the ZIM, then the lessons appear to work at first, but after 30 seconds or so of autoplay (I didn't time it precisely), some of the resources/images/animations are no longer functioning.

I think this is similar to the issue with lazy-loaded images when scraping some other web sites: the spider needs to scroll each page on an article slowly to the end and leave sufficient time for each image to be loaded. In this case, the spider would need to wait for each autoplayed lesson to finish to be sure of recording all the required requests and responses.
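The timing problem described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Request` type, the `split_by_wait` helper, the URLs and the timings are all invented for the example, and the real crawler's behaviour is not modelled beyond "it leaves the page after a fixed wait".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """A resource request observed while a lesson autoplays (hypothetical type)."""
    offset_s: float  # seconds after the page was opened
    url: str


def split_by_wait(requests, wait_s):
    """Split requests into those a crawler would record and those it would miss.

    A request counts as recorded only if it fires within `wait_s` seconds
    of the page load, i.e. before the crawler moves on to the next page.
    """
    recorded = [r.url for r in requests if r.offset_s <= wait_s]
    missed = [r.url for r in requests if r.offset_s > wait_s]
    return recorded, missed


# Invented timeline for one autoplaying lesson.
timeline = [
    Request(0.5, "lesson/index.html"),
    Request(2.0, "lesson/slide-1.png"),
    Request(31.0, "lesson/slide-2.png"),
    Request(58.0, "lesson/audio-2.mp3"),
]

recorded, missed = split_by_wait(timeline, wait_s=30)
print("recorded:", recorded)
print("missed:", missed)
```

With a 30-second wait, the two late requests never reach the archive, which would match the symptom of lessons breaking partway through when offline. Raising the wait past the end of the autoplay would capture them, at the cost of a much slower crawl.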

Popolechien commented 1 year ago

@kelson Yes I agree. @sgpopovich Would you mind dropping us an email at hello at kiwix.org to describe your Tanzania project? We're trying to map out deployments

rgaudin commented 1 year ago

@sgpopovich thank you for the report; here's some information regarding the double-language issue:

To prevent this from happening you should request the crawler not to follow the link to /home_language_toggle?lang=es.
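As a rough illustration of what such an exclusion amounts to, here is a minimal Python sketch of a regex-based link filter. The pattern and the `should_crawl` helper are assumptions made for the example; the crawler's own option for excluding URLs is not named in this thread, so none is shown.

```python
import re

# Assumed pattern: matches the language-toggle link mentioned above,
# whichever language code it carries.
EXCLUDE = re.compile(r"/home_language_toggle\?lang=")


def should_crawl(url, exclude=EXCLUDE):
    """Return True if the URL does not match the exclusion pattern."""
    return exclude.search(url) is None


links = [
    "https://att.digitallearn.org/",
    "https://att.digitallearn.org/home_language_toggle?lang=es",
    "https://att.digitallearn.org/courses/basic-search-ca4d47b1-9ddf-4446-a5d2-3e12de97595f/lessons/a-basic-search",
]

# Only the links that survive the filter would be queued for a visit.
to_visit = [u for u in links if should_crawl(u)]
print(to_visit)
```

Matching on the path and the `lang=` parameter, rather than on `lang=es` alone, would also skip a French toggle if the site exposes one.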

@Jaifroid indeed there are options to customize the waiting behavior of the scraper. The problem here being that the website uses a custom video player and not the browser's one.

It looks like the slideshows won't work either, as slides and audio are loaded asynchronously (same issue as video), but it also seems that the rest of the slideshow awaits a click on Next Slide to fetch the additional resources, which we can't instruct the scraper to do.

sgpopovich commented 1 year ago

@Jaifroid I am using Microsoft Edge Version 113.0.1774.50 Official Version (64-bit). But I do not understand the context of your question, as I am accessing the content (att.digitallearn.org_5121d011.zim) from within Kiwix. Explain please.

sgpopovich commented 1 year ago

> Regarding inconsistent file opening, please be sure that you only have one instance of the Kiwix JS PWA open in your browser, e.g. in different tabs or windows of the same browser/PWA. Having an instance open in another browser is no problem. But there is currently some interference due to the global parameters and settings if you have more than one instance open, and this can result in difficulty opening the Landing page or other articles.

Jaifroid - Based on your statement above, I may be misusing the Kiwix app. I am running it as an App on Windows 10. Not in a browser. See attached file for what this looks like on my laptop. Kiwix-2023-05-24

sgpopovich commented 1 year ago

> @sgpopovich Considering this is valuable free content, IMO we should consider building this ZIM file in our Zimfarm. @Popolechien What do you think?

Kelson42, this content is highly valuable, especially considering the lack of digital literacy prevalent in East Africa. This issue is likely common across developing economies worldwide. I fully support your idea, as it would eliminate the need for technical challenges associated with scraping the content. Moreover, I know several organizations that would greatly benefit from accessing your content directly, making their work much easier.

By the way, I also have another website containing a large collection of free valuable educational content that you might find interesting. Currently, we only have it organized in file folders, but having it available on Kiwix would be immensely valuable. You can find the content at this link: https://bestedlessons.org/2021/09/21/gcf-learn-2000-free-lessons-video-courses/#:~:text=To%20access%20all%20of%20these,default%20folder%20with%20same%20name. Let me know if you think it can be added to the farm.

sgpopovich commented 1 year ago

> @Kelson Yes I agree. @sgpopovich Would you mind dropping us an email at hello at kiwix.org to describe your Tanzania project? We're trying to map out deployments

Kelson, I am working on the Email and will send it to "hello at kiwix.org". Thanks

kelson42 commented 1 year ago

@sgpopovich Can you please open one ticket per content you would like us to consider scraping and publishing, at https://github.com/openzim/zim-requests ?

kelson42 commented 1 year ago

I have created https://farm.openzim.org/recipes/att_connected_learning_en, let's see what we can achieve.

sgpopovich commented 1 year ago

@kelson42 Kelson, can you let me know when att_connected_learning_en is ready to try and where to find it?

Jaifroid commented 1 year ago

> Jaifroid - Based on your statement above, I may be misusing the Kiwix app. I am running it as an App on Windows 10. Not in a browser.

@sgpopovich You are using it correctly, I just wanted to know in which browser version you had installed the PWA, which you've answered. I can't reproduce the error with opening the main page for now, but let's wait and test the new version. Video/audio content is a challenge to support in these ZIM types, but it can't be properly tested and/or patched until we have a ZIM version that contains all the slides/presentations properly scraped.

Jaifroid commented 1 year ago

@sgpopovich You'll be happy to know that the new run that @kelson42 made was successful in scraping only the English-language UI. The ZIM is here: https://mirror.download.kiwix.org/zim/.hidden/dev/att-connected-learning_en_all_2023-05.zim . I'm testing the other issues mentioned above.

Popolechien commented 1 year ago

Sound seems to be missing as well as, weirdly enough, part of the video. Compare, for instance, https://att.digitallearn.org/courses/basic-search-ca4d47b1-9ddf-4446-a5d2-3e12de97595f/lessons/a-basic-search

[Screenshot 2023-05-25 at 09 23 49]

with what zimit came up with at https://dev.library.kiwix.org/viewer#att-connected-learning_en_all_2023-05/A/att.digitallearn.org/ (to find the exact course, go to Basic Search and then 2. Basic Search):

[Screenshot 2023-05-25 at 09 23 41]

Jaifroid commented 1 year ago

Yes, unfortunately the timeout is probably not long enough to wait for all content to load in the lessons. Also, anything interactive (i.e. that demands clicking on elements inside the lesson screen) is unlikely to be spidered. However, @kelson42 will try re-running this with a longer timeout to see if it can collect all/most of the resources of each lesson. Fortunately the lessons appear to autoplay, so waiting long enough should catch most requests and responses, but it may never be fully interactive.

sgpopovich commented 1 year ago

@Jaifroid I tried @Popolechien's content above - it is looking good. I appreciate how far you have gotten with this.

Jaifroid commented 1 year ago

@sgpopovich Although it's better, it seems clear that the content of lessons (part of it) is not fully scraped. If you are seeing complete lessons, it's probably because it's getting some of the resources directly from the Internet rather than from the ZIM file, which is of course a problem in fully offline contexts. We'll see if a new run with a longer timeout manages to get more of the lesson content into the ZIM.

sgpopovich commented 1 year ago

@Jaifroid @Popolechien - Dev Team: what is the current status? Has work stopped, or is the investigation continuing?

Jaifroid commented 1 year ago

@sgpopovich I think we were waiting for a new version of the ZIM to see if increased timeouts fix the scraping of the lessons. I think this is the new ZIM: https://mirror.download.kiwix.org/zim/.hidden/dev/att-connected-learning_en_all_2023-05.zim, it's 703MB compared to the previous one of 650MB, so it's clearly got more stuff in it. Need to test...

Popolechien commented 1 year ago

I still get the same issue as before (no sound, video or parts of it missing).

Jaifroid commented 1 year ago

OK, I've just tested, and as @Popolechien confirms, it's still not scraping everything (in fact, for the lessons I tested, I didn't see much improvement).

While we'll open a ticket on the relevant Replay GitHub, and will link to it here, this ZIM is challenging to scrape in an automatic way because it has interactive elements in the lessons, and the spider isn't intelligent enough to interact with the lessons in the way that a human would. The way it works is that it records the visit to the web site, and it clicks on anything it can click on, but it can't interact with content inside a lesson.

However, we can suggest that you consider doing a "manual" scrape using the ReplayWeb Recorder here: https://archiveweb.page/ (documentation here: https://archiveweb.page/guide).

What you would need to do is perhaps a bit boring: you'd need to use the recorder to record a "complete" visit to the Web site. This means you'd need to interact with the Web site and click on each lesson, let it run through to the end, and interact as necessary with the lessons. Then you would need to export the resulting recording to an archive, either in WARC format or in WACZ format. The WACZ could then be distributed, and your users would use the Replay Web PWA to play it back (rather than using Kiwix).

Potential challenges are:

  1. The size of the WACZ that you create. Possibly your browser might not be able to cope with such a large file stored internally in the browser cache and then exported. However, it may not be a problem in practice. Only trying it can tell.
  2. Users may not have powerful-enough browsers to read the resulting file (again, this needs testing). Unlike Kiwix, Replay Web requires the latest browsers.
  3. It may not be possible to record the whole site. However, it may be possible to create a number of smaller WARCs of each group of lessons.
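
If the site does have to be recorded in several smaller archives, one way to plan the batches is to group lesson URLs by course. The sketch below assumes that lesson pages follow the `/courses/<course>/lessons/<lesson>` shape seen in the lesson link posted earlier in this thread; the helper name and the URLs marked hypothetical are invented for the example.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_lessons_by_course(urls):
    """Group lesson URLs by the course slug found in their path.

    Expects paths shaped like /courses/<course>/lessons/<lesson>;
    URLs with any other shape are skipped.
    """
    groups = defaultdict(list)
    for url in urls:
        parts = [p for p in urlparse(url).path.split("/") if p]
        if len(parts) == 4 and parts[0] == "courses" and parts[2] == "lessons":
            groups[parts[1]].append(url)
    return dict(groups)


BASIC_SEARCH = "basic-search-ca4d47b1-9ddf-4446-a5d2-3e12de97595f"
urls = [
    f"https://att.digitallearn.org/courses/{BASIC_SEARCH}/lessons/a-basic-search",
    # The next two URLs are hypothetical placeholders, not real pages.
    f"https://att.digitallearn.org/courses/{BASIC_SEARCH}/lessons/hypothetical-second-lesson",
    "https://att.digitallearn.org/courses/hypothetical-course/lessons/hypothetical-lesson",
    "https://att.digitallearn.org/",
]

batches = group_lessons_by_course(urls)
for course, lessons in batches.items():
    print(course, len(lessons))
```

Each key would then correspond to one recording session and one exported archive.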

What I would say is that you should test this solution first on a few lessons before attempting to scrape the full web site manually in this way.

If you are still interested in distributing a ZIM, then you'd need to save the archive to a WARC format rather than a WACZ format. We (or you) could then convert it to a ZIM using our warc2zim tool.

However, I have to say that for your purposes, it might be simpler just to distribute the WARC/WACZ file directly and use Replay Web PWA (or Chrome extension) as the player instead of Kiwix. We currently don't have a simple-enough way to play back a Zimit (WARC-based) ZIM that children could operate, because the Kiwix JS PWA is experimental and doesn't have complete support for audio/video yet, and the Kiwix Serve solution (which can replay audio/video) is too complicated for children.

What is your use case? Clearly the school children would need some way of getting hold of the ZIM or WARC/WACZ and the software to play it. Is the idea that they'd download this while in school, and then take it home to use offline? Or would you be setting it up on a school intranet, in which case you might need a solution like Kiwix Serve (which can fully play ZIM archives and serve the resource transparently to users' browsers)?

Jaifroid commented 1 year ago

PS We are working on a simpler solution for replaying Zimit (WARC-based) ZIMs in a high-fidelity manner. Depending on your timeline, if you are prepared to wait, we should have that solution soon, but that could be several weeks or more. But first, you'd need to be sure that you can scrape the lessons in Replay Web manually, and that you can get a good quality recording that can be replayed with all the interactive features.

sgpopovich commented 1 year ago

@jaifroid I tried it and it still does not have any of the video.

Jaifroid commented 1 year ago

@sgpopovich The good news is that archiveweb.page (I used the Chrome extension in Edge) can record the lessons/videos with sound and the replayweb.page PWA can replay the WARC or WACZ exported. I did a test WARC here, scraped using the manual method outlined above and exported as WARC v1.1 (you can also export as WACZ). I only scraped the first two lessons:

To replay this, download it, go to https://replayweb.page/ in any modern browser, and load the attlearning.warc you downloaded. You can install replayweb.page as a PWA in Chromium browsers, or bookmark it in Firefox. In my test, it worked with my PC offline even when restarting the browser and going straight to https://replayweb.page/ in Firefox without reactivating the Internet. Of course you have to have visited replayweb.page at least once for the PWA to cache itself. You could also use the Chromium extension to replay the archive rather than relying on the PWA's self-caching.

The not so good news is that, as I said above, to scrape the full site, you'd probably need to go manually through all the lessons and interact with everything that can be interacted with.

Unless you have a really strong reason to distribute a ZIM as opposed to a WARC/WACZ, I think you now have a viable solution.
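
Before distributing an exported WACZ, it may help to check how large it is and where the bytes go. A WACZ is a ZIP container, so Python's standard `zipfile` module can read it. The sketch below is an illustration only: `summarize_wacz` is a made-up helper, and the stand-in archive it reads uses entry names assumed from the WACZ layout, with fake payloads.

```python
import io
import zipfile


def summarize_wacz(source):
    """Return (total_uncompressed_bytes, {top_level_entry: bytes}) for a WACZ.

    `source` may be a filesystem path or a binary file-like object.
    """
    per_entry = {}
    with zipfile.ZipFile(source) as zf:
        for info in zf.infolist():
            top = info.filename.split("/", 1)[0]
            per_entry[top] = per_entry.get(top, 0) + info.file_size
    return sum(per_entry.values()), per_entry


# Build a tiny stand-in archive in memory so the sketch runs anywhere.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("datapackage.json", b"{}")
    zf.writestr("archive/data.warc.gz", b"x" * 1000)
    zf.writestr("pages/pages.jsonl", b"y" * 10)
buf.seek(0)

total, per_entry = summarize_wacz(buf)
print(total, per_entry)
```

For a real export, passing the path of the downloaded file instead of `buf` would give the same kind of summary.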

Jaifroid commented 1 year ago

For internal Kiwix testing, I made a ZIM from this WARC using the dev version of the Zimit docker image of warc2zim:

The videos/lessons play fine in Kiwix Serve, which proves that the problem is with the automatic scraping.

(As expected, and as warned in the "experimental" notice provided in the PWA, video/audio content doesn't work in the PWA, as I only currently have support for YouTube sources. This is an issue with the PWA which I'm working on.)

Jaifroid commented 1 year ago

@sgpopovich Did the above procedure solve the issue for you?

For the sake of completeness, I thought I'd let you know that v2.5.0 of the UWP version of Kiwix JS, running in Service Worker mode, can now play the manually scraped archive linked in my previous comment, with video and sound, fully offline (see screenshot). This version only works on Windows 10 or 11, and is available either from the Microsoft Store or by direct download of the appxbundle. To run in Service Worker mode, the app needs one-time access to the Internet on first launch to cache the PWA code.

NB The manually scraped ZIM only contains the first lesson.

I am unsure why the Chromium PWA version, running the same code, cannot yet play the lesson. But the fact that the UWP version can suggests that there is no fundamental blocker to fixing this. The main difference is that the UWP version uses the old Edge Legacy web view, rather than Chromium.

image

Jaifroid commented 1 year ago

Issue should probably be closed now (I don't have access to do that). Any followup on the specifics of the UWP vs PWA rendering of the test ZIM can be discussed on https://github.com/kiwix/kiwix-js-windows/issues/420.

tempo660 commented 10 months ago

Hi. I'm receiving an Internal Server Error when attempting to submit a ZIM conversion request. I assume the site is just temporarily down? Unable to create schedule: 500: Internal Server Error.

rgaudin commented 10 months ago

@tempo660, this ticket is closed and your message concerns a different issue. You should have opened a separate ticket. Can you please describe exactly what you are trying to do? Which URL gives you a 500 error and what did you input?

tempo660 commented 10 months ago

@rgaudin Apologies, I just felt it didn't warrant its own issue. The URL I attempted to submit is this. I filled in some additional fields too.

Language=eng
Title=pilkipedia
ZIM filename=pilkipedia

rgaudin commented 10 months ago

Just tested with those exact values; no error.

rgaudin commented 10 months ago

OK, I now understand your message. Your issue is with the website returning 500 Errors when crawling. I think you should report that to the website. It doesn't appear related to zimit at all.

tempo660 commented 10 months ago

I assume webrecorder.net is where I need to go to contact the site master. There's scant info on the zimit site about who to contact.