openzim / ifixit

iFixit to ZIM scraper
GNU General Public License v3.0

Download attached PDF documents when opening page #108

Open redtux opened 1 month ago

redtux commented 1 month ago

Hi, I stumbled over a dead link in the ZIM file.

The iFixit German archive has just been downloaded on a fresh install of Kiwix.

Now I wanted to verify that the page is really missing in the wiki, but "unfortunately" it's not. 🙂

Here's the link: https://de.ifixit.com/Device/Lenovo_ThinkPad_T460p

Original link: /Document/PDhLL6RFxYZE3Hre/t460p_hmm_en_sp40k04964_02.pdf

(The above was copied from the Kiwix error.)

Screenshot_2024-07-17-22-27-51-432_org.kiwix.kiwixmobile.jpg

redtux commented 1 month ago

Okay, I should have checked before:

This issue occurs with many devices of various vendors. Does it make sense to make a list with working and broken pages?

benoit74 commented 1 month ago

This is not really a bug, it is a limitation ^^

As mentioned in the error message, the current scraper simply does not retrieve this kind of item into the ZIM. Supporting it would need additional effort.

It is, however, new to me that we now have "Documents" in iFixit guides; we will need to check that. Thank you for reporting, and there is no need to list the affected pages. Unless I missed something, I think that all "Documents" are simply missing from the ZIM.

redtux commented 1 month ago

Thank you for the quick reply, and for the clarification! I changed the title now, so this might be considered a feature request. 🙂

Would the respective wiki page be scraped if it contained no PDF attachment? There is also normal text missing; the PDF issue was just an additional piece of information.

If downloading the PDF in addition to the ZIM is out of scope for this project (as that seems to be a client task), maybe PDFs (or rather all kinds of attachments) could be treated as external links?

That way my system is responsible for downloading the PDF and opening it with my default PDF reader.
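To illustrate the idea, here is a minimal sketch in Python. It is not the scraper's actual code: the helper names `is_document_link` and `to_external_link` and the `BASE_URL` constant are hypothetical, and it assumes that documents are always served under a `/Document/` path, as in the link reported above.

```python
from urllib.parse import urljoin, urlparse

# Assumption: base URL of the scraped site (the German site from this issue).
BASE_URL = "https://de.ifixit.com"

# Assumption: documents live under a /Document/ path, as in the reported link.
DOCUMENT_PREFIX = "/Document/"


def is_document_link(href: str) -> bool:
    """Return True if href points at a document path on the site."""
    return urlparse(href).path.startswith(DOCUMENT_PREFIX)


def to_external_link(href: str, base_url: str = BASE_URL) -> str:
    """Make document links absolute so a reader can hand them to the system browser.

    Links that are not documents are returned unchanged.
    """
    if is_document_link(href):
        return urljoin(base_url, href)
    return href


if __name__ == "__main__":
    print(to_external_link("/Document/PDhLL6RFxYZE3Hre/t460p_hmm_en_sp40k04964_02.pdf"))
    print(to_external_link("/Device/Lenovo_ThinkPad_T460p"))
```

With a rewrite like this, the document link would point to the online site instead of a missing ZIM entry, so it would only work while the device is online.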

benoit74 commented 1 month ago

Do you have an example of normal text missing? This is not really expected.

The philosophy so far has been to focus on what is really important for an offline user (categories, guides, ...) and postpone to "later" what is less important: items (parts and tools), wikis, ...

I had a quick look, and documents seem to be becoming a very important part of iFixit now that some companies are providing them to iFixit. I think we should "urgently" add support for these. Most of our users are offline and won't be able to use the external link. I cannot provide an ETA, however; hopefully in the coming months.

You mention other kinds of attachments; do you have an example? Is it still a document (i.e. in a Documents section, and served on a /Document URL)?

redtux commented 1 month ago

Sorry for the confusion, I did not have any other file formats in mind. 🙂 Pictures seem to be fetched fine, and I saw no videos or any attachments other than PDFs.

Concerning the missing text, I was referring to my initial link: https://de.ifixit.com/Device/Lenovo_ThinkPad_T460p

This page contains at least a summary, a TOC, and some categories. As already shown by the above screenshot, this information seems to be missing. I could test this with some more pages if needed. 👍

And yes, I perfectly understand that external links are not a real solution (not even a workaround) — but as a short-term "hack" (until this gets solved) it might be considered more intuitive for new users than the information currently displayed (which I obviously did not fully understand without your explanation in this issue 🫢). What do you think?

Thanks for the great work — and in case I could help with some more testing, just let me know pls.