gnosygnu / xowa

xowa offline wiki application

Http_server: Fix images not downloading on some Portal pages (images sometimes not appearing) #686

Open desb42 opened 4 years ago

desb42 commented 4 years ago

As described by @Ope30 in #680, the page de.wikipedia.org/wiki/Portal:Wikipedia_nach_Themen seems to be inconsistent in displaying images

I have seen this before in other wikis

Taking this page as an example, the image to the right of Geographie is chosen 'randomly' from a list of 5 (in this case) images. The wikitext is (ANZAHL = number of images, SAAT = seed):

{{Zufallsbild
| ANZAHL = 5 | SAAT = 1
| 1 = [[Datei:Views of Geneva.jpg|right|150px|Genf]]
| 2 = [[Datei:Hn-caecilien66-web.jpg|right|150px|Villa Faißt in Heilbronn]]
| 3 = [[Datei:Collage of views of Poznan, Poland.jpg|right|150px|Posen]]
| 4 = [[Datei:Arrasate-mondragon.jpg|right|150px|Baskenland]]
| 5 = [[Datei:Akihabara Electric Town 2.jpg|right|150px|Tokio]]
}}

If I take just the list of images

[[Datei:Views of Geneva.jpg|right|150px|Genf]]
[[Datei:Hn-caecilien66-web.jpg|right|150px|Villa Faißt in Heilbronn]]
[[Datei:Collage of views of Poznan, Poland.jpg|right|150px|Posen]]
[[Datei:Arrasate-mondragon.jpg|right|150px|Baskenland]]
[[Datei:Akihabara Electric Town 2.jpg|right|150px|Tokio]]

and cut out all the rest of the wikitext, replacing it with just these file links, then when I Show preview (Vorschau zeigen), I get two images and three failures

gnosygnu commented 4 years ago

Taking this page as an example, the image to the right of Geographie is chosen 'randomly' from a list of 5 (in this case) images. The wikitext is:....

Yeah, I don't think this is resolvable. I don't know of a way to identify all the images in these "revolving" templates. I remember running across this early on in a random enwiki page for India (it switched the image based on the time of day)

The problem is that the hdump process loads a page only once, so if there is a "revolving" image template, only one of the many images will be downloaded. I could try scanning the raw template text, but that becomes extremely difficult: you could get things like "{{random_template|Views of Geneva.jpg|Hn-caecilien66-web.jpg}}", which would need template parsing.
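
For illustration, a minimal scan over the raw wikitext (hypothetical code, not XOWA's) shows the difficulty: a direct [[File:...]]/[[Datei:...]] link is easy to find, but a filename passed as a bare template parameter never matches without expanding the template first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only (not XOWA code): why raw-text scanning falls short.
public class RawScanDemo {
    // Matches direct image links like [[Datei:Foo.jpg|...]]
    static final Pattern LNKI = Pattern.compile("\\[\\[(?:File|Datei):([^|\\]]+)");

    static List<String> scan(String wikitext) {
        List<String> files = new ArrayList<>();
        Matcher m = LNKI.matcher(wikitext);
        while (m.find()) files.add(m.group(1).trim());
        return files;
    }

    public static void main(String[] args) {
        // Direct link: found
        System.out.println(scan("[[Datei:Views of Geneva.jpg|right|150px|Genf]]"));
        // Bare template parameter: missed -- only template expansion would reveal it
        System.out.println(scan("{{random_template|Views of Geneva.jpg|Hn-caecilien66-web.jpg}}"));
    }
}
```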

For now, I'll leave this as a known issue in the backlog. Let me know if you have any other thoughts. Thanks

desb42 commented 4 years ago

I have been doing a bit of digging and think I can explain the issue

Taking en.wikipedia.org/wiki/Portal:Arts as an example: this has many sections that involve random selection

It is not the randomness that is the cause (I believe)

Generating from the wikitext, the randomness potentially produces new images to 'download'; the download process runs, and then the wikitext is processed a second time

This second pass potentially generates a different set of images, which do not go through another download, hence causing the process not to find a valid image
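
A toy model of that sequence (illustrative only, not XOWA code): each parse re-rolls the random template, so the second pass can ask for a file the first pass never queued.

```java
import java.util.List;
import java.util.Random;

// Toy model of the two-pass mismatch described above.
public class TwoPassRace {
    static final List<String> CHOICES = List.of(
        "Views of Geneva.jpg", "Hn-caecilien66-web.jpg",
        "Collage of views of Poznan, Poland.jpg",
        "Arrasate-mondragon.jpg", "Akihabara Electric Town 2.jpg");

    // Stands in for expanding the {{Zufallsbild}}-style template
    static String expandRandomTemplate(Random rnd) {
        return CHOICES.get(rnd.nextInt(CHOICES.size()));
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        String pass1 = expandRandomTemplate(rnd);  // the download queue gets this file
        String pass2 = expandRandomTemplate(rnd);  // the rendered html asks for this one
        System.out.println("downloaded: " + pass1);
        System.out.println("rendered:   " + pass2);
        if (!pass2.equals(pass1))
            System.out.println("mismatch -> image missing on the page");
    }
}
```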

desb42 commented 4 years ago

In principle, the second pass could be performed on the html generated in the first pass. A bit like hdump?

desb42 commented 4 years ago

In light of the above comment, I have made some changes to a few files to implement this concept

The basic idea is this: during the html construction, when a file is not already in the file subdirectory, change the generation of the link to use the hdump formatter; then, once the files have been downloaded, pass the generated html through the hdump process (hopefully that makes sense)

I have introduced a new function, Parse(src, page), into Xow_hdump_mgr_load.java, which is called from Http_server_page.java

The other change (a bit hacky) is in Xoh_file_wtr__basic.java: I change html_fmtr to use the fmtr__hdump formatter if the current formatter is fmtr__basic and the file does not exist (a toy version of the whole flow is sketched below)
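
A self-contained toy version of the proposed flow. Only the idea and the names Parse / fmtr__basic / fmtr__hdump come from this thread; every identifier below is a stand-in, not the actual XOWA source.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of the two-pass approach (not XOWA source).
public class TwoPassSketch {
    static final Set<String> onDisk = new HashSet<>();
    static final Pattern PLACEHOLDER = Pattern.compile("data-hdump=\"([^\"]+)\"");

    // Pass 1 (Xoh_file_wtr__basic role): final link if the file exists,
    // otherwise an hdump-style placeholder that pass 2 can rewrite.
    static String writeLink(String file) {
        return onDisk.contains(file)
            ? "<img src=\"file/" + file + "\">"      // fmtr__basic behaviour
            : "<img data-hdump=\"" + file + "\">";   // fmtr__hdump behaviour
    }

    // Download step: fetch whatever pass 1 left as a placeholder.
    static void downloadMissing(String html) {
        Matcher m = PLACEHOLDER.matcher(html);
        while (m.find()) onDisk.add(m.group(1));     // pretend the download succeeded
    }

    // Pass 2 (the new Parse(src, page) role): resolve placeholders to real links.
    static String parse(String html) {
        return PLACEHOLDER.matcher(html).replaceAll("src=\"file/$1\"");
    }

    public static void main(String[] args) {
        String html = writeLink("Views of Geneva.jpg");
        System.out.println("pass 1: " + html);
        downloadMissing(html);
        System.out.println("pass 2: " + parse(html));
    }
}
```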

Please see attached rebuild.zip (definitely a work in progress)

gnosygnu commented 4 years ago

My apologies here. I missed the comments from 2 weeks ago when my email was weird

Thanks for the code files. I took a look at the attached rebuild.zip, and I think it won't handle the html static image dumps. Calling fmtr__hdump may allow the GUI / HTTP_SERVER to show the image, but it won't log the image for the html static image dumper (the main call is here: https://github.com/gnosygnu/xowa/blob/master/400_xowa/src/gplx/xowa/parsers/lnkis/Xop_lnki_wkr.java#L75). I can alter Xoh_file_wtr__basic to make this call, but I wanted to reproduce this on my side first (schematic below).
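
Schematically, the concern is that an image link feeds two consumers, and the formatter swap only satisfies the first. All names below are hypothetical stand-ins, not the actual Xop_lnki_wkr API:

```java
// Hypothetical stand-ins; the real registration happens via Xop_lnki_wkr.java
// (linked above), not through these names.
void onImageLink(String file) {
    queueDownload(file); // enough for the GUI / HTTP server to display the image
    dumpLog.add(file);   // also needed so the html static image dumper records it
}
```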

Generating from the wikitext, the randomness potentially produces new images to 'download'; the download process runs, and then the wikitext is processed a second time

I tried to debug this further on my side, but with the XOWA GUI and no image databases, all the images on en.wikipedia.org/wiki/Portal:Arts show up (they are "random", so each refresh of the page will download new images from the internet). I'll be downloading de.wikipedia.org sometime tonight, so will take a look at de.wikipedia.org/wiki/Portal:Wikipedia_nach_Themen. Is that the best page to witness the behavior in the excerpt above?

desb42 commented 4 years ago

I believe that this behaviour is 'limited' to xowa-http, due to the complete reprocessing of a page if an image is missing (in Http_server_page.java)

The changes I suggested above seem to work in xowa-http, but I forgot to check what impact there would be for xowa-gui (which I think does it a different way)

desb42 commented 4 years ago

Attached is a version of Xoh_file_wtr__basic.java that takes account of the application mode. This seems to make things OK with xowa-gui

Xoh_file_wtr__basic.zip

desb42 commented 4 years ago

Having been playing with the xowa-gui version and the page en.wikipedia.org/wiki/Portal:Arts, I have noticed some inconsistent behaviour. I start with a fresh build of xowa (xowa_get_and_make.sh); this deletes all files in the /file/ subdirectory

Start xowa and in Options->Wiki - HTML Databases untick 'Prefer HTML Databases for Read tab' (so as to always use wikitext)

In a new tab request the above page

The page loads and all images load (along with the appropriate text)

However, if, within the page, I right-click and choose 'Reload Page', the page loads but some images are missing (attached screenshot: random1)

If I go to the address bar and hit carriage return (or enter), the page loads with all (random) images

My version of xowa exhibits the same problem; however, I have added a line of code to Xof_xfer_queue.java that indicates which file is being downloaded (a System.out.println)
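
The debug line itself is nothing more than a print. A hypothetical form (the variable holding the file name is a stand-in for whatever the queue actually uses):

```java
// Hypothetical form of the added line in Xof_xfer_queue.java; `fileName`
// stands in for whatever variable holds the queued file's name.
System.out.println("downloading: " + fileName);
```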

When I use 'Reload Page', no images are downloaded; when I hit enter in the address bar, images are downloaded

gnosygnu commented 4 years ago

Cool. Thanks for the updates. I'm running errands tomorrow, so won't get a chance to review till Thursday morning.

gnosygnu commented 4 years ago

Hey, so I tried it today and couldn't reproduce it.

Maybe this is something to do with your forked changes? Could you try with xowa_get_and_make.sh? See my steps below.

Thanks!


Let's assume the XOWA root is something like C:\xowa_latest

desb42 commented 4 years ago

With the original issue, I had a forked change that shows the problem described (my version allows a 'Show preview' from the xowa-http side). Most of the time, I try to reproduce these issues with a fresh build via xowa_get_and_make.sh. I agree that following the steps described immediately above works fine.

However, I have also described other failures in further comments in this post, which I believe are related: specifically my comments on 7th June and 23rd June (clearing the /file/ cache is an important step)

Have you had an opportunity to try to reproduce those ones?

gnosygnu commented 4 years ago

However, I have also described other failures in further comments in this post, which I believe are related: specifically my comments on 7th June and 23rd June (clearing the /file/ cache is an important step)

Oops. I assumed the first comment was still related to the others. Sorry, my mistake. I should have read the others more closely

Have you had an opportunity to try to reproduce those ones?

I tried now with http://localhost:8080/en.wikipedia.org/wiki/Portal:Arts and see the issue. Let me re-review your commits and work on that next.

Sorry again for not spending a bit more time on going through the other comments. I know how much time you spend on these issues, and the least I could have done was read a little more closely. Will work on this over the next few days. Thanks!

gnosygnu commented 4 years ago

Added commit above. The approach is a bit different, as I ended up adding a new Xoh_wtr_ctx.HttpServer and using it to handle all the hdump logic (rough shape sketched below).
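
Rough shape of the idea. This is a sketch, not the actual commit; the TID values and method names beyond those quoted in this thread are assumptions.

```java
// Sketch only: a dedicated writer context for the http server, alongside the
// existing ones, so the writer can branch on "http server" explicitly instead
// of overloading the hdump flag.
public class Xoh_wtr_ctx {
    public static final int TID_BASIC = 0, TID_HDUMP = 1,
                            TID_EMBEDDABLE = 2, TID_HTTP_SERVER = 3;
    private final int mode;
    private Xoh_wtr_ctx(int mode) { this.mode = mode; }

    public static final Xoh_wtr_ctx Basic      = new Xoh_wtr_ctx(TID_BASIC);
    public static final Xoh_wtr_ctx Hdump      = new Xoh_wtr_ctx(TID_HDUMP);
    public static final Xoh_wtr_ctx HttpServer = new Xoh_wtr_ctx(TID_HTTP_SERVER);

    // The http server reuses the hdump-style link handling for missing files
    public boolean Mode_is_hdump() {
        return mode == TID_HDUMP || mode == TID_EMBEDDABLE || mode == TID_HTTP_SERVER;
    }
}
```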

Also, FWIW, your approach was very clever. I didn't actually realize what you were doing until I re-reviewed your changes today. I think if I had to solve the same problem, I would not have come up with this approach -- which is pretty sad considering I wrote the hdump code.

Anyway, nice job! Sorry again for the misunderstanding above, but thanks many more times for a great fix!

desb42 commented 4 years ago

Having done some further experiments and builds, I have noticed a number of tweaks that need consideration

The second pass goes through an (almost) completely built html page. This means there are some anchors (<a>) and image links (<img>) that have not needed to be considered before; this shows up in the logs. I have made some changes that stop the generation of these messages

I still cannot get enwikivoyage pagebanner images to 'download' properly

Today, however, I have just noticed (it's taken this long!) that the Categories section does not display at all

This is because the generation of the Categories checks the Hdump status, which, at that point, is now always on; hence no Categories

In 400_xowa\src\gplx\xowa\htmls\core\htmls\Xoh_wtr_ctx.java I have introduced a new mode check, Mode_is_hdump_only, which just checks the flags: return mode == TID_HDUMP || mode == TID_EMBEDDABLE;

And changed 400_xowa\src\gplx\xowa\htmls\Xoh_page_wtr_wkr.java to check that instead (sketched below)
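
Spelled out as a sketch: the method body is quoted above; the signature and the call site are assumptions. The check stays false for the new http-server mode, so Categories render there again.

```java
// In Xoh_wtr_ctx.java -- the new, narrower check (body quoted from the
// comment above; signature assumed).
public boolean Mode_is_hdump_only() {
    return mode == TID_HDUMP || mode == TID_EMBEDDABLE;  // false for the http-server mode
}

// In Xoh_page_wtr_wkr.java -- hypothetical call site for the Categories section:
if (!ctx.Mode_is_hdump_only())
    Write_categories(page);
```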