desb42 opened this issue 4 years ago (status: Open)
Taking this page as an example, the image to the right of Geographie is chosen 'randomly' from a list of 5 (in this case) images. The wikitext is:....
Yeah, I don't think this is resolvable. I don't know of a way to identify all the images in these "revolving" templates. I remember running across this early on, on a random enwiki page for India (it switched the image based on the time of day).
The problem is that the hdump process loads a page only once, so if there is a "revolving" image template, only one of the many images will be downloaded. I could try scanning the raw template text, but that becomes extremely difficult: you could get things like "{{random_template|Views of Geneva.jpg|Hn-caecilien66-web.jpg}}", which would need full template parsing.
For now, I'll leave this as a known issue in the backlog. Let me know if you have any other thoughts. Thanks
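To make the scanning difficulty concrete, here is a minimal, hypothetical sketch of a raw-text scan that collects filename-looking template parameters. It has no idea which template is a "revolving" one or which parameters are really images versus captions, which is exactly why full template parsing would be needed. The class name and regex are illustrative, not XOWA code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateImageScan {
    // Matches pipe-separated template parameters that end in an image extension.
    private static final Pattern IMAGE_PARAM =
        Pattern.compile("\\|\\s*([^|{}]+?\\.(?:jpg|jpeg|png|gif|svg))", Pattern.CASE_INSENSITIVE);

    // Collect every filename-looking parameter inside raw template text.
    public static List<String> scan(String wikitext) {
        List<String> files = new ArrayList<>();
        Matcher m = IMAGE_PARAM.matcher(wikitext);
        while (m.find()) {
            files.add(m.group(1).trim());
        }
        return files;
    }

    public static void main(String[] args) {
        System.out.println(scan("{{random_template|Views of Geneva.jpg|Hn-caecilien66-web.jpg}}"));
    }
}
```

Even this naive scan would over-collect (e.g. caption text that happens to end in `.jpg`) and under-collect (images produced by nested templates), so it is a rough upper bound at best.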
I have been doing a bit of digging and think I can explain the issue.
Taking en.wikipedia.org/wiki/Portal:Arts as an example: this page has many sections that involve random selection. It is not the randomness itself that is the cause (I believe).
Generating from wikidata, the randomness potentially produces new images to 'download'; the download process runs, and then the wikitext is processed a second time. This second pass potentially generates a different set of images, which do not go through another download, hence causing the process not to find a valid image.
In principle, the second pass could be performed on the html generated in the first pass. A bit like hdump?
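The two-pass failure described above can be sketched in isolation. Everything here (the class, the five-image list) is illustrative, not XOWA code; it just shows why a second random draw can reference a file the first pass never downloaded:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class TwoPassDemo {
    static final List<String> CHOICES = List.of("a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg");

    // A "revolving" template: each evaluation may pick a different image.
    static String pickRandom(Random rng) {
        return CHOICES.get(rng.nextInt(CHOICES.size()));
    }

    public static void main(String[] args) {
        Random rng = new Random();
        Set<String> downloaded = new HashSet<>();

        // Pass 1: parse wikitext, queue the chosen image for download.
        String firstPick = pickRandom(rng);
        downloaded.add(firstPick);

        // Pass 2: the wikitext is parsed again; the template may pick a
        // *different* image, which was never downloaded - the missing-image bug.
        String secondPick = pickRandom(rng);
        System.out.println("second pick already downloaded? " + downloaded.contains(secondPick));
        // Reusing the HTML produced by pass 1 (an hdump-style second pass)
        // avoids the second random draw entirely.
    }
}
```

With five choices, roughly four out of five runs will pick a different image on the second pass, which matches the "some images missing" symptom.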
In light of the above comment, I have made some changes to a few files to implement this concept.
The basic idea: during the html construction, when a file is not already in the /file/ subdirectory, change the generation of the link to use the hdump formatter; then, once the files have been downloaded, pass the generated html through the hdump process. (Hopefully that makes sense.)
I have introduced a new function, `Parse(src, page)`, into Xow_hdump_mgr_load.java, which is called from Http_server_page.java.
The other change (a bit hacky) is in Xoh_file_wtr__basic.java: I change `html_fmtr` to use the `fmtr__hdump` formatter if the current formatter is `fmtr__basic` and the file does not exist.
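As a rough illustration of that selection logic (the enum and method names below are hypothetical stand-ins for `html_fmtr` / `fmtr__basic` / `fmtr__hdump`, not the actual XOWA types):

```java
import java.io.File;

public class FormatterSelect {
    enum Fmtr { BASIC, HDUMP }

    // Hypothetical stand-in for the described change in Xoh_file_wtr__basic:
    // if the basic formatter would link a file that is not yet in the /file/
    // subdirectory, fall back to the hdump formatter so a later hdump pass
    // can resolve (and download) it.
    static Fmtr choose(Fmtr current, File imageFile) {
        if (current == Fmtr.BASIC && !imageFile.exists()) {
            return Fmtr.HDUMP;
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(choose(Fmtr.BASIC, new File("not_downloaded_yet.jpg")));
    }
}
```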
Please see attached rebuild.zip (definitely a work in progress)
My apologies here. I missed the comments from 2 weeks ago when my email was weird
Thanks for the code files. I took a look at the attached rebuild.zip, and I think it won't handle the html static image dumps. Calling `fmtr__hdump` may allow the GUI / HTTP_SERVER to show the image, but it won't log the image for the html static image dumper (the main call is here: https://github.com/gnosygnu/xowa/blob/master/400_xowa/src/gplx/xowa/parsers/lnkis/Xop_lnki_wkr.java#L75). I can alter Xoh_file_wtr__basic to make this call, but I wanted to reproduce this on my side first.
> Generating from wikidata, the randomness potentially produces new images to 'download', the download process runs, and then the wikitext is processed a second time
I tried to debug this further on my side, but with the XOWA GUI and no image databases, all the images on en.wikipedia.org/wiki/Portal:Arts show (they are "random", so each refresh of the page will download new images from the internet). I'll be downloading de.wikipedia.org sometime tonight, so I will take a look at de.wikipedia.org/wiki/Portal:Wikipedia_nach_Themen. Is that the best page to witness the behavior in the excerpt above?
I believe that this behaviour is 'limited' to xowa-http, due to the complete reprocessing of a page if an image is missing (in Http_server_page.java).
The changes I suggested above seem to work in xowa-http, but I forgot to check what impact there would be for xowa-gui (which I think does it a different way).
Attached is a version of Xoh_file_wtr__basic.java that takes account of the application mode. This seems to make things OK with xowa-gui.
Having been playing with the xowa-gui version and the page en.wikipedia.org/wiki/Portal:Arts, I have noticed some inconsistent behaviour:
1. I start with a fresh build of xowa (xowa_get_and_make.sh); this deletes all files in the /file/ subdirectory.
2. Start xowa and, in Options->Wiki - HTML Databases, untick 'Prefer HTML Databases for Read tab' (so as to always use wikitext).
3. In a new tab, request the above page.
4. The page loads and all images load (along with the appropriate text).
5. However, if within the page I right click and choose 'Reload Page', the page loads but some images are missing.
6. If I go to the address bar and hit carriage return (or enter), the page loads with all (random) images.
My version of xowa exhibits the same problem; however, I have added a line of code to Xof_xfer_queue.java that indicates which file is being downloaded (System.out.println).
When I use 'Reload Page', no images are downloaded; when I hit enter in the address bar, images are downloaded.
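The instrumentation described is just a trace line in the download queue. A hypothetical sketch (not the actual Xof_xfer_queue code) of a queue with such a println, which makes the two reload paths distinguishable in the console:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class XferQueueTrace {
    private final Queue<String> pending = new ArrayDeque<>();

    void add(String file) {
        pending.add(file);
    }

    // Print each file as it is taken off the queue, so 'Reload Page'
    // (no output) can be told apart from an address-bar reload
    // (one line per downloaded file).
    String takeNext() {
        String file = pending.poll();
        if (file != null) {
            System.out.println("downloading: " + file);
        }
        return file;
    }

    public static void main(String[] args) {
        XferQueueTrace q = new XferQueueTrace();
        q.add("Views of Geneva.jpg");
        q.takeNext();
    }
}
```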
Cool. Thanks for the updates. I'm running errands tomorrow, so won't get a chance to review till Thursday morning.
Hey, so I tried it today and couldn't reproduce it.
Maybe this is something to do with your forked changes? Could you try with xowa_get_and_make.sh? See my steps below.
Thanks!
Let's assume the XOWA root is something like C:\xowa_latest. (... home wiki. However, I wanted to simulate as close as possible the original bug report from 3/18.)
1. Run sh xowa_get_and_make.sh; this produces xowa_dev.jar. Move it to C:\xowa_latest.
2. Run java -jar xowa_dev.jar.
3. Give de.wikipedia.org/wiki/Project:Sandbox the following wikitext:
   [[Datei:Views of Geneva.jpg|right|150px|Genf]]
   [[Datei:Hn-caecilien66-web.jpg|right|150px|Villa Faißt in Heilbronn]]
   [[Datei:Collage_of_views_of_Poznań,_Poland.jpg|right|150px|Posen]]
   [[Datei:Arrasate-mondragon.jpg|right|150px|Baskenland]]
   [[Datei:Akihabara Electric Town 2.jpg|right|150px|Tokio]]
4. Delete the files in the C:\xowa_latest\file directory.
5. Run java -jar xowa_dev.jar --app_mode http_server.
6. Load de.wikipedia.org/wiki/Project:Sandbox.
-> All 5 files get downloaded and show.
With the original issue, I had a forked change that shows the problem described (my version allows a 'Show preview' from the xowa-http side).
Most of the time, I try to reproduce these issues with a fresh build via xowa_get_and_make.sh.
I agree that following the steps described immediately above works fine.
However, I have also, in further comments in this post, described other failures (that I believe are related), specifically my comments on 7th June and 23rd June (clearing the /file/ cache is an important step).
Have you had an opportunity to try to reproduce those ones?
> However I have also, in further comments in this post, described other failures (that I believe are related) Specifically my comments on 7th June and 23rd June (Clearing the /file/ cache is an important step)
Oops. I assumed the first comment was still related to the others. Sorry, my mistake. I should have read the others more closely.
> Have you had an opportunity to try to reproduce those ones?
I tried now with http://localhost:8080/en.wikipedia.org/wiki/Portal:Arts and see the issue. Let me re-review your commits and work on that next.
Sorry again for not spending a bit more time on going through the other comments. I know how much time you spend on these issues, and the least I could have done was read a little more closely. Will work on this over the next few days. Thanks!
Added commit above. The approach is a bit different, as I ended up adding a new Xoh_wtr_ctx.HttpServer
and used it to handle all the hdump logic.
Also, FWIW, your approach was very clever. I didn't actually realize what you were doing until I re-reviewed your changes today. I think if I had to solve the same problem, I would not have come up with this approach -- which is pretty sad considering I wrote the hdump code.
Anyway, nice job! Sorry again for the misunderstanding above, but thanks many more times for a great fix!
Having done some further experiments and builds, I have noticed a number of tweaks that need consideration.
The second pass goes through (almost) completely built html. This means there are some anchors (`<a>`) and image links (`<img>`) that have not needed to be considered before (this shows up in the logs). I have made some changes that stop the generation of these messages.
I still cannot get enwikivoyage pagebanner images to 'download' properly
Today, however, I have just noticed (it's taken this long!) that the Categories section does not display at all.
This is due to the fact that the generation of the Categories checks the Hdump status, which is now always on at that point - hence no Categories.
In 400_xowa\src\gplx\xowa\htmls\core\htmls\Xoh_wtr_ctx.java I have introduced a new mode check, `Mode_is_hdump_only`, which just checks the flags: `{return mode == TID_HDUMP || mode == TID_EMBEDDABLE;}`
And I changed 400_xowa\src\gplx\xowa\htmls\Xoh_page_wtr_wkr.java to check that instead.
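A standalone sketch of the distinction between the broad and narrow checks. Only `TID_HDUMP` and `TID_EMBEDDABLE` come from the snippet above; the other constants and the broader `Mode_is_hdump` shape are assumptions for illustration:

```java
public class XohWtrCtxSketch {
    // TID_HDUMP and TID_EMBEDDABLE appear in the comment above;
    // TID_BASIC and TID_HTTP_SERVER are hypothetical, for illustration only.
    static final int TID_BASIC = 0, TID_HDUMP = 1, TID_EMBEDDABLE = 2, TID_HTTP_SERVER = 3;

    private final int mode;

    XohWtrCtxSketch(int mode) {
        this.mode = mode;
    }

    // Assumed broad check: treats the http-server second pass as hdump too,
    // which is what suppressed the Categories section.
    boolean Mode_is_hdump() {
        return mode == TID_HDUMP || mode == TID_EMBEDDABLE || mode == TID_HTTP_SERVER;
    }

    // The narrower check described in the comment: true only for real hdump output,
    // so Categories are still generated during the http-server pass.
    boolean Mode_is_hdump_only() {
        return mode == TID_HDUMP || mode == TID_EMBEDDABLE;
    }

    public static void main(String[] args) {
        XohWtrCtxSketch http = new XohWtrCtxSketch(TID_HTTP_SERVER);
        System.out.println(http.Mode_is_hdump() + " " + http.Mode_is_hdump_only());
    }
}
```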
As described by @Ope30 in #680, the page de.wikipedia.org/wiki/Portal:Wikipedia_nach_Themen seems to be inconsistent in displaying images. I have seen this before in other wikis.
Taking this page as an example, the image to the right of Geographie is chosen 'randomly' from a list of 5 (in this case) images. The wikitext is:
If I take just the list of images, cut out all the rest of the wikitext, and replace it with these files, then when I Show preview (Vorschau zeigen), I get two images and three failures.