netarchivesuite / solrwayback

A search interface and wayback machine for the UKWA Solr based warc-indexer framework.
Apache License 2.0

Images not loading when src points to website's root #263

Closed PiaSpriensma closed 2 years ago

PiaSpriensma commented 2 years ago

(I am taking over from @mbreemhaar.) I am running SolrWayback behind a reverse proxy that forwards /solrwayback to SolrWayback. This causes a problem: if a link or an image points to the root of a website, the request goes to the root of my own website instead of /solrwayback/services/.../. The reverse proxy therefore does not send the request to SolrWayback, but to the root of my website.

A possible solution would be to replace all link addresses in playback that start with '/' with '/'. I tried this by changing the page source in my browser and that seems to work fine.

The WARC file in which we discovered this issue is the same one we uploaded in #230.
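
Why a root-relative address escapes the proxied path can be illustrated with standard URL resolution. The sketch below is only an illustration: the host archive.example.org and the file name teacup.jpeg are made up, and the playback URL merely follows the /solrwayback/services/web/... pattern that appears later in this thread.

```python
from urllib.parse import urljoin

# Hypothetical playback URL behind a reverse proxy (the host is an assumption).
playback_page = (
    "https://archive.example.org/solrwayback/services/web/"
    "20220707111526/https://www.pvda.nl/"
)

# A root-relative address as found in archived HTML (the file name is made up).
leaked = urljoin(playback_page, "/wp-content/uploads/cache/teacup.jpeg")

# The leading '/' discards the whole /solrwayback/... path of the base URL,
# so the browser asks the proxy's root instead of SolrWayback.
print(leaked)
```

Any address that starts with '/' behaves this way, which is why it has to be rewritten before the page reaches the browser or caught afterwards on the server.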

thomasegense commented 2 years ago

@PiaSpriensma If you have the ROOT.war servlet installed under webapps, it will take over and forward the request back into SolrWayback (using the referer to work out the correct URL). But if the proxy has taken over the normal ROOT.war location, this will not work.

Can you show the HTML snippet where you changed the URL? We try to parse all URLs and replace them before the HTML is displayed. But sometimes this is not possible, for example with dynamic JavaScript, and then leaks like this can happen. All leaks that try to access the live web will be caught by the service worker running in the browser. I have not been able to make the service worker also catch these URLs to the root, which is why the ROOT.war is still required. But it seems to happen very rarely.
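
The referer-based forwarding described above could look roughly like the sketch below. It is not the actual ROOT.war code: the function name, the regular expression, and the assumed playback URL shape (/solrwayback/services/web/, a 14-digit timestamp, then the original URL) are illustrative assumptions based on the URLs quoted in this thread.

```python
import re
from typing import Optional
from urllib.parse import urljoin

# Assumed playback URL shape: <prefix>/web/<14-digit timestamp>/<original URL>
PLAYBACK_URL = re.compile(
    r"^(?P<prefix>.*/solrwayback/services/web/\d{14}/)(?P<original>https?://.+)$"
)

def playback_url_for_leak(referer: str, leaked_path: str) -> Optional[str]:
    """Map a leaked root-relative request back into playback, or return None."""
    match = PLAYBACK_URL.match(referer)
    if match is None:
        return None  # The Referer is not a playback page; nothing to recover.
    # Resolve the leaked path against the original (archived) page URL.
    original_target = urljoin(match.group("original"), leaked_path)
    return match.group("prefix") + original_target
```

A servlet mounted at the server root could redirect to the returned URL. Behind a reverse proxy that only forwards /solrwayback, the leaked request never reaches such a servlet, which matches the behaviour reported in this issue.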

PiaSpriensma commented 2 years ago
<img class=" lazyloaded" src="/wp-content/uploads/cache/_src58df01ab47a08a72d59a6630ea2213ed_pare3f3885f9d535a90a9cb09b7cafc2aa3_dat1656197212.jpeg" data-src="/wp-content/uploads/cache/_src58df01ab47a08a72d59a6630ea2213ed_pare3f3885f9d535a90a9cb09b7cafc2aa3_dat1656197212.jpeg" alt="Partij van de Arbeid – Homepage">

changed to

<img class=" lazyloaded" src="wp-content/uploads/cache/_src58df01ab47a08a72d59a6630ea2213ed_pare3f3885f9d535a90a9cb09b7cafc2aa3_dat1656197212.jpeg" data-src="/wp-content/uploads/cache/_src58df01ab47a08a72d59a6630ea2213ed_pare3f3885f9d535a90a9cb09b7cafc2aa3_dat1656197212.jpeg" alt="Partij van de Arbeid – Homepage">
thomasegense commented 2 years ago

@PiaSpriensma Can you give me the SolrId of the record? For example: id:"20220617201326/LtjbgGQW223TdSfQeFobcw==" The src attribute of images should already be rewritten when the HTML is sent for playback. So I am not sure what is going on.

PiaSpriensma commented 2 years ago

@thomasegense How can I find the SolrId of the record?

thomasegense commented 2 years ago

When you have the webpage in the result list, you just unfold the fields.

Or you could give me the URL of the record and I can search for it myself.

PiaSpriensma commented 2 years ago

That is this one: 20220626044545/pPPsUOcEpBcruNNtba6zwA==

thomasegense commented 2 years ago

@PiaSpriensma Thanks. I hope I can look into it today. I will be on vacation for the next 2 weeks.

thomasegense commented 2 years ago

@PiaSpriensma That ID is not in the WARC file I can download. All resources in the downloaded WARC file seem to have been harvested on 20220913 (that is almost 3 months later). The IDs start with: id:"20220913064343/zgb4+HOwsS+1GMlija5WQw==" etc. Also, I cannot match the binary payload from 'pPPsUOcEpBcruNNtba6zwA==', so I do not have a 100% identical resource in the WARC file. Now you understand the ID format :)

Can I get the URL? Maybe it was harvested in this WARC file and has the same issue?
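
The ID format mentioned above can be taken apart with a few lines of code. This is a sketch under assumptions: it treats the part before the slash as a 14-digit harvest timestamp and the part after it as a base64 value derived from the payload, which is how the IDs are described in this thread; the function name is made up.

```python
import base64
from datetime import datetime

def parse_solr_id(solr_id: str):
    """Split an ID like '20220626044545/pPPsUOcEpBcruNNtba6zwA==' into its parts."""
    timestamp, _, encoded = solr_id.partition("/")
    # Assumed layout: YYYYMMDDhhmmss harvest time.
    harvested = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    # The second part decodes as standard base64.
    return harvested, base64.b64decode(encoded)

reported, _ = parse_solr_id("20220626044545/pPPsUOcEpBcruNNtba6zwA==")
indexed, _ = parse_solr_id("20220913064343/zgb4+HOwsS+1GMlija5WQw==")

# Difference between the reported record and the records in the indexed file.
print((indexed - reported).days, "days apart")
```

Comparing the timestamp prefixes this way is a quick check of whether a reported record can belong to a given harvest at all.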

PiaSpriensma commented 2 years ago

The WARC file is the same as in #230; that one was uploaded on 11 August... I don't understand which WARC file you are using. (It is not a problem to wait a few weeks...)

thomasegense commented 2 years ago

@PiaSpriensma I am not sure it is the correct WARC-file I am downloading: https://drive.google.com/file/d/18HlPVOWFf_QM4cY3BkI3cP4A_7Fu4rng/view

I cannot see when it was uploaded, but it was last modified on 9 July 2022.

After indexing, a *:* search shows 4,809 records in Solr, and a search for id:202207 finds all 4,809 records. So all resources in the WARC file were harvested in July. A search for id:202206 (this is the ID prefix from your record) gives 0 results.

And opening the WARC file shows the top metadata information:

WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2022-07-07T11:15:14Z
WARC-Filename: IAH-20220707111514454-00000-1427~deu.ub.rug.nl~8443.warc
WARC-Record-ID:
Content-Type: application/warc-fields
Content-Length: 450

Again, the first record was harvested/crawled on 07-07.

But if you give me the URL, maybe it is also in this newer WARC file.
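
The harvest date can also be read from the warcinfo header programmatically. The following is a minimal sketch that parses only the "Name: value" header lines quoted above; it is not a full WARC reader, the function name is made up, and the sample leaves out the WARC-Record-ID line because its value is not shown in this thread.

```python
from datetime import datetime

def parse_warc_headers(block: str) -> dict:
    """Parse the named header fields of one WARC record into a dict."""
    lines = block.strip().splitlines()
    fields = {"version": lines[0].strip()}
    for line in lines[1:]:
        # Split on the first colon only, so timestamps keep their colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields

sample = """WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2022-07-07T11:15:14Z
Content-Type: application/warc-fields
Content-Length: 450
"""

headers = parse_warc_headers(sample)
crawl_start = datetime.strptime(headers["WARC-Date"], "%Y-%m-%dT%H:%M:%SZ")
print(headers["WARC-Type"], crawl_start.date())
```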

PiaSpriensma commented 2 years ago

I'm sorry that we used an older one. The ID of this one is: 20220707113438/9ekEyUypwihZ99FVZhNxpg==

thomasegense commented 2 years ago

I have that record and it is an image (a teacup). I can use the reverse image lookup feature to see that it is used by 3 different web pages. I guess it is the main page you are seeing it on: https://www.pvda.nl/

But playback of that page seems perfect to me: http://localhost:8080/solrwayback/services/web/20220707111526/https://www.pvda.nl/

I see the nice teacup (attachment).

When I use 'inspect' in the playback for that image I see: <img class=" lazyloaded" src="http://localhost:8080/solrwayback/services/downloadRaw?source_file_path=/media/teg/1200GB_SSD/solrwayback_package_4.3.0/indexing/warcs3/pvda.warc&amp;offset=222990662" data-src="http://localhost:8080/solrwayback/services/downloadRaw?source_file_path=/media/teg/1200GB_SSD/solrwayback_package_4.3.0/indexing/warcs3/pvda.warc&amp;offset=222990662" alt="Partij van de Arbeid – Homepage">

The URL has already been replaced in the HTML given to playback. Most HTML elements are fixed this way. More dynamic scripts are fixed at runtime in the browser. The service worker will block access to all domains outside localhost and rewrite them. And (very rarely) some will leak to the root servlet, as in your description. This is fixed in the ROOT.war, but that does not work when running in reverse proxy mode.

But back to the issue. Can you try to inspect the element in playback mode and check again? (In Firefox, right-click the image and select 'Inspect'. Then right-click the selection and choose 'Edit as HTML'.)

The HTML you described is in the HTML source code (WARC file), but it should be URL-replaced during playback, as I showed.
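
To make the idea of URL replacement before playback concrete, here is a simplified sketch. It is not SolrWayback's implementation: the regular expression, the function name, and the choice to rewrite to a /services/web/ playback address (rather than the downloadRaw URLs shown above) are assumptions made for illustration.

```python
import re
from urllib.parse import urljoin

# Assumed playback prefix; a real installation would build this per request.
PLAYBACK_PREFIX = "http://localhost:8080/solrwayback/services/web/20220707111526/"

# Matches src="..." and data-src="...", but not the 'src' inside 'data-src'.
SRC_ATTRIBUTE = re.compile(r'(?<![\w-])(data-src|src)="([^"]*)"')

def rewrite_src_attributes(html: str, original_page_url: str) -> str:
    """Point src and data-src attributes at the playback service."""
    def replace(match: re.Match) -> str:
        attribute, value = match.group(1), match.group(2)
        if value.startswith("data:"):
            return match.group(0)  # Inline data URIs need no rewriting.
        # Resolve against the archived page, then prepend the playback prefix.
        absolute = urljoin(original_page_url, value)
        return f'{attribute}="{PLAYBACK_PREFIX}{absolute}"'
    return SRC_ATTRIBUTE.sub(replace, html)

tag = '<img class=" lazyloaded" src="/wp-content/a.jpeg" data-src="/wp-content/a.jpeg" alt="x">'
print(rewrite_src_attributes(tag, "https://www.pvda.nl/"))
```

Rewriting both attributes matters for lazy-loading pages: a script copies data-src into src at runtime, so an unrewritten data-src reintroduces the root-relative address after the page has loaded.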

PiaSpriensma commented 2 years ago

If I understand you correctly, that is the same thing we did before. The result is: <img class=" lazyloaded" src="/wp-content/uploads/cache/_src58df01ab47a08a72d59a6630ea2213ed_pare3f3885f9d535a90a9cb09b7cafc2aa3_dat1657147771.jpeg" data-src="/wp-content/uploads/cache/_src58df01ab47a08a72d59a6630ea2213ed_pare3f3885f9d535a90a9cb09b7cafc2aa3_dat1657147771.jpeg"

The 3 pictures above the teacup do have the ".../solrwayback/services/downloadRaw?source_file_path=/..." URL, but the wallet, the teacup and the pictures below do not show up.

thomasegense commented 2 years ago

@PiaSpriensma I cannot really recreate that bug. I have attached the web page source code I get when visiting this URL: http://localhost:8080/solrwayback/services/web/20220707111526/https://www.pvda.nl/

pvda_localhost.txt

Both the src and data-src attributes have been URL-replaced with the local SolrWayback URL everywhere.

You can see the source code from the playback page or on linux/mac you can download it with: wget 'http://localhost:8080/solrwayback/services/web/20220707111526/https://www.pvda.nl/'

If the URLs are not rewritten correctly for you there, then something other than the playback engine is wrong. It is worth trying the newest build of SolrWayback from the master branch, which is the one I am using. But the handling of the img src attribute should not have been touched for a long time. The data-src handling was fixed as part of the other bug you submitted.

PiaSpriensma commented 2 years ago

When I download the playback page, I see that the src you mentioned is between before the rules that are already mentioned...

For now I would say: have a nice holiday, we will see when you return

thomasegense commented 2 years ago

@PiaSpriensma I am back from holiday now. So let's see if we can close this.

When I look at the source code of the playback page http://localhost:8080/solrwayback/services/web/20220707111526/https://www.pvda.nl/ then all img tags I see have been URL-replaced with a SolrWayback URL, and the images work.

The only place I can find <img class=" lazyloaded" src="/wp-content/upload is in the source code of the original warc-record.

Can I see the full source code of the playback ?

PiaSpriensma commented 2 years ago

@thomasegense is this what you mean? view-source_https_test-archipol.ub.rug.nl_solrwayback_services_web_20220707111526httpswww.pvda.nl_.html.zip

thomasegense commented 2 years ago

@PiaSpriensma Yes. That helps. I can see the data-src has not been URL-replaced. This should have been fixed in the last patch I gave to @mbreemhaar. Here is one of the img tags on my SolrWayback installation: <img class="lazyload" alt="Partij van de Arbeid – Homepage" title="Partij van de Arbeid – Homepage" src="data:image/svg+xml,%3Csvg%20xmlns=%22http://www.w3.org/2000/svg%22%20viewBox=%220%200%20210%20140%22%3E%3C/svg%3E" data-src="http://localhost:8080/solrwayback/services/downloadRaw?source_file_path=/media/teg/1200GB_SSD/solrwayback_package_4.3.0/indexing/warcs3/pvda.warc&amp;offset=210342747" />

And the same from your source code: <img class="lazyload" alt="Partij van de Arbeid – Homepage" title="Partij van de Arbeid – Homepage" src="data:image/svg+xml,%3Csvg%20xmlns=%22http://www.w3.org/2000/svg%22%20viewBox=%220%200%20210%20140%22%3E%3C/svg%3E" data-src="/wp-content/uploads/cache/_src0e44d60eeef90a24a5027c13e7e84d13_pard5b30cab58098f5a2885130cfea61752_dat1657011599.jpeg" />

So running the latest SolrWayback should fix that problem. You can see your current version from this url: http://localhost:8080/solrwayback/services/frontend/properties/solrwaybackweb (maybe replace the localhost with your server) And look for: solrwayback.version "4.3.1"

Are you running version 4.3.1?

PiaSpriensma commented 2 years ago

@thomasegense Yes we do. By looking for the version, this is the outcome: .png","solrwayback.version":"4.3.1-SNAPSHOT",
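
Reading the version out of that properties response can be scripted. The sketch below assumes the endpoint returns a JSON object containing the "solrwayback.version" key, as the fragment above suggests; the sample document is a made-up minimal stand-in for the real response, and the function names are hypothetical.

```python
import json

def read_version(properties_json: str) -> str:
    """Return the value of 'solrwayback.version' from the properties JSON."""
    return json.loads(properties_json)["solrwayback.version"]

def is_snapshot(version: str) -> bool:
    """Snapshot builds of the same version number can differ in content."""
    return version.endswith("-SNAPSHOT")

# Minimal stand-in for the response body of
# .../solrwayback/services/frontend/properties/solrwaybackweb
sample = '{"solrwayback.version": "4.3.1-SNAPSHOT"}'

version = read_version(sample)
print(version, "(snapshot build)" if is_snapshot(version) else "(release build)")
```

A -SNAPSHOT suffix means the version string alone does not identify which fixes a build contains.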

thomasegense commented 2 years ago

@PiaSpriensma Thanks. I think I made two snapshot releases, so I cannot see what is included in yours. But I hope I can get you to download the 4.3.2-SNAPSHOT, which has all the 4.3.1 fixes. I made a download for you here: https://drive.google.com/file/d/1JW4dKEQW1BmFCWldkJBd_53XDgftzeDp/view?usp=sharing

Just replace the war-file in the tomcat/webapps folder and no further changes needed (I hope).

A new text pop-up has been added next to the toolbar. If you want to customize the text add the following property in solrwaybackweb.properties: collection.text.file=/home/test/yourfile.txt You can use HTML codes to format text, but do not include start tag. Just the content.

More changes can be seen here: https://github.com/netarchivesuite/solrwayback/blob/master/CHANGES.md

Let me know if it works.

PiaSpriensma commented 2 years ago

@thomasegense That works! Thank you very much.

thomasegense commented 2 years ago

Thanks! I am very pleased to hear that.

The ROOT servlet would automatically have fixed the leak, but when running a reverse proxy as you do, the requests will not reach the ROOT servlet. Early URL replacement, as much as possible, is still better.

There can still be leaks generated by dynamic JavaScript etc. that also go to the root service, and in that case it will not work for you. But these cases are very rare in my experience.

I will close this now.