Closed: PiaSpriensma closed this issue 2 years ago
@PiaSpriensma If you have the ROOT.war servlet installed under webapps, it will take over and forward the request back into SolrWayback (using the referer to work out the correct URL). But if a proxy has taken over the normal ROOT.war location, this will not work.
Can you show the HTML snippet where you changed the URL? We try to parse all URLs and replace them before the HTML is displayed, but sometimes that is not possible, for example with dynamic JavaScript, and then leaks can happen. All leaks that try to access the live web are caught by the service worker running in the browser. I have not been able to make the service worker also catch these URLs to the root, which is why the ROOT.war is still required. But it seems to happen very rarely.
<img class=" lazyloaded" src="/wp-content/uploads/cache/_src58df01ab47a08a72d59a6630ea2213ed_pare3f3885f9d535a90a9cb09b7cafc2aa3_dat1656197212.jpeg" data-src="/wp-content/uploads/cache/_src58df01ab47a08a72d59a6630ea2213ed_pare3f3885f9d535a90a9cb09b7cafc2aa3_dat1656197212.jpeg" alt="Partij van de Arbeid – Homepage">
changed to
<img class=" lazyloaded" src="wp-content/uploads/cache/_src58df01ab47a08a72d59a6630ea2213ed_pare3f3885f9d535a90a9cb09b7cafc2aa3_dat1656197212.jpeg" data-src="/wp-content/uploads/cache/_src58df01ab47a08a72d59a6630ea2213ed_pare3f3885f9d535a90a9cb09b7cafc2aa3_dat1656197212.jpeg" alt="Partij van de Arbeid – Homepage">
@PiaSpriensma Can you give me the SolrId of the record? For example: id:"20220617201326/LtjbgGQW223TdSfQeFobcw==" The src attribute of images should already be rewritten when the HTML is sent for playback, so I am not sure what is going on.
@thomasegense How can I find the SolrId of the record?
When you have the webpage in the result list, you just unfold the fields.
Or you could give me the URL of the record and I can search for it myself.
That is this one: 20220626044545/pPPsUOcEpBcruNNtba6zwA==
@PiaSpriensma Thanks. I hope I can look into it today. I will be on vacation for the next 2 weeks.
@PiaSpriensma That ID is not in the WARC file I can download. All resources in the downloaded WARC file seem to have been harvested on 20220913 (that is almost 3 months later). The IDs start with: id:"20220913064343/zgb4+HOwsS+1GMlija5WQw==" etc. I also cannot match the binary payload from the 'pPPsUOcEpBcruNNtba6zwA==' part, so I do not have a 100% identical resource in the WARC file. Now you understand the ID format :)
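To make the ID format concrete, here is a minimal sketch that splits such an ID into its two parts. It assumes, based only on the examples in this thread, that the part before the slash is a 14-digit harvest timestamp and the part after it is a Base64 value derived from the payload. The function name `parse_solr_id` is made up for illustration and is not part of SolrWayback.

```python
import base64
from datetime import datetime


def parse_solr_id(solr_id: str):
    """Split an ID like '20220626044545/pPPsUOcEpBcruNNtba6zwA==' into its parts.

    Assumption: '<yyyyMMddHHmmss>/<Base64 value derived from the payload>'.
    """
    timestamp, _, payload_part = solr_id.partition("/")
    crawl_time = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    payload_bytes = base64.b64decode(payload_part)
    return crawl_time, payload_bytes


crawl_time, payload_bytes = parse_solr_id("20220626044545/pPPsUOcEpBcruNNtba6zwA==")
print(crawl_time.isoformat())  # 2022-06-26T04:45:45
print(len(payload_bytes))      # 16
```

Two records with the same payload part but different timestamps would then be the same resource harvested at different times, which is why the timestamp prefix alone is enough to tell which crawl a record came from.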
Can I get the URL? Maybe it was harvested in this WARC file too and has the same issue?
The WARC file is the same as in #230; that one was uploaded 11 August... I don't understand which WARC file you are using. (It is not a problem to wait a few weeks...)
@PiaSpriensma I am not sure it is the correct WARC-file I am downloading: https://drive.google.com/file/d/18HlPVOWFf_QM4cY3BkI3cP4A_7Fu4rng/view
I cannot see when it was uploaded, but it was last modified on 9 July 2022.
After indexing, a *:* search shows 4,809 records in Solr.
A search for id:202207 finds all 4,809 records, so all resources in the WARC file were harvested in July.
A search for id:202206 (the ID prefix from your record) gives 0 results.
Opening the WARC file shows this metadata at the top:
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2022-07-07T11:15:14Z
WARC-Filename: IAH-20220707111514454-00000-1427~deu.ub.rug.nl~8443.warc
WARC-Record-ID:
Again, the first record was harvested/crawled on 07-07.
But if you give me the URL, maybe it is also in this newer WARC file.
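The same check can be done without indexing. The sketch below reads the header fields of the first record from a text block and derives the ID prefix to search for. It is only an illustration: `parse_warc_header` is a hypothetical helper, and the idea that the record IDs of a crawl start with the year and month of WARC-Date is taken from the searches above, not from any SolrWayback documentation.

```python
def parse_warc_header(header_text: str) -> dict:
    """Parse the header block of one WARC record into a dict of fields."""
    lines = header_text.strip().splitlines()
    fields = {"version": lines[0].strip()}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header block
        # Split at the first colon only, so timestamps keep their colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


header = """WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2022-07-07T11:15:14Z
WARC-Filename: IAH-20220707111514454-00000-1427~deu.ub.rug.nl~8443.warc
"""

fields = parse_warc_header(header)
id_prefix = fields["WARC-Date"][:7].replace("-", "")
print(id_prefix)  # 202207
```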
I'm sorry that we used an older one. The ID of this one is: 20220707113438/9ekEyUypwihZ99FVZhNxpg==
I have that record and it is an image (a teacup). Using the reverse image lookup feature, I can see it is used on 3 different web pages. I guess the main page is where you see it: https://www.pvda.nl/
But playback of that page seems perfect to me: http://localhost:8080/solrwayback/services/web/20220707111526/https://www.pvda.nl/
I see the nice teacup (attachment)
When I use 'inspect' in the playback for that image I see:
<img class=" lazyloaded" src="http://localhost:8080/solrwayback/services/downloadRaw?source_file_path=/media/teg/1200GB_SSD/solrwayback_package_4.3.0/indexing/warcs3/pvda.warc&offset=222990662" data-src="http://localhost:8080/solrwayback/services/downloadRaw?source_file_path=/media/teg/1200GB_SSD/solrwayback_package_4.3.0/indexing/warcs3/pvda.warc&offset=222990662" alt="Partij van de Arbeid – Homepage">
The URL has already been replaced in the HTML given to playback. Most HTML elements are fixed this way. More dynamic scripts are fixed at runtime in the browser: the service worker blocks access to all domains outside localhost and rewrites those requests. Very rarely, some requests leak to the root servlet as in your description. That is fixed by the ROOT.war, but it does not work when running in reverse proxy mode.
But back to the issue. Can you try inspecting the element in playback mode and check again? (In Firefox, right-click the image and select 'Inspect'. Then right-click the selected element and choose 'Edit as HTML'.)
The HTML you described is in the HTML source code (WARC file), but it should be URL-replaced during playback as I showed.
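As a rough illustration of what URL-replacing during playback means, here is a small sketch that rewrites src and data-src attributes before the HTML is served. It is not the SolrWayback implementation: the real playback resolves each resource in the index and points to downloadRaw with a file path and offset, as in the snippet above. This sketch instead assumes the simpler /services/web/<timestamp>/<url> form seen in the playback URL, and `rewrite_img_urls` is a made-up name.

```python
import re
from urllib.parse import urljoin

# Matches src="..." and data-src="..." attributes (double-quoted values only).
ATTR_PATTERN = re.compile(r'\b(data-src|src)="([^"]*)"')


def rewrite_img_urls(html: str, playback_prefix: str, timestamp: str, page_url: str) -> str:
    """Point src/data-src at the playback service instead of the site root."""

    def replace(match):
        attr, value = match.group(1), match.group(2)
        if value.startswith("data:"):
            return match.group(0)  # inline placeholder images need no rewrite
        absolute = urljoin(page_url, value)  # resolve against the original page
        return f'{attr}="{playback_prefix}/{timestamp}/{absolute}"'

    return ATTR_PATTERN.sub(replace, html)


html = '<img class="lazyload" src="data:image/svg+xml,x" data-src="/wp-content/uploads/a.jpeg">'
rewritten = rewrite_img_urls(
    html,
    "http://localhost:8080/solrwayback/services/web",
    "20220707111526",
    "https://www.pvda.nl/",
)
print(rewritten)
```

The point of handling data-src as well as src is that lazy-loading scripts copy data-src into src in the browser, so an unrewritten data-src turns into a root-relative request after the page has loaded.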
If I understand you correctly, that is the same thing we did before. The result is: <img class=" lazyloaded" src="/wp-content/uploads/cache/_src58df01ab47a08a72d59a6630ea2213ed_pare3f3885f9d535a90a9cb09b7cafc2aa3_dat1657147771.jpeg" data-src="/wp-content/uploads/cache/_src58df01ab47a08a72d59a6630ea2213ed_pare3f3885f9d535a90a9cb09b7cafc2aa3_dat1657147771.jpeg"
The 3 pictures above the teacup do have the ".../solrwayback/services/downloadRaw?source_file_path=/..." URL, but the wallet, the teacup and the pictures below it do not show up.
@PiaSpriensma I cannot really recreate that bug. I have attached the web page source code I get when visiting this URL: http://localhost:8080/solrwayback/services/web/20220707111526/https://www.pvda.nl/
Both the src attribute and data-src have been URL-replaced with the local SolrWayback URL everywhere.
You can see the source code from the playback page or on linux/mac you can download it with:
wget 'http://localhost:8080/solrwayback/services/web/20220707111526/https://www.pvda.nl/'
If the URLs are not rewritten correctly for you there, then something other than the playback engine is wrong. It is worth trying the newest build of SolrWayback from the master branch; that is the one I am using. But the handling of the img src attribute should not have been touched for a long time. The data-src was fixed by the other bug you submitted.
When I download the playback page, I see that the src you mentioned is between before the rules that are already mentioned...
For now I would say: have a nice holiday, we will see when you return
@PiaSpriensma I am back from holiday now, so let's see if we can close this.
When I look at the source code of the playback page http://localhost:8080/solrwayback/services/web/20220707111526/https://www.pvda.nl/
all the img tags I see have been URL-replaced with a SolrWayback URL, and the images will work.
The only place I can find <img class=" lazyloaded" src="/wp-content/upload
is in the source code of the original warc-record.
Can I see the full source code of the playback ?
@thomasegense is this what you mean? view-source_https_test-archipol.ub.rug.nl_solrwayback_services_web_20220707111526httpswww.pvda.nl_.html.zip
@PiaSpriensma Yes. That helps.
I can see the data-src has not been URL-replaced. This should have been fixed in the last patch I gave to @mbreemhaar
Here is one of the img tags on my SolrWayback installation:
<img class="lazyload" alt="Partij van de Arbeid – Homepage" title="Partij van de Arbeid – Homepage" src="data:image/svg+xml,%3Csvg%20xmlns=%22http://www.w3.org/2000/svg%22%20viewBox=%220%200%20210%20140%22%3E%3C/svg%3E" data-src="http://localhost:8080/solrwayback/services/downloadRaw?source_file_path=/media/teg/1200GB_SSD/solrwayback_package_4.3.0/indexing/warcs3/pvda.warc&offset=210342747" />
And the same from your source code:
<img class="lazyload" alt="Partij van de Arbeid – Homepage" title="Partij van de Arbeid – Homepage" src="data:image/svg+xml,%3Csvg%20xmlns=%22http://www.w3.org/2000/svg%22%20viewBox=%220%200%20210%20140%22%3E%3C/svg%3E" data-src="/wp-content/uploads/cache/_src0e44d60eeef90a24a5027c13e7e84d13_pard5b30cab58098f5a2885130cfea61752_dat1657011599.jpeg" />
So running the latest SolrWayback should fix that problem.
You can see your current version from this url:
http://localhost:8080/solrwayback/services/frontend/properties/solrwaybackweb
(maybe replace the localhost with your server)
And look for:
solrwayback.version "4.3.1"
Are you running version 4.3.1?
@thomasegense Yes we do. By looking for the version, this is the outcome: .png","solrwayback.version":"4.3.1-SNAPSHOT",
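For anyone who wants to script this check, here is a minimal sketch. It assumes the properties URL returns a flat JSON object containing a "solrwayback.version" key, which is what the fragment above suggests. The helper names are hypothetical, and fetching the URL is left out so the example runs without a server.

```python
import json


def extract_version(properties_json: str) -> str:
    """Read the version string from the properties response (assumed flat JSON)."""
    return json.loads(properties_json)["solrwayback.version"]


def version_tuple(version: str) -> tuple:
    """Turn '4.3.1-SNAPSHOT' into (4, 3, 1) so versions can be compared."""
    numeric = version.split("-", 1)[0]
    return tuple(int(part) for part in numeric.split("."))


response = '{"solrwayback.version": "4.3.1-SNAPSHOT"}'
version = extract_version(response)
print(version)                                   # 4.3.1-SNAPSHOT
print(version_tuple(version) >= (4, 3, 1))       # True
```

Note that the comparison ignores the -SNAPSHOT suffix, so it cannot tell two snapshot builds of the same version apart.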
@PiaSpriensma Thanks. I think I made two snapshot releases, so I cannot tell what is included in yours. But I hope I can get you to download the 4.3.2-SNAPSHOT, which has all the 4.3.1 fixes. I made a download for you here: https://drive.google.com/file/d/1JW4dKEQW1BmFCWldkJBd_53XDgftzeDp/view?usp=sharing
Just replace the war file in the tomcat/webapps folder; no further changes should be needed (I hope).
A new text pop-up has been added next to the toolbar. If you want to customize the text add the following property in solrwaybackweb.properties: collection.text.file=/home/test/yourfile.txt You can use HTML codes to format text, but do not include start tag. Just the content.
More changes can be seen here: https://github.com/netarchivesuite/solrwayback/blob/master/CHANGES.md
Let me know if it works.
@thomasegense That works! Thank you very much.
Thanks! I am very pleased to hear that.
The ROOT servlet would have fixed the leak automatically, but when running behind a reverse proxy as you do, the requests never reach the ROOT servlet. Replacing as many URLs as possible early on is still the better approach.
There can still be leaks generated by dynamic JavaScript etc. that go to the root service, and in that case it will not work for you. But in my experience these cases are very rare.
I will close this now.
(I am taking over from @mbreemhaar ) I am running SolrWayback behind a reverse proxy that sends /solrwayback to SolrWayback. This causes a problem where if a link or an image points to the root of a website, it will go to instead of /solrwayback/services/.../. This means that the reverse proxy will not send the request to SolrWayback, but to the root of my website.
A possible solution would be to replace all link addresses in playback that start with '/' with '/'. I tried this by changing the page source in my browser and that seems to work fine.
The WARC file in which we discovered this issue is the same one we loaded at #230
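A small sketch of why the request misses the proxy rule: a root-relative address is resolved against the host only, so the /solrwayback prefix of the playback page is dropped. The host archive.example.org and the image path are placeholders, and `reaches_solrwayback` is a made-up helper that stands in for a proxy rule matching the /solrwayback/ path prefix.

```python
from urllib.parse import urljoin, urlsplit

# Playback page as served through the reverse proxy (placeholder host).
page = "https://archive.example.org/solrwayback/services/web/20220707111526/https://www.pvda.nl/"

# A root-relative address, as found in the unrewritten HTML.
leaked = urljoin(page, "/wp-content/uploads/cache/image.jpeg")
print(leaked)  # https://archive.example.org/wp-content/uploads/cache/image.jpeg


def reaches_solrwayback(url: str) -> bool:
    """True if a proxy rule on the /solrwayback/ prefix would forward this URL."""
    return urlsplit(url).path.startswith("/solrwayback/")


print(reaches_solrwayback(page))    # True
print(reaches_solrwayback(leaked))  # False
```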