Wayback machine image URLs still loading images from original Amazon S3 URL

jywarren commented 1 year ago

I found a strange issue when I pointed at a collection of JSON files which have had images routed to the Internet Archive's Wayback Machine caches.

As you can see, the image links are routed to Wayback URLs: https://ia601603.us.archive.org/20/items/mapknitter-wayback/ceres--2.json :

i.e.: https://web.archive.org/web/0id_/https://s3.amazonaws.com/grassrootsmapping/warpables/305268/PuglisiTerrazzeHaghiaTriadaCretaAntica2007-28.jpg

However, when I actually load a page like this, somehow it still loads images directly from Amazon s3, not the Internet Archive:

https://publiclab.github.io/Leaflet.DistortableImage/examples/archive?json=https://archive.org/download/mapknitter-wayback/ceres--2.json

I inspected in the console and still can't figure it out.

@segun-codes @7malikk I was curious, if you had an interest in this, what do you think is happening here? Could any application logic we've written be causing this?

For example, the page at https://publiclab.github.io/Leaflet.DistortableImage/examples/archive?json=https://archive.org/download/mapknitter-wayback/ceres--2.json still loads https://s3.amazonaws.com/grassrootsmapping/warpables/306187/DJI_1207.JPG

segun-codes commented 1 year ago

Hi @jywarren, I am happy to check this out.

segun-codes commented 1 year ago

Hi @jywarren, I checked the code. The transformation in the function below (in archive.js) is responsible for the behaviour you describe. If my memory serves me right, we designed it this way at the time because of issues with accessing the images programmatically via IA. I also observed that the Wayback Machine itself simply loads the images from s3. What do you think?

// where imageSrc is in format: https://web.archive.org/web/20220803171120/https://s3.amazonaws.com/grassrootsmapping/warpables/48659/t82n_r09w_01-02_1985.jpg
// returns https://s3.amazonaws.com/grassrootsmapping/warpables/48659/t82n_r09w_01-02_1985.jpg or
// returns same url unchanged (no transformation required)
function extractImageSource(imageSrc) {
  if (imageSrc.startsWith('https://web.archive.org/web/')) {
    return imageSrc.substring(imageSrc.lastIndexOf('https'), imageSrc.length);
  }
  return imageSrc;
}

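Illustration 1: https://user-images.githubusercontent.com/1612359/224565688-4ebdb4cc-6b7b-4ba1-919b-18e1fa965c06.PNG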
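To make the effect concrete, here is a small sketch that runs the function quoted above against two Wayback URLs that appear in this thread. The second half, `stripWayback`, is a hypothetical variant (not code from archive.js) that also handles Wayback URLs wrapping a plain `http://` address, which the `lastIndexOf('https')` approach appears to leave unchanged.

```javascript
// Copied from the comment above (archive.js).
function extractImageSource(imageSrc) {
  if (imageSrc.startsWith('https://web.archive.org/web/')) {
    return imageSrc.substring(imageSrc.lastIndexOf('https'), imageSrc.length);
  }
  return imageSrc;
}

// Two Wayback URLs taken from this thread: one wraps an https:// URL,
// the other wraps an http:// URL.
const httpsInner = 'https://web.archive.org/web/0id_/https://s3.amazonaws.com/grassrootsmapping/warpables/409/IMG_4155.JPG';
const httpInner = 'https://web.archive.org/web/20200506081918id_/http://s3.amazonaws.com/grassrootsmapping/warpables/417/img_0135.jpg';

// The Wayback prefix is stripped, so the browser requests S3 directly.
console.log(extractImageSource(httpsInner));

// The only 'https' is at index 0, so the whole URL comes back unchanged.
console.log(extractImageSource(httpInner));

// Hypothetical variant (an assumption, not existing project code):
// capture whatever URL follows the timestamp segment, http or https.
function stripWayback(imageSrc) {
  const match = imageSrc.match(/^https:\/\/web\.archive\.org\/web\/[^/]+\/(https?:\/\/.+)$/);
  return match ? match[1] : imageSrc;
}

console.log(stripWayback(httpInner));
```

If that reading is right, JSON entries whose Wayback URL wraps an `http://` address would still be requested from web.archive.org rather than from S3.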

jywarren commented 1 year ago

Hmm, did this apply only to JSON maybe? Would you mind trying to remove that, so that it loads directly from the Wayback Machine?

Thanks for finding that!!!


segun-codes commented 1 year ago

Okay @jywarren, I'll look into this. Many thanks!

jywarren commented 1 year ago

Ah yes. I see - we get this error if we don't do that --

Access to image at 'https://web.archive.org/web/0id_/https://s3.amazonaws.com/grassrootsmapping/warpables/409/IMG_4155.JPG' from origin 'http://localhost:8082' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.

I'm not sure... is there another way to access https://web.archive.org/web/20200506081918id_/http://s3.amazonaws.com/grassrootsmapping/warpables/417/img_0135.jpg without CORS issues? Otherwise, we could... upload that entire directory into an Archive collection, and serve it from there.

That is, Wayback URLs have CORS limitations, but images at regular archive.org/download/_____ URLs do not.

segun-codes commented 1 year ago

Yes, I pointed out the CORS limitation in my previous message. It was the reason I fetched from s3 directly.

Okay, but is there something wrong with fetching from s3, given that the legacy JSON files all have image sources pointing to s3 either directly or indirectly? For instance, https://web.archive.org/web/20200506081918id_/http://s3.amazonaws.com/grassrootsmapping/warpables/417/img_0135.jpg simply points to s3 indirectly, nothing more.

jywarren commented 1 year ago

Yes, sorry, just agreeing and confirming from my test. Thank you!

The only issue with s3 is that it costs Public Lab money to host -- it's not forever storage. I think perhaps the best choice is to create an archive.org collection and add to this logic in extractImageSource(), where we replace http://s3.amazonaws.com/grassrootsmapping with https://archive.org/download/mapknitter-wayback
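A rough sketch of what that replacement could look like, assuming the archive.org collection mirrors the S3 bucket's path layout (not confirmed here, since the upload is still in progress). The names `stripWayback` and `toArchiveDownloadUrl` are hypothetical and are not existing functions in archive.js.

```javascript
// Prefixes named in the comment above; both http and https S3 forms are matched.
const S3_PREFIX = /^https?:\/\/s3\.amazonaws\.com\/grassrootsmapping/;
const ARCHIVE_PREFIX = 'https://archive.org/download/mapknitter-wayback';

// Hypothetical helper: drop a leading Wayback wrapper if there is one.
function stripWayback(imageSrc) {
  const match = imageSrc.match(/^https:\/\/web\.archive\.org\/web\/[^/]+\/(https?:\/\/.+)$/);
  return match ? match[1] : imageSrc;
}

// Hypothetical helper: point an S3 (or Wayback-wrapped S3) image URL at the
// archive.org collection. ASSUMPTION: files keep the same relative path
// (e.g. warpables/409/IMG_4155.JPG) inside the collection.
function toArchiveDownloadUrl(imageSrc) {
  return stripWayback(imageSrc).replace(S3_PREFIX, ARCHIVE_PREFIX);
}

console.log(toArchiveDownloadUrl(
  'https://web.archive.org/web/0id_/https://s3.amazonaws.com/grassrootsmapping/warpables/409/IMG_4155.JPG'
));
```

URLs that match neither prefix pass through unchanged, so images hosted elsewhere would keep loading from their original location.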

I'm working on uploading all the files, but it'll be a while. We can check in here again once it's complete!

segun-codes commented 1 year ago

Ha! okay, I understand now. So archive.org option is definitely the route to take. I will check back then.

jywarren commented 1 year ago

Gosh, it's going to take a while! It's 631,813 files, and I've only downloaded 3,875 so far...

I may try another way at a remote server that's faster... we'll see!

segun-codes commented 1 year ago

Yeah... this has to take a while

Mustafa-Hersi commented 11 months ago

is this issue being worked on?

jywarren commented 11 months ago

Hi, we are still working on uploading the archive.org collection, apologies!

publiclab / Leaflet.DistortableImage

Wayback machine image URLs still loading images from original Amazon S3 URL #1379