servo / servo-warc-tests

Test Servo on Web Archive snapshots of real web sites
Mozilla Public License 2.0

Add web archive for instagram.com #42

Closed asajeffrey closed 6 years ago

asajeffrey commented 6 years ago

#37

pduzinki commented 6 years ago

I'd like to try this one :)

asajeffrey commented 6 years ago

Go for it!

pawan92 commented 6 years ago

is this still open?

asajeffrey commented 6 years ago

@pduzinki are you still working on this, or can @pawan92 give it a shot?

pduzinki commented 6 years ago

Sorry, didn't really have time recently. @pawan92 can give it a shot, sure :)

pawan92 commented 6 years ago

i'll give it a shot! hope to have something in by end of the week

pawan92 commented 6 years ago

Apologies for the lateness. I tried following the webservice steps and I am having some issues installing using virtualenv. I am running on a Mac.

asajeffrey commented 6 years ago

Well, it should work without virtualenv, I usually recommend virtualenv just to avoid installing stuff globally.

pawan92 commented 6 years ago

hmm ok let me try that out

pawan92 commented 6 years ago

Is wayback something separate we need to install?

asajeffrey commented 6 years ago

wayback should be installed when you run pip install git+https://github.com/ikreymer/pywb.git

pawan92 commented 6 years ago

i did that and it still says command not found

asajeffrey commented 6 years ago

It has probably been installed somewhere that isn't on your PATH. Try giving the full path name for the command.
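
One way to find out where pip put the script, as a minimal sketch. The helper names (`candidate_script_dirs`, `find_command`) are made up for illustration, and the directories checked are only the usual defaults for the running interpreter on a Unix-like system such as macOS, so an unusual install layout may differ.

```python
import os
import shutil
import site
import sysconfig


def candidate_script_dirs():
    """Directories where pip usually places console scripts (assumed defaults)."""
    dirs = [sysconfig.get_path("scripts")]
    # `pip install --user` puts scripts under the user base directory instead
    # (the "bin" subdirectory is an assumption for macOS and Linux).
    dirs.append(os.path.join(site.getuserbase(), "bin"))
    return dirs


def find_command(name, extra_dirs=None):
    """Return the full path of `name`, looking on PATH first, then in extra_dirs."""
    found = shutil.which(name)
    if found:
        return found
    dirs = extra_dirs if extra_dirs is not None else candidate_script_dirs()
    return shutil.which(name, path=os.pathsep.join(dirs))


if __name__ == "__main__":
    print(find_command("wayback") or "wayback not found in the usual script directories")
```

Run it with the same Python interpreter that pip used for the install. If it prints a path, that full path can be used in place of the bare `wayback` command.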

cubetastic33 commented 6 years ago

Can I work on this?

asajeffrey commented 6 years ago

Go for it!

cubetastic33 commented 6 years ago

Using the instructions, I tried to play the existing archive shown in the example, but it isn't working. The terminal output is:

$ proxychains ~/servo/mach run -r --certificate-path proxy-certs/pywb-ca.pem https://www.wbez.org/
ProxyChains-3.1 (http://proxychains.sf.net)
|DNS-request| www.wbez.org 
|S-chain|-<>-127.0.0.1:9050-<--timeout
|DNS-response|: www.wbez.org does not exist

and this is the Servo window that opened: [image]

Edit:

The same thing happens when I try to record https://www.example.com or https://github.com

asajeffrey commented 6 years ago

That's odd. What output do you get in the window running wayback? It should be something like:

$ wayback --proxy WBEZ --port 8321
[Errno 2] No such file or directory: './config.yaml'
2018-09-05 10:11:34,780: [INFO]: Proxy enabled for collection "WBEZ"
2018-09-05 10:11:34,845: [INFO]: Starting Gevent Server on 8321
127.0.0.1 - - [2018-09-05 10:11:51] "POST /WBEZ/resource/postreq?matchType=exact&url=https%3A%2F%2Fwww.wbez.org%2F&closest=now HTTP/1.1" 200 16439 0.003355
::ffff:127.0.0.1 - - [2018-09-05 10:11:52] "CONNECT 34.205.97.32:443 HTTP/1.0" 200 94 0.328500
127.0.0.1 - - [2018-09-05 10:11:52] "POST /WBEZ/resource/postreq?matchType=exact&url=https%3A%2F%2Fwww.googletagmanager.com%2Fgtm.js%3Fid%3DGTM-57WD36%26gtm_auth%3DtmHQZJnpm8lhBlUlUxcf9A%26gtm_preview%3Denv-1%26gtm_cookies_win%3Dx&closest=now HTTP/1.1" 200 84206 0.003478
Dir collections/WBEZ/indexes/ unchanged
127.0.0.1 - - [2018-09-05 10:11:52] "POST /WBEZ/resource/postreq?matchType=exact&url=https%3A%2F%2Fwww.wbez.org%2Fcss%2Fcpm-1515786996172.css&closest=now HTTP/1.1" 200 53965 0.000919
Dir collections/WBEZ/indexes/ unchanged
...

cubetastic33 commented 6 years ago

@asajeffrey

(venv) cubetastic@(my hostname):~$ wayback --proxy Example --live --proxy-record --autoindex --port 8321
[Errno 2] No such file or directory: './config.yaml'
2018-09-05 21:57:48,566: [INFO]: Proxy recording into collection "Example"
2018-09-05 21:57:48,571: [INFO]: Auto-Indexing Enabled on "/home/cubetastic/collections", checking every 30 secs
2018-09-05 21:57:48,655: [INFO]: Starting Gevent Server on 8321
2018-09-05 21:57:48,656: [INFO]: Checking Collection: Example
2018-09-05 21:57:48,656: [INFO]: Checking Collection: GitHub
2018-09-05 21:58:18,673: [INFO]: Checking Collection: Example
2018-09-05 21:58:18,674: [INFO]: Checking Collection: GitHub

and

cubetastic@(my hostname):~$ proxychains ~/servo/mach run -r --certificate-path proxy-certs/pywb-ca.pem https://example.com/
ProxyChains-3.1 (http://proxychains.sf.net)
|DNS-request| example.com 
|S-chain|-<>-127.0.0.1:9050-<--timeout
|DNS-response|: example.com does not exist

and [image]

cubetastic33 commented 6 years ago

@asajeffrey Do you have any idea on what is causing the error?

asajeffrey commented 6 years ago

@cubetastic33 sorry about the delay getting back to you.

After a bit of digging, I can replicate this error by removing the proxychains.conf file, so I suspect what is going on is that for some reason proxychains isn't picking up the conf file. Are you running this command from inside the servo-warc-tests directory?

cubetastic33 commented 6 years ago

@asajeffrey No. As you can see in the output I have shown above, both of them are executed from the home directory, but the first one is inside a virtual env.

asajeffrey commented 6 years ago

Ah, you need to run the commands from the servo-warc-tests directory, so that proxychains will find its config file.
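
To illustrate why the working directory matters: the ProxyChains 3.1 documentation describes the config lookup as `./proxychains.conf`, then `$HOME/.proxychains/proxychains.conf`, then `/etc/proxychains.conf`. The sketch below mimics that search to show which file would be picked up from a given directory. `proxychains_conf_for` is a hypothetical helper, not part of proxychains, and the lookup order is an assumption based on the 3.1 documentation, so other versions may behave differently.

```python
import os


def proxychains_conf_for(cwd, home=None, etc_dir="/etc"):
    """Return the first existing config file in the assumed ProxyChains 3.1 order."""
    home = home if home is not None else os.path.expanduser("~")
    candidates = [
        os.path.join(cwd, "proxychains.conf"),                   # current directory
        os.path.join(home, ".proxychains", "proxychains.conf"),  # per-user config
        os.path.join(etc_dir, "proxychains.conf"),               # system-wide config
    ]
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    print(proxychains_conf_for(os.getcwd()))
```

Run from the home directory, a lookup like this would skip the file in the servo-warc-tests clone and fall through to a per-user or system-wide config, if one exists. That would fit the output above, where proxychains tries 127.0.0.1:9050 rather than the wayback port 8321.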

cubetastic33 commented 6 years ago

@asajeffrey I'm sorry, but where exactly would the directory you're talking about be?

asajeffrey commented 6 years ago

The clone of the servo-warc-tests repo.

cubetastic33 commented 6 years ago

@asajeffrey Great! I recorded instagram.com. I did get an error though:

ERROR 2018-09-07T15:47:58Z: script::dom::bindings::error: Error at https://staticxx.facebook.com/connect/xd_arbiter/r/0P3pVtbsZok.js?version=42#channel=f187fab468e0166&origin=https%3A%2F%2Fwww.instagram.com:60:2139 /https?/.exec(...) is null

but that just seems to be a JS error because of instagram. Now what should I do?

asajeffrey commented 6 years ago

You can ignore that error. Once you've recorded the archive, add it to the ARCHIVES file, then create a PR.

cubetastic33 commented 6 years ago

@asajeffrey I don't really get it. The ARCHIVES file just seems to be a list of all the completed sites. Don't I have to show my recorded archive as well? Also, 360.cn is there in the ARCHIVES file, but the issue is still open and it is still unchecked in the main issue (#37).

asajeffrey commented 6 years ago

Yes, add an entry Instagram: https://instagram.com/ to the ARCHIVES file, so the archive will be tested each night. (Assuming you recorded https://instagram.com/ in an archive called Instagram.)
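
For illustration, an entry of that shape could be read as below. This is a sketch only: it assumes every line of ARCHIVES has the form `Name: URL`, as in the example above, and `parse_archives` is a hypothetical helper, not the code the nightly job uses.

```python
def parse_archives(text):
    """Map archive names to URLs, assuming one `Name: URL` entry per line."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first ": " only, since the URL itself contains colons.
        name, sep, url = line.partition(": ")
        if not sep:
            raise ValueError(f"not a `Name: URL` entry: {line!r}")
        entries[name] = url.strip()
    return entries


if __name__ == "__main__":
    print(parse_archives("Instagram: https://instagram.com/"))
```

The name on the left is the collection the site was recorded into, and the URL on the right is the page to load from it.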

cubetastic33 commented 6 years ago

@asajeffrey Who will be testing the archive each night?

asajeffrey commented 6 years ago

It's run as part of our nightly testing; the results are at https://servo.org/dashboards/

cubetastic33 commented 6 years ago

@asajeffrey Then what exactly is my role in it? I mean - if all I had to do was to add that one line in ARCHIVES, anybody could do it! So - what was the reason I had to record it myself? Sorry for asking so many questions - I just didn't understand...

asajeffrey commented 6 years ago

The nightly job records the performance against the recorded archive. Recording the archive itself is still a manual process.

cubetastic33 commented 6 years ago

@asajeffrey so, doesn't that mean my Archive should be included somewhere in this repository? However, I'm just adding a line to the ARCHIVES file!

asajeffrey commented 6 years ago

IRC chat: https://mozilla.logbot.info/servo/20180907#c15281406

cubetastic33 commented 6 years ago

@asajeffrey How do you close this issue?