sugarlabs / browse-activity

Sugar activity to browse the internet; WebKit on GTK on Sugar Toolkit
GNU General Public License v2.0
8 stars 44 forks source link

HTTPError not handled while trying to copy an image. #112

Closed shaansubbaiah closed 4 years ago

shaansubbaiah commented 4 years ago

Sometimes an error is thrown while trying to copy an image.

Reproduce: right click -> copy image

Tested on: Sugar 0.117-3, Debian Bullseye Sugar 0.117, Sugar Live Build

Errors appear while trying on Google Images, but worked without issues on the same image later. Not sure how to reproduce this accurately.

# Sugar 0.117-3, Debian Bullseye

1593790508.439674 DEBUG root: ActivityService.set_active: 1.
Traceback (most recent call last):
  File "/usr/share/sugar/activities/Browse.activity/palettes.py", line 248, in __copy_image_activate_cb
    data = urllib.request.urlopen(self._image_url).read()
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
# Sugar 0.117, Sugar Live Build

** (sugar-activity3:20758): CRITICAL **: 09:12:49.385: gboolean webkit_web_view_can_execut
e_editing_command_finish(WebKitWebView*, GAsyncResult*, GError**): assertion 'g_task_is_valid(result, webView)' failed
failed to create drawable
Traceback (most recent call last):
  File "/usr/src/sugar-activities/Browse.activity/palettes.py", line 248, in __copy_image_
activate_cb
    data = urllib.request.urlopen(self._image_url).read()
  File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Saumya-Mishra9129 commented 4 years ago

Not able to reproduce with Ubuntu 20.04 sucrose version 0.117.

quozl commented 4 years ago

HTTP Error 403 is a response from the web server. So something about the request was unacceptable. My guess is that there is no referrer header. A web site can check for referrer to make sure the request comes from a browser loading a specific page.

It is better for an image to be copied out of what was already requested and downloaded. Making a subsequent request is bad because (a) it signals to anyone listening that it is happening, (b) the image may have changed, (c) extra time and network resources are used.

shaansubbaiah commented 4 years ago

@Saumya-Mishra9129 thanks for testing.

@quozl I did not get what you meant. I agree that copying a downloaded image is better but errors should not be thrown while trying to copy it directly.

It appears that we don't haven't implemented a method to handle HTTPErrors yet?

quozl commented 4 years ago

@quozl I did not get what you meant. I agree that copying a downloaded image is better but errors should not be thrown while trying to copy it directly.

Error is thrown by the web server, and we don't control the web server. I'm speculating that the web server is coded to do this when a referrer header is missing, or if a referrer header is re-used. Policy of Google Image Search. Can you confirm?

quozl commented 4 years ago

All you have to do is replicate the download using a tool such as wget, curl, or python requests. Google applies different anti-abuse and anti-theft strategies in different countries, according to local laws, regulations and politics.

If we can prove the HTTP Error 403 is expected, the next thing is to either

shaansubbaiah commented 4 years ago

@quozl Initially I tried various links with curl (with and without the referer and user-agent headers) and was able to download the image properly. I guess this was because they were static links from google/the linked site.

Finally, after trying with a google query (which I should have tried earlier) I was able to reproduce the issue on my system. The request fails with status 403 when I'm missing either the user-agent or the referer headers.

Search term was 'gradient'

// successful query (with both 'referer' and 'user-agent')
curl "https://www.google.com/search?q=gradient&espv=2&biw=1366&bih=667&site=webhp&source=lnms&tbm=isch&sa=X&ei=XosDVaCXD8TasATItgE&ved=0CAcQ_AUoAg" --user-agent "Mozilla/5.0 (X11; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0" --referer "https://www.google.com/"
quozl commented 4 years ago

Thanks. Good to know. That means the 403 is expected when Referer is absent. I refer you now to my earlier reply. :grin:

shaansubbaiah commented 4 years ago

@Saumya-Mishra9129 I can't reproduce this issue with all images, one image with which I can reproduce this consistently is:

search term: candyfloss right click - copy image

image

shaansubbaiah commented 4 years ago

@quozl turns out it is not due to the referer header but due to the user-agent. Setting the referer header did not prevent the error.

#/usr/src/sugar-activities/Browse.activity/palettes.py __copy_image_activate_cb(self, menu_item)
data = urllib.request.urlopen(self.\_image_url).read()

                    to

req = urllib.request.Request(self.\_image_url)
# req.add_header('Referer', 'http://www.google.com')
req.add_header('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0')
data = urllib.request.urlopen(req).read()

The default user-agent header used by urllib is urllib/VVV, where VVV is its version number

Perhaps a check was implemented by the website to combat bots?

quozl commented 4 years ago

Thanks. That's now, but the future could be different. Google will keep changing their code. Our code should not use urllib as a user-agent, but should use the user-agent of the WebKit2 browser. Perhaps it should use WebKit2 to fetch the image rather than Python urllib.

shaansubbaiah commented 4 years ago

After testing today, I cannot reproduce this error exactly on the same image on Sugar Live Build v0.117. Instead of the traceback reported initially in the issue, I now get Terminated by signal 11, pid 1784 activity_id 3e4a49390d0299ef2f366d5e57c79bb29a133a48.

Traceback is the same as reported earlier for Debian Bullseye.

Not sure what changed, tested Browse from master branch on both distributions.

quozl commented 4 years ago

Signal 11 is SIGSEGV, segmentation fault, a read or write to memory out of bounds or protected. Usual next step is to use gdb to obtain a backtrace at C language level. Things that might have changed are journal entries, and the activity instance directory. You should also control for these possibilities.

shaansubbaiah commented 4 years ago

@quozl how do I use gdb with sugar-activity3?

I tried: gdb sugar-activity3 (similar to the example usage of pdb in the wiki ) in the Activity directory and it resulted in errors. "/usr/bin/sugar-activity3": not in executable format: file format not recognized

Also, the activity is written in Python, and you have mentioned to obtain a backtrace at C language level. Is it maybe to check the backtrace of C libraries that were abstracted to Python?

Saumya-Mishra9129 commented 4 years ago

how do I use gdb with sugar-activity3?

@shaansubbaiah You can try these commands to run an activity with gdb. Thanks. I use them on Ubuntu. I am not sure about debian.

gdb python3
run //usr/bin/sugar-activity3
quozl commented 4 years ago

@Saumya-Mishra9129 is correct, although the // is more usually/, but any number of / can be used. Then when the segmentation fault occurs, the (gdb) prompt will appear, and can be used to examine CPU state.

Also, the activity is written in Python, and you have mentioned to obtain a backtrace at C language level. Is it maybe to check the backtrace of C libraries that were abstracted to Python?

Not abstracted, but linked. Python itself is written in C, is linked dynamically to many libraries written in C and sometimes other languages. Because the activity imports WebKit via PyGObject introspection, even more libraries are opened and mapped into virtual memory. Each of these open and map still more.

You can get an idea of the complexity by using Linux kernel APIs.

On Ubuntu 20.04 with Browse, there are about 157 libraries opened.

# readlink /proc/2503/map_files/* | sort | uniq | grep '\.so' | wc --lines
157

Looking just at libraries resolved by PyGObject introspection;

# readlink /proc/2503/map_files/*|grep typelib|cut -f6 -d'/'
SugarGestures-1.0.typelib
SugarExt-1.0.typelib
Rsvg-2.0.typelib
TelepathyGLib-0.12.typelib
Soup-2.4.typelib
WebKit2-4.0.typelib
SoupGNOME-2.4.typelib
JavaScriptCore-4.0.typelib
Atk-1.0.typelib
Gio-2.0.typelib
Pango-1.0.typelib
Gdk-3.0.typelib
Gtk-3.0.typelib
GLib-2.0.typelib
GdkX11-3.0.typelib
GObject-2.0.typelib
GdkPixbuf-2.0.typelib
cairo-1.0.typelib
GModule-2.0.typelib
xlib-2.0.typelib

When you get a SIGSEGV, you have to find out where it happened in that huge collection of executable code.

Then you have to read the source code of where it happened, examine the backtrace to find out the circumstances, and then determine the best place for a fix. Remember, professional programmers will take up a new language in a couple of days, not through any organised learning or plan to self improvement, but just because some code has failed and they have to fix it.

shaansubbaiah commented 4 years ago

@shaansubbaiah You can try these commands to run an activity with gdb. Thanks. I use them on Ubuntu. I am not sure about debian.

gdb python3
run //usr/bin/sugar-activity3

Thank you, that worked. Used run sugar-activity3 after the first line.

Traceback generated:

[Thread 0x7fff96ffd700 (LWP 871) exited]
failed to create drawable
Traceback (most recent call last):
  File "/usr/share/sugar/activities/Browse.activity/palettes.py", line 248, in __copy_image_activate_cb
    data = urllib.request.urlopen(self._image_url).read()
  File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
[Thread 0x7fff967fc700 (LWP 872) exited]

Thread 1 "B" received signal SIGSEGV, Segmentation fault.
0x00007fffe1ad259d in ?? () from /lib/x86_64-linux-gnu/libwebkit2gtk-4.0.so.37
quozl commented 4 years ago

The active thread is "B" and it failed within WebKit. Three counterfactuals;

In common, get detailed traceback. You'll need to install debug symbols for WebKit and any other libraries involved. If you've not done that before, Google Search it.

quozl commented 4 years ago

After testing today, I cannot reproduce this error exactly on the same image on Sugar Live Build v0.117. Instead of the traceback reported initially in the issue, I now get Terminated by signal 11, pid 1784 activity_id 3e4a49390d0299ef2f366d5e57c79bb29a133a48. Not sure what changed, tested Browse from master branch on both distributions.

Any of those 157 libraries. :grinning: Sometimes I use /var/log/dpkg.log to find out what might have changed during automatic security updates. If no updates were applied, then perhaps the fault is time dependent.

shaansubbaiah commented 4 years ago

@quozl just saw your comments, I'll go through them and report back.

Also, I have found that downloads in Browse are handled using downloadmanager.py which uses the WebKit Download class, whereas copying images which are handled using palettes.py uses urllib. Is there any reason for this?

gdb appears extremely useful, I'll have to take a deeper look. I haven't explored debugging tools fully and use some form of print most of the time.

Saumya-Mishra9129 commented 4 years ago

@Saumya-Mishra9129 is correct, although the // is more usually/, but any number of / can be used. Then when the segmentation fault occurs, the (gdb) prompt will appear, and can be used to examine CPU state.

@quozl Can you tell how to install gdb in Debian 10 buster?. I couldnt find out a way.

shaansubbaiah commented 4 years ago

apt install gdb should work, think that's how I had installed it on Sugar Live Build.

Saumya-Mishra9129 commented 4 years ago

@Saumya-Mishra9129 I can't reproduce this issue with all images, one image with which I can reproduce this consistently is:

search term: candyfloss right click - copy image

I tested with Debian 10 Buster, I am able to reproduce the error.

Saumya-Mishra9129 commented 4 years ago

apt install gdb should work, think that's how I had installed it on Sugar Live Build.

I don't know but It doesn't work for me. It shows-

Package gdb is not available, but is referred to by another package. This may mean that the package is missing, has been obsoleted, or is only available from another source.

E: Package 'gdb' has no installation candidate.
shaansubbaiah commented 4 years ago

Did you run apt update before that? Also make sure that /etc/apt/sources.list is not empty/ has issues for some reason.

My sources.list on the Sugar Live Build (Buster),

#/etc/apt/sources.list
deb http://deb.debian.org/debian/ buster main
deb-src http://deb.debian.org/debian/ buster main

deb http://security.debian.org/debian-security buster/updates main
deb-src http://security.debian.org/debian-security buster/updates main

# buster-updates, previously known as 'volatile'
deb http://deb.debian.org/debian/ buster-updates main
deb-src http://deb.debian.org/debian/ buster-updates main
Saumya-Mishra9129 commented 4 years ago

Also make sure that /etc/apt/sources.list is not empty/ has issues for some reason.

My sources.list on the Sugar Live Build (Buster),

Thanks It helped. I just found out that my sources.list had some issues. It is solved.apt install gdb works now.

shaansubbaiah commented 4 years ago

Just noticed that the User-Agent String is Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0 Safari/605.1.15 SugarLabs/0.116

_get_sugar_version() in browser.py which is used to generate it returns only even-numbered versions, and all the releases listed on the wiki are even-numbered. A comment in the function also indicates that they are the latest stable versions.

Is v0.117 not considered stable? I haven't found any info on the wiki marking even-numbered versions as stable.

quozl commented 4 years ago

Is v0.117 not considered stable? I haven't found any info on the wiki marking even-numbered versions as stable.

stable and unstable are software manufacturing and release engineering terms. stable means no intent to change; everything important has been fixed. unstable means intent to change; there are important things that are yet to be fixed.

Once upon a time, even numbered releases were stable. I stopped doing that. I also stopped updating the Wiki because nobody was updating it along with me.

Whether 0.118 is stable or unstable depends on whether the release is made with or without everything important being fixed. Sometimes people push me to make a release, citing downstream schedules and their desire not to manage patches. Where possible, I accommodate them, with the result you see above; unstable. I don't respond to demands to work harder; my preference is to work better. The Port to Python 3 is not yet finished for Sugar and activities; that's important to fix.

Also, I have found that downloads in Browse are handled using downloadmanager.py which uses the WebKit Download class, whereas copying images which are handled using palettes.py uses urllib. Is there any reason for this?

I've no idea. Check the commit history with git blame? Most likely the reason is everyone being slack.

_get_sugar_version() in browser.py which is used to generate it returns only even-numbered versions, and all the releases listed on the wiki are even-numbered. A comment in the function also indicates that they are the latest stable versions.

Given that we no longer use even-numbering to indicate stable, this code could change.