marco-c / autowebcompat

Automatically detect web compatibility issues
Mozilla Public License 2.0
34 stars 41 forks

Paste webcompat.com bug scrape script and update #284

Closed gabriel-v closed 5 years ago

gabriel-v commented 5 years ago

Pasted the script from https://github.com/webcompat/issue_parser/blob/master/extract_id_title_url.py and ran it once with our requirements and Python 3.7.3; it works fine.

Do we want to avoid vendoring the code like this? I could try `pip install git+https://github.com/X/Y.git`, but that repo has no packaging configuration.
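For reference, `pip install git+…` only works when the target repo ships packaging metadata. A minimal, hypothetical `setup.py` for the issue_parser repo could look like this (the package name and version here are assumptions, not anything the upstream repo provides):

```python
# Hypothetical setup.py that would make the repo pip-installable from git.
from setuptools import setup

setup(
    name='issue-parser',                    # assumed name, not upstream's choice
    version='0.1',
    py_modules=['extract_id_title_url'],    # the single script this PR vendors
)
```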

The bug count went down from about 10k to 1k with this patch. Do we want to scrape with different parameters?

marco-c commented 5 years ago

It's OK to vendor it. Here are the changes (some might be unnecessary) I made to get more bugs:

diff --git a/extract_id_title_url.py b/extract_id_title_url.py
index ee69033..9d4fbc9 100644
--- a/extract_id_title_url.py
+++ b/extract_id_title_url.py
@@ -20,23 +20,29 @@ import re
 import socket
 import sys
 import urllib2
+import time

 # Config
 URL_REPO = "https://api.github.com/repos/webcompat/web-bugs"
-VERBOSE = True
+VERBOSE = False
 # Seconds. Loading searches can be slow
 socket.setdefaulttimeout(240)

 def get_remote_file(url, req_json=False):
-    print('Getting ' + url)
-    req = urllib2.Request(url)
-    req.add_header('User-agent', 'AreWeCompatibleYetBot')
-    if req_json:
-        req.add_header('Accept', 'application/vnd.github.v3+json')
-    bzresponse = urllib2.urlopen(req, timeout=240)
-    return {"headers": bzresponse.info(),
-            "data": json.loads(bzresponse.read().decode('utf8'))}
+    while True:
+        try:
+            print('Getting ' + url)
+            req = urllib2.Request(url)
+            req.add_header('User-agent', 'AreWeCompatibleYetBot')
+            if req_json:
+                req.add_header('Accept', 'application/vnd.github.v3+json')
+            bzresponse = urllib2.urlopen(req, timeout=240)
+            return {"headers": bzresponse.info(),
+                    "data": json.loads(bzresponse.read().decode('utf8'))}
+        except:
+            print('Wait ten minutes before next request...')
+            time.sleep(600)

 def extract_url(issue_body):
@@ -81,7 +87,7 @@ def extract_data(json_data, results_csv, results_bzlike):
         # Extracting the labels
         labels_list = [label['name'] for label in issue['labels']]
         # areWEcompatibleyet is only about mozilla bugs
-        if any([('firefox' or 'mozilla') in label for label in labels_list]):
+        if True: #any([('firefox' or 'mozilla') in label for label in labels_list]):
             # Defining the OS
             if any(['mobile' in label for label in labels_list]):
                 op_sys = 'Gonk (Firefox OS)'
@@ -142,7 +148,7 @@ def get_webcompat_data(url_repo=URL_REPO):

     Start with the first page and follow hypermedia links to explore the rest.
     '''
-    next_link = '%s/issues?per_page=100&page=1' % (url_repo)
+    next_link = '%s/issues?per_page=100&page=1&filter=all&state=all' % (url_repo)
     results = []
     bzresults = []

@@ -157,7 +163,7 @@ def main():
     results, bzresults = get_webcompat_data(URL_REPO)
     # webcompatdata.csv
     with open('webcompatdata.csv', 'w') as f:
-        f.write("\n".join(results).encode('utf8'))
+        f.write("\n".join(results))
         f.write('\n')
     print("Wrote {} items to webcompatdata.csv ".format(len(results)))
     # webcompatdata-bzlike.json
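The retry loop in the diff above uses a bare `except:` and an unconditional ten-minute sleep. A Python 3 sketch of the same idea with bounded retries and a narrower exception clause might look like this (this is not the patch that landed; the `fetch_with_retry` helper and its parameters are mine):

```python
import json
import time
import urllib.error
import urllib.request

def fetch_with_retry(fetch, attempts=3, wait=600, sleep=time.sleep):
    """Call fetch(), retrying on network errors up to `attempts` times."""
    for attempt in range(attempts):
        try:
            return fetch()
        except (urllib.error.URLError, OSError) as exc:
            if attempt == attempts - 1:
                raise  # give up after the last attempt instead of looping forever
            print('Request failed (%s), waiting before retry...' % exc)
            sleep(wait)

def get_remote_file(url, req_json=False):
    def fetch():
        req = urllib.request.Request(
            url, headers={'User-agent': 'AreWeCompatibleYetBot'})
        if req_json:
            req.add_header('Accept', 'application/vnd.github.v3+json')
        with urllib.request.urlopen(req, timeout=240) as resp:
            return {'headers': resp.info(),
                    'data': json.loads(resp.read().decode('utf8'))}
    return fetch_with_retry(fetch)
```

Injecting `sleep` as a parameter keeps the helper testable without real ten-minute waits.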
codecov-io commented 5 years ago

Codecov Report

Merging #284 into master will decrease coverage by 0.75%. The diff coverage is 0%.


@@            Coverage Diff             @@
##           master     #284      +/-   ##
==========================================
- Coverage   15.54%   14.79%   -0.76%     
==========================================
  Files          13       14       +1     
  Lines        1885     1981      +96     
  Branches      327      344      +17     
==========================================
  Hits          293      293              
- Misses       1590     1686      +96     
  Partials        2        2
Impacted Files Coverage Δ
extract_id_title_url.py 0% <0%> (ø)


Last update a9776f2...2d69474.

marco-c commented 5 years ago

webcompatdata-bzlike.json is still being used, so we can't remove it yet. You can add an explanation to the README on how to re-generate it.
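A README addition along those lines might look like the following sketch (the invocation is inferred from the script and output filenames in this PR's diff, not taken from existing docs):

```
## Regenerating webcompatdata-bzlike.json

Run the vendored webcompat.com scraper; it writes webcompatdata.csv and
webcompatdata-bzlike.json to the current directory:

    python extract_id_title_url.py
```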