lukasschwab / stackit

StackOverflow queries from the command line
MIT License

use Py-StackExchange rather than requests and bs4 #13

Closed WnP closed 9 years ago

WnP commented 9 years ago

answers are now truncated to 140 characters in the listing, followed by ... if they are longer

let me know whether you think it's a good idea
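For reference, the truncation described above could look roughly like the sketch below. The function name `truncate` and the choice not to count the `...` toward the 140-character limit are assumptions for illustration, not necessarily what the PR does.

```python
def truncate(text, limit=140, suffix="..."):
    """Return text unchanged if it fits, else its first `limit` characters plus `suffix`.

    Hypothetical helper: the name and the exact cut-off rule are assumptions.
    """
    if len(text) <= limit:
        return text
    return text[:limit] + suffix


print(truncate("short answer"))  # fits within the limit, printed as-is
print(truncate("x" * 200))       # cut to 140 characters, then "..."
```

Text that is exactly 140 characters long is left alone here; only longer text gets the trailing marker.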

lukasschwab commented 9 years ago

This was a big thing for us, and hopefully it will improve speed as well.

I'll take a look at this later; there was some discussion in an issue on the Py-StackExchange repo about getting just the answer bodies through the API. There's a good chance you've implemented that. It sounds like a great way to eliminate unnecessary data transfer (the rest of the page's HTML, which we download with requests/bs4).

Just dropping this here for my own reference when reviewing. Cheers!

WnP commented 9 years ago

I hadn't seen that issue before, but I read up on the StackExchange API and Py-StackExchange's source code before implementing this feature, so yes, it's implemented ;-)

and yes, I think it's more efficient to deal only with the JSON API rather than full HTML requests

lukasschwab commented 9 years ago

@WnP This looks suuuuper clean. Starting testing, hopefully will merge by EOD.

lukasschwab commented 9 years ago

@WnP I dig it, merging.

I will make some small modifications to the way the output is printed myself, just because I think it is easier to implement those changes than communicate them. Very minor, just adding some newlines here and there. Will do new release with those changes.

Also, I notice an anecdotal speed difference... Do you?

Thanks!

WnP commented 9 years ago

@lukasschwab yes, the speed difference is anecdotal from the client side; let's compare the two approaches with this simple script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from timeit import timeit
import stackexchange
from stackexchange import Sort
import bs4
import requests
import html2text

h = html2text.HTML2Text()
term = 'python flask'
API_KEY = "3GBT2vbKxgh*ati7EBzxGA(("
so = stackexchange.Site(stackexchange.StackOverflow, app_key=API_KEY, impose_throttling=True)
questions = so.search_advanced(
    q=term,
    sort=Sort.Votes)
question = None

for q in questions:
    if 'accepted_answer_id' in q.json:
        question = q
        break
else:
    raise Exception('No question found')

def old_way_query(question):
    questionurl = question.json['link']
    answerid = question.json['accepted_answer_id']
    response = requests.get(questionurl)
    soup = bs4.BeautifulSoup(response.text)
    # Focuses on the single div with the matching answerid--necessary b/c bs4 is quirky
    for answerdiv in soup.find_all('div', attrs={'id': 'answer-' + str(answerid)}):
        answertext = h.handle(answerdiv.find('div', attrs={'class': 'post-text'}).prettify())

def new_way_query(question):
    answerid = question.json['accepted_answer_id']
    questiontext = h.handle(so.question(question.id, body=True).body)
    answer = h.handle(so.answer(answerid, body=True).body)

print('old way: %s' % timeit("old_way_query(question)", "from __main__ import question, old_way_query", number=20))
print('new way: %s' % timeit("new_way_query(question)", "from __main__ import question, new_way_query", number=20))

on my laptop using Python 2.7.9 it outputs:

old way: 12.9633069038
new way: 0.572069883347

so in this case (20 executions) it's about 22.7 times faster; the more executions you run, the larger the speedup

for one execution the difference is really anecdotal

old way: 0.849025964737
new way: 0.543494939804

about 1.56 times faster ^^

however, these tests are highly dependent on the network connection
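Since single runs are this noisy, one option is to repeat each measurement several times and keep the fastest round, then compute the ratio from those. The sketch below is only an illustration: `best_time`, `speedup`, `slow_stub` and `fast_stub` are hypothetical names, and the two stubs are local stand-ins for `old_way_query` and `new_way_query` so that it runs without network access.

```python
from timeit import repeat


def best_time(func, number=20, rounds=5):
    """Call `func` `number` times per round; return the fastest round's total seconds."""
    return min(repeat(func, number=number, repeat=rounds))


def speedup(old_seconds, new_seconds):
    """How many times faster the new timing is than the old one."""
    return old_seconds / new_seconds


# Hypothetical stand-ins for the two query functions (no network involved).
def slow_stub():
    sum(range(20000))


def fast_stub():
    sum(range(1000))


old = best_time(slow_stub)
new = best_time(fast_stub)
print('old way: %s' % old)
print('new way: %s' % new)
print('speedup: %.2fx' % speedup(old, new))
```

To use it with the script above, pass `lambda: old_way_query(question)` and `lambda: new_way_query(question)` in place of the stubs. Taking the minimum does not remove the dependence on the network, it only makes a single slow response less likely to skew the comparison.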