korymath / talk-generator

talk-generator is capable of generating coherent slide decks based on a single topic suggestion.
MIT License

WikiHow Advanced Search Login Problem #42

Closed twinters closed 5 years ago

twinters commented 5 years ago

For several months now, our WikiHow searcher has only been able to use the basic search engine, not the advanced one. I'm not sure whether this is fixable or whether they blocked access to it. The scraper just sees a message as if the user weren't logged in, even though a logged-in session is provided.

korymath commented 5 years ago

@twinters is there a unit test which covers this? What is the code to reproduce?

korymath commented 5 years ago

python tests/test_wikihow.py
WARNING: Problem logging in on Wikihow: Advanced Search disabled
..
----------------------------------------------------------------------
Ran 2 tests in 2.520s

OK

Currently there is a WikiHow unit test, but it passes and only prints a warning if there is a problem logging into WikiHow.

Should we modify this? Perhaps there should be a more specific integration test that ensures the connection is working.
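A minimal sketch of what such a stricter test could look like, using only the standard `unittest` module. The `log_in` callable is a hypothetical stand-in for the project's real login helper, which is not shown in this thread; the point is only that a failed login fails the test instead of passing with a warning.

```python
import unittest
from typing import Callable


def run_login_check(log_in: Callable[[], bool]) -> unittest.TestResult:
    """Run a one-test suite that requires `log_in()` to report success.

    `log_in` is a hypothetical stand-in for the real WikiHow login helper.
    """

    class AdvancedSearchLoginTest(unittest.TestCase):
        def test_login_succeeds(self):
            # Fail the test outright instead of only emitting a warning.
            self.assertTrue(
                log_in(), "WikiHow login failed; advanced search is unavailable"
            )

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AdvancedSearchLoginTest)
    result = unittest.TestResult()
    suite.run(result)
    return result


# With a login helper that reports failure, the suite is no longer "OK".
print(run_login_check(lambda: False).wasSuccessful())
```

In a real test module the stand-in would be replaced by the actual login call, and the test could be kept separate from the offline unit tests since it needs network access.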

twinters commented 5 years ago

Yes, I have been looking into this a lot, and I'm not sure why the scraper stopped working. It might be that WikiHow changed how the site works. Going to https://www.wikihow.com/index.php?title=Special:Search&title=Special%3ASearch&profile=default&search=cat&fulltext=Search&ss=relevance&so=desc&ffriy=1&ffrin=1&fft=ffta&fftsi= works in the browser as long as you're logged in. So either our login method no longer works properly, or you now need JavaScript processing or something similar to parse the interesting content of the page. I looked into adding Etherium, but that would add some complicated dependencies that I don't think would be worth it, and it didn't seem to solve the problem when I tried implementing it.
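For reproducing the request outside the browser, the advanced search URL above can be rebuilt from a search term roughly as follows. This is a sketch: the parameter names and values are copied from that URL, the function name is made up for illustration, and the duplicated `title` parameter is sent only once on the assumption that the repeat is redundant.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.wikihow.com/index.php"


def build_advanced_search_url(term: str) -> str:
    """Build the advanced search URL for `term` (hypothetical helper)."""
    params = {
        "title": "Special:Search",  # appears twice in the original URL
        "profile": "default",
        "search": term,
        "fulltext": "Search",
        "ss": "relevance",
        "so": "desc",
        "ffriy": "1",
        "ffrin": "1",
        "fft": "ffta",
        "fftsi": "",  # empty in the original URL
    }
    # urlencode percent-encodes the values and keeps the insertion order.
    return BASE_URL + "?" + urlencode(params)


print(build_advanced_search_url("cat"))
```

Fetching this URL still requires a logged-in session, which is the part that appears to be failing.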

I agree that the current test and warning situation is not ideal though, given it is currently a broken feature.

korymath commented 5 years ago

What about using an API wrapper?

Python Requests Query

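import requests

# RapidAPI endpoint of the WikiHow wrapper.
url = "https://hargrimm-wikihow-v1.p.rapidapi.com/steps"

# Number of steps to request.
querystring = {"count": "3"}

headers = {
    "x-rapidapi-host": "hargrimm-wikihow-v1.p.rapidapi.com",
    "x-rapidapi-key": "KEYKEYKEY",  # placeholder; substitute a real key
}

# A timeout keeps the call from hanging if the service does not respond.
response = requests.get(url, headers=headers, params=querystring, timeout=10)

print(response.text)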

Response

{
    "1": "Find a tight red top.", 
    "2": "Purchase or pick good plums.", 
    "3": "Contact your veterinarian if your dog's temperature is higher than 103 degrees Fahrenheit."
}
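If the response body has the shape shown above (a JSON object keyed by step number), it could be turned into an ordered list along these lines. This is a sketch based only on that sample; the helper name is made up, and the real API may return other shapes.

```python
import json


def ordered_steps(response_text: str) -> list:
    """Parse the JSON body and return the steps sorted by numeric key."""
    payload = json.loads(response_text)
    # Keys are strings such as "1", "2", "10"; sort them as integers.
    return [payload[key] for key in sorted(payload, key=int)]


# Sample body copied from the response above.
sample = """{
    "1": "Find a tight red top.",
    "2": "Purchase or pick good plums.",
    "3": "Contact your veterinarian if your dog's temperature is higher than 103 degrees Fahrenheit."
}"""

for number, step in enumerate(ordered_steps(sample), start=1):
    print(number, step)
```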

korymath commented 5 years ago

This returns random steps or images...

twinters commented 5 years ago

Yeah, I saw that API earlier: it does something quite different from what we need. Btw, note that "normal" WikiHow search still works completely in our system; only the "advanced search" (which finds more pages, even pages that are not yet approved) is broken.
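One way to tolerate this situation is a simple fallback: try the advanced search first and use the basic one when it fails or returns nothing. The sketch below assumes both searchers are plain callables taking a query and returning a list of titles; those names and signatures are hypothetical, not the project's actual API.

```python
from typing import Callable, List, Optional

# Hypothetical searcher signature: query in, list of article titles out.
Searcher = Callable[[str], Optional[List[str]]]


def search_with_fallback(query: str, advanced: Searcher, basic: Searcher) -> List[str]:
    """Return advanced results if there are any, otherwise basic results."""
    try:
        results = advanced(query)
    except Exception:
        # Treat any failure (for example a rejected login) as "no results".
        results = None
    if results:
        return results
    return basic(query) or []


# Example with stand-in searchers: the advanced one is unavailable.
print(search_with_fallback("cat", lambda q: None, lambda q: ["Pet a Cat"]))
```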

korymath commented 5 years ago

But it looks like it actually might be working...

python tests/test_wikihow.py
No Wikihow Session object in credentials, attempting log in...
Requests login failed. Unable to continue login.
.['ts', 'Communicate with Your Cat', 'Pet a Cat', 'Draw a Cat', 'Draw a Cat Using the Word Cat', 'Be Cat Like', 'Care for a Cat with Feline Leukemia', 'Get a Cat for a Pet', 'Like Cats', 'Discipline Your Cat or Kitten']
.
----------------------------------------------------------------------
Ran 2 tests in 1.685s

OK

korymath commented 5 years ago

Let’s lose the advanced search “fun”ctionality

twinters commented 5 years ago

Yeah, good call. Disabled the advanced WikiHow functionality as of https://github.com/korymath/talk-generator/commit/213d062a4b8497410c4179337d44265356cc7074