html parsing - Githubissues

ironman5366 commented 8 years ago

I should point out that for the Google searches, which are W.I.L.L's last resort so to speak, the parsing is horrible. The very basic solution implemented now is just an example of what I want it to become at some point in the future. If anybody is skilled at parsing HTML, let me know and we can work on this together. Otherwise I'll see what I can do, but it's not my area of expertise.

Rich700000000000 commented 8 years ago

I've actually written tons of web scrapers in Python and Ruby. What do you need extracted?

Google searches are tricky because the Google homepage has so much weird JavaScript and dynamic content.

ironman5366 commented 8 years ago

That would be awesome if you could help. So basically, I want to try to find a direct answer to a question from HTML. When W.I.L.L can't find an answer from WolframAlpha or Wikipedia, he searches Google and gets the HTML from the top result. I'd like to parse that. Right now as an example I just look for a

tag (I know it's terrible, I just included it as an example of what I want it to be eventually). You can find the code in plugins/search/search.py
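As a rough illustration of the kind of extraction described here, the sketch below pulls the text of the first non-empty element with a given tag out of an HTML string, using only the standard library. It is not the code in plugins/search/search.py; the class name, the function name `first_text_in_tag`, and the default tag `"p"` are assumptions made for the example.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class FirstTagTextParser(HTMLParser):
    """Hypothetical helper: collects the text of the first non-empty <tag> element."""

    def __init__(self, tag):
        HTMLParser.__init__(self)
        self.target = tag.lower()  # HTMLParser reports tag names in lowercase
        self.inside = False
        self.chunks = []
        self.result = None

    def handle_starttag(self, tag, attrs):
        # Start collecting only while no answer has been found yet.
        if tag == self.target and self.result is None:
            self.inside = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == self.target and self.inside:
            self.inside = False
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self.chunks).split())
            if text:
                self.result = text

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


def first_text_in_tag(html, tag="p"):
    """Return the text of the first non-empty <tag> element, or None."""
    parser = FirstTagTextParser(tag)
    parser.feed(html)
    parser.close()
    return parser.result
```

Something like `first_text_in_tag(page_html)` would then give a candidate answer string, or `None` if the page has no matching element. A real page would likely need more filtering (navigation text, cookie banners, and so on) than this sketch attempts.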

Rich700000000000 commented 8 years ago

Just so I understand the steps:

1. The user submits a query to WILL.
2. WILL checks WolframAlpha for an answer.
3. If WA has no answer, it checks Wikipedia.
4. If Wikipedia has no answer, it just googles the question.

and to expand on 4:

1. WILL finds no response from Wikipedia, and so googles the question.
2. WILL extracts the HTML from the page.
3. WILL parses the HTML for the top link.
4. WILL gives the top link to the user.

Is this correct?

ironman5366 commented 8 years ago

That's the gist of it, yes. Although technically, he searches Google before Wikipedia, and if Wikipedia is in the top 5 results he checks Wikipedia. Otherwise he goes straight to Google.
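A minimal sketch of that routing rule, assuming the Google step has already produced an ordered list of result URLs. The names `is_wikipedia`, `choose_source`, and `top_n` are hypothetical and not taken from the W.I.L.L codebase.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7


def is_wikipedia(url):
    """True if the URL's host is wikipedia.org or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


def choose_source(result_urls, top_n=5):
    """Pick where to look for an answer.

    Returns ("wikipedia", url) if a Wikipedia page appears in the first
    top_n results, ("web", first_url) otherwise, and (None, None) when
    there are no results at all.
    """
    for url in result_urls[:top_n]:
        if is_wikipedia(url):
            return ("wikipedia", url)
    if result_urls:
        return ("web", result_urls[0])
    return (None, None)
```

The caller would then hand a `"wikipedia"` hit to the Wikipedia lookup and a `"web"` hit to the HTML parsing step.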

Rich700000000000 commented 8 years ago

Got it.

And looking over your other answer, do I edit the one in plugins or built-ins? I was originally looking at the one in built-ins, but was that wrong? Search isn't built in?

ironman5366 commented 8 years ago

The builtins folder is useless at this point; it was used in an older build. I was keeping it around because it still had some modules I hadn't switched over to the new framework, but I think it can be removed now. I'll remove it from the repo when I get a chance.

Rich700000000000 commented 8 years ago

Got it. Forking now.

Also, two last things just for future reference:

1. Does it have to be Google, or are other search engines OK? Other search engines may have simpler APIs.
2. Does it have to be Python 2.7, or is Python 3 OK?

ironman5366 commented 8 years ago

It was coded using Google, and I've found it to be fairly simple, as I'm not using their API but just urllib to use Ajax. I have nothing against using something else if it's better, though. It should probably be Python 2.7, since that's what the framework's written in, and there's a lot of encoding stuff in the search module that I've optimized for 2.7 and that would have to be changed for 3.

krmaxwell commented 8 years ago

Curious - why scrape instead of just using the API?

ironman5366 commented 8 years ago

@krmaxwell To be honest, when I was writing this I just googled "python google search" and this was the first thing that I found. I haven't seen the API, but if it's better I'd be happy to switch.

tanmoydeb07 commented 8 years ago

"Does it have to be google, or are other search engines OK? Other search engines may have simpler API's." ...we may try Duck Duck Go; it is open source and exposes APIs.

ironman5366 commented 8 years ago

Any APIs could work; I'm open to suggestions.
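On the Duck Duck Go suggestion, here is a hedged sketch of what using its Instant Answer API might look like. It assumes the public JSON endpoint at `https://api.duckduckgo.com/` and the `Answer`, `AbstractText`, and `Definition` response fields. The helper names `build_ddg_url` and `pick_answer` are hypothetical, and nothing here comes from the W.I.L.L code.

```python
import json

try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7

# Assumed endpoint of the DuckDuckGo Instant Answer API.
DDG_ENDPOINT = "https://api.duckduckgo.com/"


def build_ddg_url(query):
    """Build the request URL for a query, asking for JSON without HTML markup."""
    params = [("q", query), ("format", "json"), ("no_html", "1")]
    return DDG_ENDPOINT + "?" + urlencode(params)


def pick_answer(payload):
    """Return the first non-empty answer-like field from a decoded response."""
    for key in ("Answer", "AbstractText", "Definition"):
        value = payload.get(key)
        if value:
            return value
    return None


# Example with a hand-written payload (not a real API response):
sample = json.loads('{"Answer": "", "AbstractText": "Paris is the capital of France."}')
answer = pick_answer(sample)
```

The response body would be fetched from `build_ddg_url(...)` with urllib and decoded with `json.loads` before being passed to `pick_answer`. Many queries come back with all of these fields empty, so the HTML parsing fallback would probably still be needed.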

ironman5366 / W.I.L.L

html parsing #18