ironman5366 closed 7 years ago
I've actually written tons of web scrapers in Python and Ruby. What do you need extracted?
Google searches are tricky because the Google homepage has so much odd JavaScript and dynamic content.
That would be awesome if you could help. Basically, I want to find a direct answer to a question from HTML. When W.I.L.L can't find an answer from Wolfram Alpha or Wikipedia, he searches Google and gets the HTML from the top result. I'd like to parse that. Right now, as an example, I just look for a
tag (I know it's terrible; I only included it as an example of what I want it to be eventually). You can find the code in plugins/search/search.py
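To make the "parse the top result" step concrete, here is a minimal sketch of one possible heuristic. It is not the code in plugins/search/search.py. It assumes the candidate answers live in paragraph (`<p>`) elements, which is only a guess, and the names `ParagraphCollector` and `best_paragraph` are made up for illustration. It picks the paragraph that shares the most words with the question.

```python
import re

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collects the visible text of each <p> element."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.paragraphs = []
        self._chunks = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A new <p> implicitly closes an unclosed one.
            self.flush_paragraph()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush_paragraph()

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)

    def flush_paragraph(self):
        # Join the pieces and collapse runs of whitespace.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []
        self._in_p = False


def words(text):
    """Lowercased set of alphanumeric tokens in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_paragraph(html, question):
    """Return the paragraph sharing the most words with question, or None."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser.flush_paragraph()  # pick up a trailing unclosed <p>
    wanted = words(question)
    best, best_score = None, 0
    for paragraph in parser.paragraphs:
        score = len(wanted & words(paragraph))
        if score > best_score:
            best, best_score = paragraph, score
    return best
```

Word overlap is crude (it counts "the" and "is" like any other word), so treat this as a starting point rather than a real answer extractor.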
Just so I understand the steps:
and to expand on 4:
Is this correct?
That's the gist of it, yes. Technically, though, he searches Google before Wikipedia, and if Wikipedia is in the top 5 results he checks Wikipedia. Otherwise he goes straight to Google.
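The ordering described above could be sketched roughly like this. It is only an illustration of the routing rule, assuming the search step already produced an ordered list of result URLs; `pick_source` and its return values are hypothetical names, not anything from the W.I.L.L codebase.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7


def pick_source(result_urls, top_n=5):
    """Decide where to look for an answer, given ordered search result URLs.

    Returns ("wikipedia", url) if a Wikipedia page is among the first top_n
    results, ("google", first_url) otherwise, and (None, None) if there are
    no results at all.
    """
    for url in result_urls[:top_n]:
        host = urlparse(url).netloc.lower()
        if host == "wikipedia.org" or host.endswith(".wikipedia.org"):
            return ("wikipedia", url)
    if result_urls:
        return ("google", result_urls[0])
    return (None, None)
```

Checking the host name rather than searching the whole URL for "wikipedia" avoids false matches on pages that merely mention Wikipedia in their path.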
Got it.
And looking over your other answer, do I edit the one in plugins or built-ins? I was originally looking at the one in built-ins, but was that wrong? Search isn't built in?
The builtins folder is obsolete at this point; it was used in an older build. I was keeping it around because it still had some modules I hadn't switched over to the new framework, but I think it can be removed now. I'll remove it from the repo when I get a chance.
Got it. Forking now.
Also, two last things just for future reference:
It was coded using Google, and I've found it fairly simple since I'm not using their API, just urllib to make Ajax requests. I have nothing against using something else if it's better, though. It should probably be Python 2.7, since that's what the framework is written in, and there's a lot of encoding handling in the search module that I've optimized for 2.7 and that would have to change for 3.
Curious - why scrape instead of just using the API?
@krmaxwell To be honest, when I was writing this I just googled "python google search" and this was the first thing I found. I haven't looked at the API, but if it's better I'd be happy to switch.
"Does it have to be Google, or are other search engines OK? Other search engines may have simpler APIs." ...we could try DuckDuckGo; it is open source and exposes APIs.
Any APIs could work; I'm open to suggestions.
I should point out that the Google search is W.I.L.L's last resort, so to speak, and the parsing there is horrible. The very basic solution implemented now is just an example of what I want it to become at some point. If anybody is skilled at parsing HTML, let me know and we can work on this together. Otherwise I'll see what I can do, but it's not my area of expertise.
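For reference, a rough sketch of what the DuckDuckGo route might look like. It assumes the public Instant Answer endpoint at `https://api.duckduckgo.com/` with `format=json`, and that the JSON response carries `Answer`, `AbstractText` and `Definition` fields; those details are from memory and should be checked against the DuckDuckGo docs before relying on them. The function names are made up for illustration.

```python
import json

try:
    from urllib.parse import urlencode  # Python 3
    from urllib.request import urlopen
except ImportError:
    from urllib import urlencode, urlopen  # Python 2.7

# Assumed endpoint of the DuckDuckGo Instant Answer API.
DDG_ENDPOINT = "https://api.duckduckgo.com/"


def build_url(query):
    """Build the request URL for a query (parameter names are assumptions)."""
    params = [("q", query), ("format", "json"), ("no_html", "1")]
    return DDG_ENDPOINT + "?" + urlencode(params)


def extract_answer(payload):
    """Return the first non-empty answer-like field of a decoded response."""
    for key in ("Answer", "AbstractText", "Definition"):
        text = payload.get(key)
        if text:
            return text
    return None


def instant_answer(query):
    """Fetch and decode an instant answer; performs a network request."""
    response = urlopen(build_url(query))
    try:
        return extract_answer(json.loads(response.read().decode("utf-8")))
    finally:
        response.close()
```

Something like `instant_answer("eiffel tower height")` would then return a short text answer or `None`; many queries have no instant answer, so the HTML parsing fallback would probably still be needed.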