singhj / locality-sensitive-hashing

Getting started with a development environment #1

Closed singhj closed 10 years ago

singhj commented 10 years ago
  1. Download and install Python 2.7.
  2. Download and install Eclipse. There are many versions. You want this one: Eclipse IDE for Java Developers.
  3. Set up Eclipse to work with Python and Google App Engine. These instructions look good. Eclipse changes a lot so the instructions may not be for the latest version.
  4. The latest versions of Eclipse come with the Git plug-in already installed. See if you can synchronize with this repository.
  5. The instructions for getting started with Google App Engine (Python) are here.

tbrooks007 commented 10 years ago

Is it okay to use PyCharm instead of Eclipse? I already have PyCharm set up to use my Python 2.7 virtualenv, and I have installed Google App Engine and can create and run an app from the IDE. I am having trouble installing the PyDev plugin into Eclipse. I am getting an error during the install/download process saying that it can't verify the installation, and there is also a NullPointerException.

singhj commented 10 years ago

Whatever works for you as far as the IDE is concerned, Teresa.

Congratulations on getting the app to run from IDE.

Best,

J Singh

tbrooks007 commented 10 years ago

Hi

I'm going to work on getting our sample Twitter API code to run in Google App Engine. I'm not sure if anyone gave it another try, but I have been doing a little reading and figured these tips may be helpful:

You must add the Python module code, or symlinks to the packages you have installed:

https://developers.google.com/appengine/docs/python/#Python_Pure_Python

An example of using the Python library httplib2 and the Google OAuth2 client:

https://github.com/muanis/foursquare-oauth-bootstrap

I'll post some notes once I get this working.

singhj commented 10 years ago

I did give it another try and concluded (perhaps erroneously?) that I didn't have any code to handle the callback from Twitter (callback_URL). So I'm reading up on that and seeing if that's the culprit.

Somewhere I have working code that uses Facebook OAuth and runs on App Engine. I'll take a look at that if the callback_URL thing yields nothing.

tbrooks007 commented 10 years ago

Ahh okay.

I got the script working in a GAE project locally (running via PyCharm). I kept the printing to stdout, and it writes to the console in PyCharm. There are a few things I noticed while testing locally.

  1. When running the code in the project, unlike when I run the script from the command line, it wasn't able to read the Twitter API key environment variables I created. To get around that, I just hard-coded them in the script for quick testing.
  2. Technically GAE doesn't support long-lived HTTP requests (I did some research on this), but I was able to get my GAE project working locally by doing the following:
    • Run the GAE project locally via PyCharm.
    • Go to the localhost URL for App Engine and hit the URL twice (not sure why this is necessary, honestly). Then the tweets from the API that the script prints to stdout show up in the PyCharm console.
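
One way to avoid hard-coding the keys (item 1 above) is to look them up in the environment first and only fall back to values supplied for local testing. This is a minimal sketch, not the project's actual code: the variable names and the `load_twitter_credentials` helper are hypothetical, since the thread does not say what the real variables are called.

```python
import os

# Hypothetical variable names; the real ones are not given in this thread.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_credentials(environ=None, fallback=None):
    """Return a dict of credentials, preferring the environment over the fallback.

    Raises KeyError naming every missing key, so a misconfigured run fails
    loudly instead of sending empty credentials to the API.
    """
    environ = os.environ if environ is None else environ
    fallback = fallback or {}
    creds = {}
    missing = []
    for key in REQUIRED_KEYS:
        # An empty string in the environment counts as "not set".
        value = environ.get(key) or fallback.get(key)
        if value:
            creds[key] = value
        else:
            missing.append(key)
    if missing:
        raise KeyError("missing Twitter credentials: " + ", ".join(missing))
    return creds
```

When the app is started from an IDE, the variables may need to be set in the run configuration (or otherwise passed to the process) for the environment lookup to succeed; the `fallback` argument is only meant for quick local tests and should never be committed with real keys.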

Note that I can only see the tweets in the console, since I'm writing them to stdout. Maybe this is similar to the behavior you were seeing? I haven't tried pushing and running the code in the cloud yet.

I will push my PyCharm GAE project as another example. I will remove the hard-coded API keys, of course :)

tbrooks007 commented 10 years ago

After doing a bit more digging on using Twitter's streaming API on Google App Engine, it looks like it isn't supported: GAE's implementation of urllib in the urlfetch API does not support sockets, and hence doesn't support the persistent connections that Twitter's streaming API needs.

I'm thinking that the reason my GAE project still worked locally was perhaps that it was using my local Python 2.7 virtualenv's version of urllib?

Since there is no way to poll the public Twitter sample stream, if we still want to use it as our data source we would need to wrap our calls to the Twitter streaming API in a process on another (non-GAE) box that dumps the tweets somewhere (a database, a small one-node Solr instance, an Elasticsearch cluster, etc.). Then we just need to set up some sort of endpoint that our GAE app can hit to get the tweets. We can definitely do polling on GAE. They also offer the Channel API, but that only allows a persistent client-to-GAE-server connection, so I'm not sure it will work with third-party APIs that require persistent connections.
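
The wrapping idea above can be sketched with nothing but the standard library. This is an illustrative sketch under stated assumptions, not a definitive design: the `TweetSpool` class, its file format (one JSON object per line), and the byte-offset cursor are all hypothetical choices, and a database or an Elasticsearch index could take the file's place. The collector process on the non-GAE box would call `append` for each tweet it receives, and the endpoint that the GAE app polls would call `read_since` with the cursor returned by the previous poll.

```python
import json
import os


class TweetSpool(object):
    """Append-only JSON-lines file shared by a collector and a polling endpoint."""

    def __init__(self, path):
        self.path = path

    def append(self, tweet):
        """Store one tweet (a JSON-serializable dict) as a single line."""
        # json.dumps escapes non-ASCII by default, so the line is pure ASCII.
        line = (json.dumps(tweet, sort_keys=True) + "\n").encode("utf-8")
        with open(self.path, "ab") as handle:
            handle.write(line)

    def read_since(self, offset=0, limit=100):
        """Return (tweets, next_offset) for up to `limit` tweets after `offset`.

        `offset` is a byte position previously returned by this method, so a
        poller can resume exactly where it left off.
        """
        tweets = []
        if not os.path.exists(self.path):
            return tweets, offset
        with open(self.path, "rb") as handle:
            handle.seek(offset)
            while len(tweets) < limit:
                line = handle.readline()
                if not line.endswith(b"\n"):
                    break  # end of file, or a line that is still being written
                tweets.append(json.loads(line.decode("utf-8")))
                offset = handle.tell()
        return tweets, offset
```

A poller would start with `offset=0`, keep the returned offset, and pass it back on the next request. Because nothing is deleted, the same file can also be replayed from the beginning to re-test the application on previously collected tweets.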

If you guys like this idea, I volunteer to set up a micro AWS box to host the endpoint for our Twitter streaming API calls.

Related links...

https://groups.google.com/forum/#!topic/google-appengine/l0FotoLPRso
https://groups.google.com/forum/#!topic/google-appengine/CMg6BkhT0_c
https://dev.twitter.com/discussions/18339

wschwerdt commented 10 years ago

Sounds like getting real time data in finance. In theory simple, in practice a permanent hassle.

I like the idea of wrapping.

I'm in NH skiing this week. Will be more active next week. G finished setting up my laptop and deployed the fake GAE test app.

--Wolfgang

tbrooks007 commented 10 years ago

Indeed. Have fun skiing...I am not very jealous of you :)

If everyone else thinks the wrapping is a reasonable idea, I can get started on that this week, then move on to stubbing out the pipeline we discussed in our last meeting.

singhj commented 10 years ago

I think if we have to set up a separate box, then GAE is just not right for this framework. We may as well set up Django on AWS, use that, and totally give up on GAE.

But maybe not.

That SO response from Nick Johnson is old. GAE has changed a lot since the time he was involved with it. I found this post which seems to suggest that people are having some success.

I may get a chunk of time this weekend to try it. And if it doesn't work, then we just bail and don't look back, I think.

tbrooks007 commented 10 years ago

Okay, sounds good to me. I have looked at that Stack Overflow post a few times, and there is no indication that the person asking the question was able to get streaming working. Though the post asks about the streaming API (using tweepy), judging by the comments no one there has actually tried the streaming API with GAE; even the example in the linked git repo was not an example of using the streaming API. There are other features of both tweepy and twython, such as polling for user account tweets, that will work well with GAE. The last post I saw that said there was no support for persistent HTTP connections was from 2013. Also, after looking over the GAE documentation, I didn't see anything, especially in the urlfetch API docs, that mentioned support for sockets.

Like I said, I had some success locally, but I think that's because I was using my own environment's version of urllib, which supports sockets. It would be nice if we could get this working with GAE since they give us some free stuff.

singhj commented 10 years ago

In that case, let's dump it.

I'll put up a Django instance in AWS -- probably tomorrow -- unless you want to jump on it today.

tbrooks007 commented 10 years ago

I can do something on my box or on a new instance. I doubt I'll be able to get to it today but likely tomorrow. I should be able to stub out some stuff for our app as well.

Few questions before getting started:

  1. What do you want to trigger calling the streaming API? A scheduled cron job? A RESTful API? I was thinking we could just set up a script that wakes up and gets tweets every so often and dumps them somewhere for later use.
  2. Should we store/dump the tweets we get? If so, would you be into using Elasticsearch? If we had this set up, we could easily test and re-test our application using tweets previously mined from the stream. We could store any other data we want there as well. Just a simple single-node cluster for now.
  3. If we only need a simple web client, is it okay to try a lighter-weight Python web framework? I'm fine with Django as well.

wschwerdt commented 10 years ago

Gosh, this is becoming an expedition into uncharted technical territory for the simple-minded statistician. But let's go, the more I learn the better.

@1: I think a regular cron job to build up a continuously expanding repository is best.

@2: Elasticsearch (never heard of it) looks interesting to me.

@3: No opinion. What is Django?

Sorry for being so unhelpful. My high time will come when we get to the actual algorithm...

:-)

tbrooks007 commented 10 years ago

LOL! We need everyone's know-how. Statistics isn't a simple subject by any means :) I don't think you are being unhelpful. I just want to make sure everyone is okay with what we are proposing.

Thanks for the feedback...as for your questions:

  1. Elasticsearch is a distributed search engine that allows for near real-time indexing. It is very easy to configure, set up, and query (though the DSL has a bit of a learning curve). https://github.com/elasticsearch/elasticsearch
  2. Django is a web framework for Python.

sashaffer commented 10 years ago

Were you thinking about using Flask as a lighter-weight python web framework alternative to Django?

Thanks, -Scott

tbrooks007 commented 10 years ago

I wasn't thinking of Flask specifically, but that could be one option. I have only used Django (and only a little). I was thinking that it might be overkill for our purposes, but there's nothing wrong with Django.

Have you used Flask or any other Python web frameworks?

sashaffer commented 10 years ago

I haven't personally used Flask. It's been on my radar recently because a coworker showed it to me for a project he was working on, and I attended a Meetup a few months back where people presented use cases that leveraged Flask. I was also thinking Django might be overkill for what we're doing.

tbrooks007 commented 10 years ago

Sweet! If J is okay with trying out Flask, I'm down. In the meantime I'll set up the EC2 instance to host the Twitter cron job. I'll also start stubbing out some stuff for the library and push it to the repo for feedback. I should be able to work on this tonight.

singhj commented 10 years ago

I'm totally OK with Flask.

tbrooks007 commented 10 years ago

Yay! Okay cool. I'll send some info tonight on where I am on setting up the box.

singhj commented 10 years ago

Got somewhere!

I was able to fetch tweets from within the Google App Engine environment using Tweepy. The code is a little convoluted at the moment, and has the remnants of the App Engine guestbook application and a whole bunch of stuff we don't need, but it has been checked in.

The instructions are available in the README.

tbrooks007 commented 10 years ago

Yay! That's awesome!

Quick question: after looking over read_tweepy.py, it looks like you are using tweepy's public_timeline(), which gives 20 new tweets every 60 seconds according to the API docs? Do we not want to use the sample firehose anymore? Tweepy also has a streaming API. If we don't want to use the streaming API, I'll hold off on setting up the scripts on my AWS box and will focus on making sure I can run your GAE project.

http://pythonhosted.org/tweepy/html/api.html#API.public_timeline
http://answers.oreilly.com/topic/2605-how-to-capture-tweets-in-real-time-with-twitters-streaming-api/

singhj commented 10 years ago

@tbrooks007, I think we do want to use the streaming API. I just didn't get that far yesterday. One of the imports in Tweepy was broken and I ended up fetching an older version of streaming.py. Not sure what impact that has. I did end up raising an issue on tweepy and, late last night, the author came back with a suggestion on how to get around it. In other words, what we have in streaming.py is not consistent with the rest, so getting it to work might not be a slam dunk — sigh.

That O'Reilly article is a great find.

Our idea of having pluggable modules might also extend to the data collection part of the equation and support a GAE version and another one that runs on AWS. But there is another side: we can distract ourselves with all these frameworks and things and never get to the meat of what we're trying to accomplish. What do you think?

It feels like we have some momentum and we have learned a lot in the last few weeks, so why don't I write something up about our vision and how we might be able to accomplish it? And meet next week? Are you going to the meetup? Perhaps we can meet after it ends?

tbrooks007 commented 10 years ago

Cool, thanks for checking with the tweepy author. Yep, I totally agree about getting distracted by frameworks and their nuances. It is easy to get down in the weeds and never get the real project done. We do have momentum, and I think writing up something regarding our vision would be awesome. I think that would help a lot. I can't meet next week because I'll be traveling to NYC for work. I will be back Sunday March 3rd. I am available this Sunday though, even if it's just for a Google Hangout or Skype chat.

I may be going to the meetup tonight, but it really depends on how work goes today. If I can get out of the office on time, I'll be there. I'll email you later in the day to let you know if I can make it.

singhj commented 10 years ago

Josh (the author of tweepy) reminded me that App Engine now supports sockets. So turning sockets on will help us with streaming anyway.

This Sunday is too soon — I won't have written up my stuff. Let's plan on talking next Sunday (3/1) by Skype or Google Hangout. @tbrooks007, will you be back in town by 3:30 that day?

Scott, we sometimes use email for communication and I don't have yours. LMK please.

tbrooks007 commented 10 years ago

I may be back by 3:30, but I also have to pick up my dog from his daycare/boarding place. I'd say more like 5 PM.

singhj commented 10 years ago

I have commitments starting at 5:00. Let's do Tuesday or Wednesday evenings, 3/4 or 3/5.

J Singh

tbrooks007 commented 10 years ago

Both Monday (3/4) and Tuesday (3/5) work for me.

tbrooks007 commented 10 years ago

Did we ever settle on a day and time for the meeting next week?

singhj commented 10 years ago

We didn't settle on a date and time for meeting. Any preference between Tuesday or Wednesday of this week?

plarkoski commented 10 years ago

Wednesday would be much better for me because I have interviews all day on Tuesday and Wednesday. Patricia

tbrooks007 commented 10 years ago

I can do either day...Wednesday would be good for me though.

singhj commented 10 years ago

Hi everyone, This week I'm getting killed with a bunch of deadlines so would prefer to meet on Monday 3/10 at 7:00 in Davis Square. Does Diesel work for everyone?

Best,

wschwerdt commented 10 years ago

Suits me very well. I am alone with the kids this week and could not make it in the evenings anyway.

--Wolfgang

sashaffer commented 10 years ago

Hi,

I can't do this upcoming Monday due to a scheduling conflict, but I'd like to know what comes out of the meeting.

Thanks, -Scott

plarkoski commented 10 years ago

Hi all, I can't make it tonight because it is my second wedding anniversary and I'll be having dinner with my husband. Like Scott, I'd like to know the outcome. Thanks, Patty