Hi wking,
I am Yashasvi, a computer science undergrad from India. I am pursuing my Bachelors of Technology in Computer Science and Engineering from IIIT Hyderabad.
Regarding my open source experience, I have been a Google summer of code student for the past two years, for Benetech in 2013 and Mozilla in 2014.
I am assuming this is the right place to discuss more about this project.
First of all, congrats to NumFOCUS for being selected as an organisation in GSoC this year.
I want to express my interest in working on this idea. I think I have the skills and experience needed for this project.
If possible, I would like to discuss more specifically about the project, so that I can build a clear timeline for the project for the purpose of my proposal.
On Tue, Mar 03, 2015 at 01:56:43AM -0800, yashasvi girdhar wrote:
I am Yashasvi, ...
Good to meet you, Yashasvi :).
I am assuming this is the right place to discuss more about this project.
Yup.
If possible, I would like to discuss more specifically about the project, so that I can build a clear timeline for the project for the purpose of my proposal.
I think what parts get implemented when will depend on the student's previous experience. If you're comfortable with a particular web framework already, it shouldn't take too long to knock out the aggregator. If you're new to writing web services, it will take a bit more time ;). If you sketch out your background a bit, focusing on the tools and skills listed in the project idea above, it will help me get a sense of how the project might develop for you. If you have questions about the current idea description, please ask :).
Hi @wking ,
Thanks for your quick response.
First of all, I would like to clear some questions regarding the project :
The reason for using a relational database is that we would have fairly structured data to store, right? If we are sure about the relational database, then we’ll have to design the database schema beforehand. For example, one table contains records of the users and one table contains records of the workshops, in such a way that the information stored in them can be joined by the user.
The schema would also depend heavily on what types of transactions we would be performing on the database. So, I think we need to figure that out too, as the schema will play a major role in making the query process fast and efficient.
Other thing is : We would be modifying the current scripts to send the results, right? You have indicated “optionally” there, so are you thinking of some separate method to send the results?
And, I observed that the current script provides the user with the solutions if an error is produced while running the script. So, do we have to add/modify something to that part also?
Regarding the skills,
Talking of the technologies that you listed, I have used Python in some of my projects and can understand and write it with no difficulty. In fact, I have already downloaded the Python scripts and understood most of them.
Regarding web frameworks,
I am well experienced in JEE (Java Platform Enterprise Edition). I did a summer internship last year with Works Applications, Tokyo. There, I built a standalone web application from scratch (around 2000 lines of code), in which the front end was built with JavaServer Pages, JavaScript and Twitter Bootstrap, and the back end with the Java Servlet API. The application was deployed at the end of the internship period (8 weeks) and is being used by people at the company.
I have some experience with web2py framework also. I built a photo blog application with it some time back.
I have experience of working on different API’s and have developed quite an understanding of developing a new API from scratch.
Regarding relational databases, I have developed a standalone application for a restaurant, which involved first designing a database for the restaurant (I used SQL) and then running queries on it that the user provided through an interface. Having been an Android application developer for quite some time now, I have also used SQLite many times.
But, I think what’s more important is that I have experience of learning new technologies when required. In both of my previous GSoC internships, I had to learn completely new APIs and technologies and understand large chunks of unknown code. So, I am ready to learn any new framework for this project, if you prefer one. I don’t think it will take much time to start learning the new technologies and, moreover, Google also provides a 20-day period between the start of the program and the coding period. I think that would be the perfect time for doing that.
I do understand that if I choose to use a framework with which you are not familiar, I may be on my own for the project and I am prepared for that.
Talking of that, we have to build an interface for the administrators also, right? So, I was wondering what the interface would contain.
I think the questions that I have asked would help me greatly to build up my timeline.
P.S. You can find code for all of my projects on my github : https://github.com/itsyash
On Tue, Mar 03, 2015 at 12:22:46PM -0800, yashasvi girdhar wrote:
The reason for using a relational database is that we would be having a pretty structured data to store, right?
Yeah. And it's easier to offload complex queries to the storage engine if you're using a relational database.
If we are sure about the relational database, then we’ll have to design the database schema beforehand.
It's good to pick a system that allows database migrations. For example, Django has them [1]. That way you can adjust the schema later if you don't get it completely right out of the gate. But you'll need a schema to start with, yes.
For eg. one table contains records of the users, one table contains records of the workshops, in way such that the information stored in them can be joined by the user.
That sounds like amy [2]. For this particular application, I'm less interested in who's reporting the results. I expect we'll have an easier time getting submissions if the submissions are at least mostly anonymous. I've sketched the models I think we need in the original spec's “Approach” section.
… what all type of transactions we would be performing on the database. So, I think we need to figure out that also, as the schema will play a major role in making the query process fast and efficient.
I think we'll want a single submission from a swc-installation-test-2.py run, containing the system information and test results it currently collects. That covers getting data in, and probably only needs a single API endpoint. Then we'll want an API and possibly a UI for running queries on the stored data (e.g. “What fraction of last month's submitting hosts had Git < 1.8?” or “What fraction of last year's submitting hosts were running Linux?”).
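As a rough illustration of the first example query, here is a minimal sketch using Python's built-in sqlite3 module. The `results` table, its columns, and the helper names are assumptions made up for this example, not part of the project spec, and the version parser assumes purely numeric dotted versions.

```python
import sqlite3


def parse_version(text):
    """Turn '1.7.10' into (1, 7, 10) so versions compare numerically.

    Assumes purely numeric dotted versions; real version strings may
    need more careful parsing.
    """
    return tuple(int(part) for part in text.split("."))


def fraction_older_than(conn, package, minimum):
    """Fraction of hosts whose detected version is below `minimum`.

    Returns None when no versions were recorded for the package.
    """
    rows = conn.execute(
        "SELECT version FROM results "
        "WHERE package = ? AND version IS NOT NULL",
        (package,),
    ).fetchall()
    if not rows:
        return None
    limit = parse_version(minimum)
    older = sum(1 for (version,) in rows if parse_version(version) < limit)
    return older / len(rows)


# Hypothetical single-table layout, just enough to run the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (host TEXT, package TEXT, version TEXT)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("a", "git", "1.7.10"),
        ("b", "git", "1.9.1"),
        ("c", "git", "2.3.0"),
        ("d", "git", "1.7.9"),
    ],
)
print(fraction_older_than(conn, "git", "1.8"))  # 0.5
```

Restricting this to last month's submissions would only need a timestamp column and one more WHERE condition. The version comparison is done in Python here because plain string comparison in SQL would sort '1.10' before '1.8'.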
We would be modifying the current scripts to send the results, right? You have indicated “optionally” there, so are you thinking of some separate method to send the results?
No, I'm thinking we modify those scripts. The optionally is just “I can handle this part if the student only has time to implement the server side of the project.” I don't expect this part to take much time, but it's nice to have a safety valve if we get behind schedule.
I observed that the current script provides the user with the solutions if an error is produced while running the script. So, do we have to add/modify something to that part also?
I don't think we need to adjust the user-facing output from the script, other than to mention “We've submitted these results to https://install.software-carpentry.org/ to help improve our setup instructions.” or something that lets folks know we posted their results.
I am well experienced in JEE (Java Platform Enterprise Edition).
Personally, Java seems a bit heavy for this project, and support for Java isn't great in the free-software community. As far as I know, the only from-source option is IcedTea [3], and that's a big dependency that's not present by default on many Linux distributions (e.g. I'm on Gentoo, and we don't have any Java by default). I'd recommend pushing for something based on a scripting language for this project. If you already have experience with one web application, it shouldn't be hard to pick up another. That cuts both ways though, and I'd follow along with a Java implementation if you land the gig and aren't interested in using a scripting language ;).
I have some experience with web2py framework also. I built a photo blog application with it some time back.
I'd certainly prefer this to a Java solution ;). It seems less popular than Django or Flask (at least going by GitHub stars and contributor counts). Can you explain why you picked it for your photo blog?
But, I think what’s more important is that I have the experience of learning new technologies when required.
This is important ;). There's always more to learn :).
Hi @wking ,
Thanks for answering the queries.
I agree with you that java would be a bit heavy for the project.
Django sounds like a good option, but I suggest you have a look at web2py. It's more lightweight than Django and I think you will find it a suitable fit for our project. There are several reasons why I chose web2py for my photo blog and I think those reasons can fit for our project also. Some of them are :
- it’s based on python, therefore more fast and scalable than a framework like Rails.
- it’s easy to learn and focuses on rapid development
- it has quite a reputation for database driven applications (in reference to our project)
- follows some good practices such as Model View Controller design and server side form validation.
- some really good things such as a web-based integrated development environment, web-based management interface and a Database Abstraction Layer that writes SQL for you in real time.
- It supports many relational databases, as well as migrations also.
It’s not as popular as other python frameworks such as Django or flask, but I think it can provide us what all we need in the project.
On a second note, if you are skeptical about using it, I am ready to go with Django. Going through some docs today, I noticed some similarities between both the frameworks and therefore, it should not take me much time to get a good grip on Django.
Apart from that, I get the part on the API’s. Once we decide the technology, I’ll dig deeper into that to find more about how to implement the API.
P.S. I am trying to understand the second script meanwhile, and am sure that I would be able to modify it in the given time.
Thanks.
On Fri, Mar 06, 2015 at 11:20:38AM -0800, yashasvi girdhar wrote:
There are several reasons why I chose web2py for my photo blog and I think those reasons can fit for our project also. Some of them are :
it’s based on python, therefore more fast and scalable than a framework like Rails.
I doubt there's a huge performance distinction between RoR and the Python frameworks. Any of these should be performant enough for this project.
- it’s easy to learn and focuses on rapid development
- it has quite a reputation for database driven applications (in reference to our project)
- follows some good practices such as Model View Controller design and server side form validation.
- some really good things such as a web-based integrated development environment, web-based management interface and a Database Abstraction Layer that writes SQL for you in real time.
- It supports many relational databases, as well as migrations also.
I think all of these apply to Django too, with the exception of the web-based IDE. That's not a big win for me, anyway, since I'm very comfortable with my local Emacs. I'd be surprised if a per-framework IDE is worth learning unless you're doing a lot of work in the given framework.
Flask is “bring your own database-abstraction layer”, but I've enjoyed using Flask+SQLAlchemy on previous projects. It's a good fit (in my experience) for API implementations, but if you want integrated form generation and parsing it's probably better to go with Django or web2py. On the other hand, I'd be fine having an API-only webserver in Flask+SQLAlchemy (or any framework) and then having a client-side UI in a JavaScript framework (like AngularJS) for constructing queries and viewing the results.
On a second note, if you are skeptical about using it, I am ready to go with Django. Going through some docs today, I noticed some similarities between both the frameworks and therefore, it should not take me much time to get a good grip on Django.
Yeah, picking up Django on the level needed for this project shouldn't take long. I'd suggest looking it over, but if you end up preferring web2py or another framework that's fine with me.
Hi @wking ,
I am Aditya Narayan. I am currently pursuing my Bachelors in electronics and electrical communication engineering.
I have quite a bit of experience working with Django. I have hosted an online judge and a registration portal for a student chapter I am part of. You can find them here and here. I am interested in this project and I believe I have the required skill-set.
I have started working on creating a Django app as a draft. I am currently reading the script. I wish to use the requests module to generate POST requests to the Django app from the script. The Django app would then save the logs/errors to a DB.
Please correct me if my understanding of the approach is flawed.
On Sat, Mar 07, 2015 at 06:54:13AM -0800, Aditya Narayan wrote:
I wish to use the requests module to generate POST requests to the Django app from the script. The Django app would then save the logs/errors to a DB.
That sounds reasonable to me. Point me at your repository if you want more feedback. I'd avoid the requests module though, in favor of the Python standard library's urllib [1,2]. The scripts are running on novice-installed systems, so the fewer dependencies we need, the better.
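To make the standard-library suggestion concrete, here is a minimal sketch of posting results as JSON with urllib.request. It assumes Python 3; the `/submissions` path, the payload shape, and the function names are placeholders invented for this example, and the host is only the example URL mentioned earlier in the thread.

```python
import json
import urllib.request

# Hypothetical endpoint: the thread names the host only as an example,
# and the '/submissions' path is an assumption.
SUBMIT_URL = "https://install.software-carpentry.org/submissions"


def build_request(results, url=SUBMIT_URL):
    """Build a JSON POST request without sending it."""
    body = json.dumps(results).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(results, url=SUBMIT_URL, timeout=10):
    """Send the results and return the HTTP status, or None on failure.

    Network problems are swallowed so that a failed upload never breaks
    the installation test itself.
    """
    try:
        request = build_request(results, url)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except OSError:  # URLError and HTTPError are both OSError subclasses
        return None


# Build (but do not send) a request for an illustrative payload.
request = build_request({"git": "2.3.0"})
print(request.get_method(), request.full_url)
```

Splitting request construction from sending keeps the payload easy to test without network access. A script that must also run on Python 2 would need the urllib2 equivalents instead.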
@wking Thanks for redirecting me to urllib. Using the least number of dependencies was something I didn't consider. Can you please suggest improvements to further augment this project. Are there other use cases where the result-aggregation server would be useful?
On Sat, Mar 07, 2015 at 04:28:26PM -0800, Aditya Narayan wrote:
Can you please suggest improvements to further augment this project. Are there other use cases where the result-aggregation server would be useful?
I'm not sure what you mean. Do you mean “will other projects want to aggregate different things besides installation-test results”? Or “will SWC instructors/admins want to perform other queries besides the ones sketched out in [1]”?
My guess for the first is “absolutely”, but I doubt it's worth making the implementation from this project generic enough to handle them. I expect the implementation here to be small and focused, which makes it easier to maintain. If folks want to aggregate something else, they can write their own small, focused aggregator as a separate project (or fork this one).
My guess for the second is also “absolutely”, but I think we do want to handle those additional queries in this project. It should be easy to answer the kind of queries I suggested above, filtering by submission time, and investigating the results. Pretty graphs would be nice. Using something like Kibana [2] on the client side with Elasticsearch [3] on the backend would give users the flexibility to perform fairly sophisticated analysis on their own, but you'd want the public server standing between Elasticsearch and the rest of the world [4]. So something like:
    PostgreSQL          Elasticsearch
        |                     |
        Your aggregation service
        |                     |
    Client submission   Kibana analysis
would be pretty slick to use and lightweight to write. If you want to support flexible analysis through some other mechanism (e.g. writing a custom UI in AngularJS, providing an API for downloading subsets of the results for local analysis, …) that's fine too. So long as you have a plan for how this is going to work.
[4]: “Don’t run Elasticsearch open to the public”
Hi @wking,
I am a third year computer science undergraduate from India, and have prior experience with Django (web interface for Docker, application where a user according to its role can upload files and comment), Python (~1K LOC) and MySQL.
I want to express my interest in this project. I have tested and understood the code of the installation-test scripts. I am trying to make a Django draft app, so I wanted to know how the data would need to be stored on the server. Will the following fields in the table suffice : package_name, version_present, required_version, workhop_id ?
Please correct me if my understanding of the task is wrong.
Thanks
Hi @wking ,
So after your reply, I started looking on the technologies that you had mentioned.
I started with Flask and, in order to get familiar with it, I tried to build a microblog over this weekend (similar to the one that has been mentioned in the documentation) and I must say I am quite impressed with it. The intention behind suggesting web2py was that it’s quite lightweight compared to Django, but I am simply amazed by Flask, by its ability to offer so much in spite of being quite lightweight. I think it provides everything the project needs, with more convenience.
I also spent time on SQLAlchemy. After reading about it, I played with it for some time and found it very powerful and easy to use. From the research that I did, I believe SQLAlchemy provides much more control over the database and a more powerful ORM compared to the default ORM provided by Django.
I actually like the idea of making the backend and frontend independent. It will offer much more freedom as well as flexibility on both ends. I would like to go with Flask + SQLAlchemy as the backend and AngularJS as the frontend. Having found some pretty good articles on it, it’s surely a tried and tested method. I have decent experience working with JavaScript, so instead of learning it, I will have ample time to spend on the productive part of the front end.
If we use Angular on the front end, there are plenty of JavaScript libraries out there for visualising the data (e.g. D3) which have been widely accepted and offer endless creativity to the user. Regarding Elasticsearch, I believe we can easily integrate it with Flask on the backend.
I do not want to rush into the implementation before properly designing the API first, as I don’t want to end up implementing something that I would have to throw away.
So, if you are good with my choice of technology, I’ll start to prepare a first draft of the api and a roadmap of the project listing all the use cases.
Thanks.
On Sun, Mar 08, 2015 at 10:06:39AM -0700, yashasvi girdhar wrote:
So, if you are good with my choice of technology, I’ll start to prepare a first draft of the api and a roadmap of the project listing all the use cases.
It sounds reasonable to me, although whenever you are integrating with a bunch of new tooling (Flask, SQLAlchemy, AngularJS, D3, …), it's good to have backup plans in case one of them turns out to be more complicated than you expect. I'd try to start off as simple as possible (and the API for pushing data is a good place for that), and then start building the server in tiny increments to get an API for retrieving data and performing simple queries before layering on the flashy frontend stuff.
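One way to picture the "as simple as possible" starting point is a single endpoint that accepts a JSON submission. The sketch below uses only the standard library's WSGI helpers rather than Flask, purely to keep the example dependency-free; the `/submissions` path, the in-memory list standing in for the database, and the `post` helper are all assumptions made for illustration.

```python
import io
import json
from wsgiref.util import setup_testing_defaults

SUBMISSIONS = []  # in-memory stand-in for the real database


def app(environ, start_response):
    """Minimal WSGI app: accept JSON via POST /submissions."""
    if (environ["REQUEST_METHOD"] != "POST"
            or environ.get("PATH_INFO") != "/submissions"):
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found\n"]
    length = int(environ.get("CONTENT_LENGTH") or 0)
    raw = environ["wsgi.input"].read(length)
    try:
        payload = json.loads(raw.decode("utf-8"))
    except ValueError:  # covers both bad UTF-8 and bad JSON
        start_response("400 Bad Request", [("Content-Type", "text/plain")])
        return [b"invalid JSON\n"]
    SUBMISSIONS.append(payload)
    start_response("201 Created", [("Content-Type", "application/json")])
    return [json.dumps({"id": len(SUBMISSIONS)}).encode("utf-8")]


def post(body):
    """Call the app in-process with a fake POST; return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)
    environ.update({
        "REQUEST_METHOD": "POST",
        "PATH_INFO": "/submissions",
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.input": io.BytesIO(body),
    })
    statuses = []
    chunks = app(environ, lambda status, headers: statuses.append(status))
    return statuses[0], b"".join(chunks)


print(post(b'{"git": "2.3.0"}'))
```

Because the app can be exercised in-process like this, the push API can be tested before any storage layer or frontend exists, and the handler body maps directly onto a route in whichever framework ends up being chosen.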
On Sat, Mar 07, 2015 at 10:53:21PM -0800, Darshan Agarwal wrote:
Hi @wking,
Hi Darshan :).
… so I wanted to know how the data would need to be stored on the server, Will the following fields in the table suffice : package_name, version_present, required_version, workhop_id ?
It's up to you to decide how to represent the data in SQL, but I'd suggest more than one table. For example, you'll want to store information about the user's system and upload timing, and then for that upload there will be a reasonable number of separate packages that are installed (or not). Besides the version installed, there may be packages where we couldn't find an installed version. In that case, sometimes we have a reason where the version detection broke down, so we might want more diagnostic detail than just “no known version”.
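A minimal sketch of what "more than one table" could look like, again with sqlite3 from the standard library. The table names, column names, and sample values are illustrative assumptions, not the schema agreed for the project.

```python
import sqlite3

# Hypothetical two-table layout: one row per upload, plus one row per
# package checked during that upload.
SCHEMA = """
CREATE TABLE submission (
    id INTEGER PRIMARY KEY,
    submitted_at TEXT NOT NULL,   -- upload time, ISO 8601
    os_name TEXT NOT NULL,        -- e.g. 'Linux', 'Darwin', 'Windows'
    os_version TEXT
);
CREATE TABLE package_result (
    id INTEGER PRIMARY KEY,
    submission_id INTEGER NOT NULL REFERENCES submission(id),
    package TEXT NOT NULL,
    version TEXT,                 -- NULL when no version was detected
    diagnostic TEXT               -- why detection failed, if known
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Store one upload with two package results (made-up sample data).
cursor = conn.execute(
    "INSERT INTO submission (submitted_at, os_name, os_version) "
    "VALUES (?, ?, ?)",
    ("2015-03-08T10:06:39", "Linux", "3.18"),
)
submission_id = cursor.lastrowid
conn.executemany(
    "INSERT INTO package_result "
    "(submission_id, package, version, diagnostic) VALUES (?, ?, ?, ?)",
    [
        (submission_id, "git", "2.3.0", None),
        (submission_id, "nosetests", None, "command not found"),
    ],
)

# Which packages had no detected version, and on which OS?
missing = conn.execute(
    "SELECT s.os_name, p.package, p.diagnostic "
    "FROM package_result AS p "
    "JOIN submission AS s ON s.id = p.submission_id "
    "WHERE p.version IS NULL"
).fetchall()
print(missing)  # [('Linux', 'nosetests', 'command not found')]
```

Keeping the diagnostic text in its own nullable column lets "not installed" and "installed but version detection broke" be told apart later.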
Hi,
I absolutely agree with you on both the points. I am going to write my roadmap for the project in a way such that I would be building the api incrementally, and would have time to switch the technology, if something is not working. I am confident that I can pull it off with this combination.
I started writing the roadmap of the project yesterday and couldn’t resist writing a schema first, as it will help me decide what data I need to send and store.
I have shared a google doc with you, where I have written the schema that I propose and some queries that can be handled by it.
I do understand that I may need to change it in future based upon more insights that I gain into the project. But I have made a basic one that I’ll treat as a sample to work on other parts on the projects.
Please have a look at it.
Meanwhile, I will spend my time on other parts of the projects, such as the api to send the data to the server, as it would be the first step.
Thanks.
Hi @wking,
we've been talking for some time now, but I never introduced myself :-) I'm Piotr, 3rd year Automatics Control and Robotics student. I participated in GSoC 3 times, last year with @gvwilson.
I have a couple of questions for you.
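- How good are installation testing scripts? I glimpsed at swc-installation-test-2.py and it's huuuge. Does it work correctly on all operating systems?
- Regarding Flask and SQLAlchemy - I took it for a ride a few times already. It was not easy… I think I prefer Django for a project that has to be easy to maintain in a long run.
- You mentioned Elastic Search. I always wanted to check it out - but never had a chance. Do you think it's suitable for this project? It definitely helps with advanced aggregation and stuff.
Here are some ideas:
- Match installation feedback with workshop requirements (maybe from Amy?) to see more stats. I believe we don't track what lessons are used for workshops (and therefore what software is required)?
- Maybe add workshop-template/requirements.txt (not necessarily a Python dependencies file) that's read by installation testing script and adjusts CHECKS entries accordingly?
Cheers, Piotr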
On Wed, Mar 11, 2015 at 03:08:15PM -0700, Piotr Banaszkiewicz wrote:
we've been talking for some time now, but I never introduced myself :-)
Hi Piotr :). I feel like I know your software side pretty well from your amy work ;).
- How good are installation testing scripts? I glimpsed at swc-installation-test-2.py and it's huuuge. Does it work correctly on all operating systems?
We haven't gotten a lot of feedback, and I personally don't have access to a Windows machine for testing (and only occasionally have access to an OS X machine). But as far as I know, everything works smoothly on all of our supported OSes. Pull requests to trim down the script are welcome, but there's a lot going on here (it's basically a teensy package manager with both the code and the package tree all in one file). Splitting the components into multiple modules in a Python package would make it easier to read, but it wouldn't be as easy for our novice instructors to download and use.
- Regarding Flask and SQLAlchemy - I took it for a ride a few times already. It was not easy… I think I prefer Django for a project that has to be easy to maintain in a long run.
That matches my feelings. Django is large and opinionated, but it handles everything I need in ways I like or can work with. I don't want to have to form opinions about handling concurrent sessions, and would rather leave that to the framework. On the other hand, if someone else feels more comfortable in a different framework, I'm happy to support them. This project is scoped narrowly enough that it should be hard to create a large server codebase unless you decide to skip frameworks and libraries entirely ;).
- You mentioned Elastic Search. I always wanted to check it out - but never had a chance. Do you think it's suitable for this project? It definitely helps with advanced aggregation and stuff.
We use Elasticsearch and Kibana at work, and I like them a lot. However, that approach would mean that there are a lot of pieces in play, and you'd want to be fairly comfortable on the network/sysadmin side before heading down this route.
- Match installation feedback with workshop requirements (maybe from Amy?) to see more stats. I believe we don't track what lessons are used for workshops (and therefore what software is required)?
I'm not sure what you're suggesting here. Can you give an example?
- Maybe add workshop-template/requirements.txt (not necessarily a Python dependencies file) that's read by installation testing script and adjusts CHECKS entries accordingly?
I think it's a good idea, but it's independent of aggregating the results. I've spun it off into wking/swc-setup-installation-test#2.
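For readers following the spun-off idea, a minimal sketch of how a per-workshop requirements file could narrow the checks. The file format (one check name per line, `#` comments) and both function names are assumptions invented for this example; the real CHECKS entries in the script may be structured differently.

```python
def read_requirements(lines):
    """Return check names from an iterable of lines.

    Assumed format: one name per line; blank lines and anything after
    '#' are ignored.
    """
    names = []
    for line in lines:
        name = line.split("#", 1)[0].strip()
        if name:
            names.append(name)
    return names


def select_checks(all_checks, required):
    """Keep only the checks the workshop asks for, in their original order."""
    wanted = set(required)
    return [check for check in all_checks if check in wanted]


# Illustrative values only; not the script's actual CHECKS list.
CHECKS = ["bash", "git", "python", "sqlite"]
workshop_file = ["git   # version control lesson", "", "python"]
print(select_checks(CHECKS, read_requirements(workshop_file)))
```

Reading from any iterable of lines means the same helper works on an open file or on text fetched from the workshop homepage.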
Hi @wking ,
I have worked on the feedback provided by you on the schema.
Also, I have prepared a timeline that I would like to follow during the project. I have shared the doc with you. Please have a look and provide your feedback on that.
Regarding the installation script, I have completely understood all of it and thought of documenting it on a wiki page here : https://github.com/itsyash/swc-setup-installation-test/wiki/Information-about-swc-installation-script-2, so that it may help others. I think I understand how I would be adding the function to send the data.
Also, I wanted to ask you what should be my next step towards the project. Should I start preparing my proposal if you are fine with the timeline, or do you want me to work on something specific first?
Thanks.
On Thu, Mar 12, 2015 at 12:00:50PM -0700, yashasvi girdhar wrote:
Regarding the installation script, I have completely understood all of it and thought of documenting it on a wiki page here : https://github.com/itsyash/swc-setup-installation-test/wiki/Information-about-swc-installation-script-2, so that it may help others.
Looks good to me. I doubt the class hierarchy tree is particularly useful, but the rest of that would make a nice comment outlining the implementation. Do you want to write that up and submit a PR against wking/swc-setup-installation-test?
Also, I wanted to ask you what should be my next step towards the project. Should I start preparing my proposal if you are fine with the timeline, or do you want me to work on something specific first?
From the questionnaire we submitted to Google [1]:
What is your plan for dealing with disappearing students?
We will try to pick students who will not disappear by requiring all students to have submitted at least one patch that passes review and is pushed into the code base in order to be considered…
A PR with that comment outlining the installation-test script would probably check that box.
Looks good to me. I doubt the class hierarchy tree is particularly useful, but the rest of that would make a nice comment outlining the implementation. Do you want to write that up and submit a PR against wking/swc-setup-installation-test?
Done. Firstly, I placed the comments with the respective functions, but then I thought that writing them at one place on top of the file would be of more help.
Hello @r-gaia-cs ,
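This is Yashasvi. I am interested in this project and as you can see from the above conversation, I have been working on this project for some days now and have done things like :
- prepared the first draft of the schema with feedback from @wking .
- a PR merged here.
- proposed a timeline of the project that I would like to follow, with approval from @wking, and
- I have been working on the technologies that I would be using, to get a good hold of them.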
With all these things, I think I am ready to write my proposal for the project but before doing that, I wanted to ask you if there is anything else that I can do to strengthen my application.
I would really appreciate if you could help me decide what should be my next step from here.
Thanks for your time.
Hi @yashasvi,
I am interested in this project
Thanks.
I have been working on this project for some days now and have done things like :
- prepared the first draft of the schema with feedback from @wking .
- a PR merged here.
- proposed a timeline of the project that I would like to follow, with approval from @wking, and
- I have been working on the technologies that I would be using, to get a good hold of them.
With all these things, I think I am ready to write my proposal for the project but before doing that, I wanted to ask you if there is anything else that I can do to strengthen my application.
This sounds good to me.
Hey @wking
We haven't gotten a lot of feedback, and I personally don't have access to a Windows machine for testing (and only occasionally have access to an OS X machine). But as far as I know, everything works smoothly on all of our supported OSes. Pull requests to trim down the script are welcome, but there's a lot going on here (it's basically a teensy package manager with both the code and the package tree all in one file).
Other instructors here in Krakow have Windows boxes so we can sit down one day and test this script as thoroughly as we can. If I have time, I will post updates to this script.
Splitting the components into multiple modules in a Python package would make it easier to read, but it wouldn't be as easy for our novice instructors to download and use.
I agree.
This project is scoped narrowly enough that it should be hard to create a large server codebase unless you decide to skip frameworks and libraries entirely ;).
Yes! Let's do it in Bash! Or Matlab :-)
- Match installation feedback with workshop requirements (maybe from Amy?) to see more stats. I believe we don't track what lessons are used for workshops (and therefore what software is required)?
I'm not sure what you're suggesting here. Can you give an example?
I believe we don't track workshop topics. For example, at my first and only workshop, 2015-02-21-Krakow, we had bash, python, git, sql. Is that information stored anywhere? (Apart from debriefing session).
If not, we could use https://github.com/wking/swc-setup-installation-test/issues/2 as a helper to populate a list of topics for a workshop (and save it in Amy, for example, or in this project's outcome application). Then we can easily match technology required for specific workshop with installation script feedback.
- Maybe add workshop-template/requirements.txt (not necessarily a Python dependencies file) that's read by installation testing script and adjusts CHECKS entries accordingly?
I think it's a good idea, but it's independent of aggregating the results. I've spun it off into wking/swc-setup-installation-test#2.
I have yet to sit down and think about a timeline for this project, but this seems like an optional project goal to me.
On Sun, Mar 15, 2015 at 03:26:43PM -0700, Piotr Banaszkiewicz wrote:
We haven't gotten a lot of feedback, and I personally don't have access to a Windows machine for testing (and only occasionally have access to an OS X machine). But as far as I know, everything works smoothly on all of our supported OSes. Pull requests to trim down the script are welcome, but there's a lot going on here (it's basically a teensy package manager with both the code and the package tree all in one file).
Other instructors here in Krakow have Windows boxes so we can sit down one day and test this script as thoroughly as we can. If I have time, I will post updates to this script.
Thanks :). Please make them separate issues under wking/swc-setup-installation-test so we can stay focused on an aggregation server here.
- Match installation feedback with workshop requirements (maybe from Amy?) to see more stats. I believe we don't track what lessons are used for workshops (and therefore what software is required)?
I'm not sure what you're suggesting here. Can you give an example?
I believe we don't track workshop topics. For example, at my first and only workshop, 2015-02-21-Krakow, we had bash, python, git, sql. Is that information stored anywhere? (Apart from debriefing session).
I don't think we store that anywhere machine-parsable. If we go down this route, I'd suggest wking/swc-setup-installation-test#2 instead of involving amy. If it wants, amy can hit the workshop homepage after an event to suck in the data. But we should keep further discussion in wking/swc-setup-installation-test#2.
I'm closing this issue since the student application period is over.
Background
Software Carpentry has installation-test scripts so students can check that they've successfully installed any software required by their workshop. However, we don't collect the results of student tests, which makes a number of things more difficult than they need to be. Statistics about installed versions would make it easy to:
Approach
This project would:
I'm not particular about the web framework you use to write the server, but I have the most experience with Django and Flask. If you prefer a different framework, I'm fine with anything that takes care of the boilerplate and lets you focus on the high-level tasks.
Challenges
Designing and implementing a simple API for storing test results, error messages, diagnostic system information, etc. We want a robust, flexible system that's small and easy to maintain going forward.
Involved toolkits or projects
Degree of difficulty and needed skills
Any of these skills could be learned during the project, but you probably can't learn all of them during the project ;).
Involved developer communities
The Software Carpentry community primarily interacts via issues and pull requests on GitHub and the discuss@ mailing list. There's also an IRC channel.
Mentors
Acknowledgements
Thanks to @xuf12 for the initial idea behind this project.