spgroup / groundhog

A framework for crawling GitHub projects and raw data and extracting metrics from them
http://spgroup.github.io/groundhog
GNU General Public License v2.0

Command line arguments from json file #33

Open dnr2 opened 11 years ago

dnr2 commented 11 years ago

In issue #3 we started implementing the -in option, which lets the user specify a json file containing the names of the projects to be downloaded and analysed by Groundhog. Now we are going to improve this -in feature so that the json file provides not only the names of the projects, but also the main arguments that would otherwise be passed on the command line.

We will have to define the structure of the json file and decide which arguments it will contain. The current arguments (extracted from the Option class) are:

Option(name="-forge", usage="forge to be used in search and crawling process")
Option(name="-dest", usage="destination folder into which projects will be downloaded")
Option(name="-out", usage="output folder to metrics files")
Option(name="-datetime", usage="datetime of projects source code to be processed")
Option(name="-nprojects", usage="maximum number of projects to be downloaded and processed")
Option(name="-nthreads", usage="maximum number of concurrent threads")
Option(name="-o", usage="determine the output format of the metrics")
Arguments(usage="list of names of the projects to be downloaded and processed")

Therefore I believe that a good structure for the json file would be more or less like this:

{
    "forge": "github",
    "dest": "C:/groundhog/dest",
    "out": "C:/groundhog/metrics",
    "datetime": "2012-07-01_12_00",
    "nprojects": 30,
    "nthreads": 4,
    "outputformat": "csv",
    "search": [
        { "project":"rails" },
        { "project":"bootstrap" },
        { "username":"gustavopinto" }
    ]
}

P.S.: @rodrigoalvesvieira argued that it would be better to omit some arguments such as "nthreads" (only allow it in the command line itself) because this json file should only provide information concerning the projects and the searching, not the details of the computation, but we can discuss his point of view.
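
To make the proposed structure a bit more concrete, here is a minimal Java sketch of how the scalar values could be checked once they have been read from the file. It is only an illustration: the class InputFileSketch and its methods are hypothetical names, the datetime pattern is inferred from the sample value "2012-07-01_12_00", and treating forge as mandatory and the counts as strictly positive are assumptions rather than decisions made in this issue.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical holder for the scalar options of the proposed json file.
// Class, field and method names are illustrative, not taken from Groundhog.
class InputFileSketch {
    // Assumed layout of "datetime", inferred from the sample value "2012-07-01_12_00".
    static final DateTimeFormatter DATETIME_FORMAT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd_HH_mm");

    final String forge;
    final LocalDateTime datetime;
    final int nprojects;
    final int nthreads;

    InputFileSketch(String forge, String datetime, int nprojects, int nthreads) {
        // Assumption: forge is mandatory.
        if (forge == null || forge.isEmpty()) {
            throw new IllegalArgumentException("forge must not be empty");
        }
        this.forge = forge;
        this.datetime = parseDatetime(datetime);
        this.nprojects = requirePositive("nprojects", nprojects);
        this.nthreads = requirePositive("nthreads", nthreads);
    }

    // Turns the datetime string into a LocalDateTime or fails with a readable message.
    static LocalDateTime parseDatetime(String text) {
        if (text == null) {
            throw new IllegalArgumentException("datetime is missing");
        }
        try {
            return LocalDateTime.parse(text, DATETIME_FORMAT);
        } catch (DateTimeParseException e) {
            throw new IllegalArgumentException(
                    "datetime must look like 2012-07-01_12_00, got: " + text, e);
        }
    }

    // Assumption: counts such as nprojects and nthreads must be strictly positive.
    static int requirePositive(String name, int value) {
        if (value <= 0) {
            throw new IllegalArgumentException(name + " must be positive, got " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        InputFileSketch cfg = new InputFileSketch("github", "2012-07-01_12_00", 30, 4);
        System.out.println(cfg.forge + " " + cfg.datetime + " " + cfg.nprojects + " " + cfg.nthreads);
    }
}
```

If "nthreads" ends up being accepted only on the command line, the corresponding field would simply be dropped from such a holder.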

gustavopinto commented 11 years ago

Sorry, maybe I missed the point. Is this json the file that will be provided to java -jar groundhog.jar ... -in projects.json?

If so, will it be the user's responsibility to create this file? Or will groundhog somehow create it? I think it would be very difficult for me to create a file like this by hand, which could be a barrier to the adoption of groundhog.

:question:

dnr2 commented 11 years ago

Yes @gustavopinto, that was the initial idea. We thought that in the future, or even now, groundhog may require too many parameters to be passed on the command line, and that it would be tedious for the user to type every parameter each time they open a new console/terminal. In addition, some terminals limit the length of the command line [1] (although the original -in option already solved this). The json file would be a solution to these problems.

Another advantage is that whenever we want to create a new parameter that may take many arguments (like searching for projects by usernames) it would be easy to adapt this json file to the new requirements.

I also think that this json wouldn't be so difficult to create/understand (we could also provide a sample json file). Besides, the user will still be able to use the traditional command line parameters, so anyone who is not familiar with json or the -in input format will still be able to use groundhog.
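
How values from the json file and values typed on the command line would be combined is not settled in this issue. One option, shown here purely as an assumption, is to let the command line override the file. In the sketch below, OptionMergeSketch and merge are hypothetical names and the options are modelled as plain string maps:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: combine options read from the json file with options
// given on the command line. Names are illustrative, not Groundhog's code.
class OptionMergeSketch {
    // Assumed precedence: a value given on the command line replaces the one from the file.
    static Map<String, String> merge(Map<String, String> fromFile,
                                     Map<String, String> fromCommandLine) {
        Map<String, String> merged = new LinkedHashMap<>(fromFile); // copy, inputs stay untouched
        merged.putAll(fromCommandLine);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> fromFile = new LinkedHashMap<>();
        fromFile.put("forge", "github");
        fromFile.put("nthreads", "4");

        Map<String, String> fromCommandLine = new LinkedHashMap<>();
        fromCommandLine.put("nthreads", "8");

        System.out.println(merge(fromFile, fromCommandLine)); // prints {forge=github, nthreads=8}
    }
}
```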

But I understand that this format is not so user friendly. So we could change it or, maybe, discard this idea and close this issue. What do you guys think? @fernandocastor, @rodrigoalvesvieira

[1] http://askubuntu.com/questions/14081/what-is-the-maximum-length-of-command-line-arguments-in-gnome-terminal

fernandocastor commented 11 years ago

I think this is, at least currently, our best option to specify the search parameters. Your arguments just strengthened this impression, @dnr2. I don't think using json is an obstacle as it would be if we employed XML or a domain-specific language. As for the parameters, we should focus on the ones we already know and create our system so that it is extensible.

gustavopinto commented 11 years ago

hmm.. ok! great arguments! They really changed my opinion :wink:

gustavopinto commented 11 years ago

Have you already started implementing this issue?

rodrigoalvesvieira commented 11 years ago

he has https://github.com/spgroup/groundhog/tree/ft-metrics-output-csv

dnr2 commented 11 years ago

Not yet @gustavopinto, I normally assign myself to an issue whenever I start implementing it.

rodrigoalvesvieira commented 11 years ago

oops

gustavopinto commented 11 years ago

I changed the json format of the search attribute a bit.

{
    "forge": "github",
    "dest": "C:/groundhog/dest",
    "out": "C:/groundhog/metrics",
    "datetime": "2012-07-01_12_00",
    "nprojects": 30,
    "nthreads": 4,
    "outputformat": "csv",
    "search": {
        "projects": ["rails", "bootstrap"],
        "username":"gustavopinto"
    }
}

But I'm wondering whether projects and username are independent or related. For example, are rails and bootstrap projects created by gustavopinto? Or does this file mean that I want to download rails and bootstrap and also download all projects created by gustavopinto?

rodrigoalvesvieira commented 11 years ago

For me, it'd mean: "download the 'rails' and 'bootstrap' projects from the user 'gustavopinto'". Anything else looks very confusing to me.

gustavopinto commented 11 years ago

Ok. Another question: are projects and username required? If the user does not pass the projects attribute, will it download all projects created by 'gustavopinto'? Or will it simply not work?

Moreover, could I pass more than one username?

rodrigoalvesvieira commented 11 years ago

It would work. Adding both projects and username is just a way of narrowing the search (reducing the number of possible results). Providing only projects should download them regardless of the username, and providing only username should download all projects created by that user, as you mentioned.
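
A minimal Java sketch of these three cases, assuming a project is identified by an owner and a name. SearchClauseSketch, Project, Clause and matches are hypothetical names, and rejecting a clause that has neither attribute is an assumption, since that case has not been discussed here:

```java
import java.util.List;

// Hypothetical sketch of a single "search" entry; not Groundhog's actual code.
class SearchClauseSketch {
    static final class Project {
        final String owner;
        final String name;

        Project(String owner, String name) {
            this.owner = owner;
            this.name = name;
        }
    }

    static final class Clause {
        final List<String> projects; // empty list = "projects" attribute omitted
        final String username;       // null = "username" attribute omitted

        Clause(List<String> projects, String username) {
            // Assumption: a clause with neither attribute is rejected.
            if (projects.isEmpty() && username == null) {
                throw new IllegalArgumentException("search needs projects, username or both");
            }
            this.projects = projects;
            this.username = username;
        }
    }

    // Each attribute that is present narrows the search; an omitted one matches anything.
    static boolean matches(Clause clause, Project project) {
        boolean nameOk = clause.projects.isEmpty() || clause.projects.contains(project.name);
        boolean ownerOk = clause.username == null || clause.username.equals(project.owner);
        return nameOk && ownerOk;
    }

    public static void main(String[] args) {
        Project railsByGustavo = new Project("gustavopinto", "rails");
        Project railsBySomeoneElse = new Project("someone", "rails"); // made-up owner

        Clause both = new Clause(List.of("rails", "bootstrap"), "gustavopinto");
        Clause onlyProjects = new Clause(List.of("rails", "bootstrap"), null);
        Clause onlyUsername = new Clause(List.of(), "gustavopinto");

        System.out.println(matches(both, railsByGustavo));             // true
        System.out.println(matches(both, railsBySomeoneElse));         // false
        System.out.println(matches(onlyProjects, railsBySomeoneElse)); // true
        System.out.println(matches(onlyUsername, railsBySomeoneElse)); // false
    }
}
```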

dnr2 commented 11 years ago

Agreed! I think the same way as @rodrigoalvesvieira. Nevertheless, we should consider that the user may want to run different kinds of searches at once, e.g. I may want to search for both (projects related to: groundhog created by the user: gustavopinto) AND (projects related to: bootstrap created by the user: dnr2). We could provide this functionality by modifying the structure of the JSON (possibly creating an array of searches), but this may become a bit complicated for the user.

fernandocastor commented 11 years ago

I agree with @dnr2 that we should provide some kind of operator for users to specify both ANDs and ORs. To me, the simplest answer would be to think of username and projects as each specifying a set of projects, with multiple items within a single search clause always having AND semantics. For example:

"search": { "projects": ["rails", "bootstrap"], "username":"gustavopinto" }

would mean "download projects rails and bootstrap created by the user gustavopinto". What if we want to download every project by user gustavopinto and, at the same time, projects named rails and bootstrap? We could specify two different search clauses in the same JSON file:

"search": { "projects": ["rails", "bootstrap"] }

"search": { "username":"gustavopinto" }

This would have OR semantics instead of AND and would retrieve a considerably larger number of projects. A relevant question in this case is: what if there are multiple projects named "rails"? Do we download them all? Moreover, what other kinds of options are we interested in supporting? For example, do we need to support a search where the user wants to download only projects FORKED by the user "gustavopinto"? Would that be required to answer any of those RQs?
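
A minimal Java sketch of the AND-within-a-clause, OR-across-clauses idea. It assumes the clauses reach the program as a list (the array of searches mentioned above), because repeating the "search" key inside one JSON object is fragile: RFC 8259 says member names should be unique, and parsers differ in how they treat duplicates. SearchUnionSketch, clause and select are hypothetical names, and the owners alice, bob and carol are made up. In this sketch every project whose name matches is selected, which is just one possible answer to the "multiple projects named rails" question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of AND within a clause and OR across clauses; not Groundhog's code.
class SearchUnionSketch {
    static final class Project {
        final String owner;
        final String name;

        Project(String owner, String name) {
            this.owner = owner;
            this.name = name;
        }

        @Override
        public String toString() {
            return owner + "/" + name;
        }
    }

    // One clause: the conditions that are given are combined with AND.
    // An empty project list or a null username means "attribute omitted".
    static Predicate<Project> clause(List<String> projects, String username) {
        return p -> (projects.isEmpty() || projects.contains(p.name))
                && (username == null || username.equals(p.owner));
    }

    // Several clauses: a project is selected if at least one clause accepts it (OR).
    static List<Project> select(List<Project> candidates, List<Predicate<Project>> clauses) {
        List<Project> selected = new ArrayList<>();
        for (Project candidate : candidates) {
            // anyMatch stops at the first accepting clause, so no project is added twice.
            if (clauses.stream().anyMatch(c -> c.test(candidate))) {
                selected.add(candidate);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Project> candidates = List.of(
                new Project("alice", "rails"),
                new Project("bob", "bootstrap"),
                new Project("gustavopinto", "groundhog"),
                new Project("carol", "rails"),
                new Project("carol", "other"));

        List<Predicate<Project>> clauses = List.of(
                clause(List.of("rails", "bootstrap"), null), // projects named rails or bootstrap
                clause(List.of(), "gustavopinto"));          // everything by gustavopinto

        // prints [alice/rails, bob/bootstrap, gustavopinto/groundhog, carol/rails]
        System.out.println(select(candidates, clauses));
    }
}
```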

What do you think of this solution?

gustavopinto commented 11 years ago

The last commit enables groundhog to use AND and OR semantics (only through the json file). In the future we can add more parameters, such as is_fork or watchers.

rodrigoalvesvieira commented 11 years ago

whoa! :+1:

dnr2 commented 11 years ago

Cool!! =D

dnr2 commented 11 years ago

@gustavopinto, Is this issue already implemented?

If the answer is yes, then we should close it...

gustavopinto commented 11 years ago

This issue was labeled as 'continuous'. So it may change as groundhog evolves, and we should therefore keep it open.