OpenHub API Interfacing for Project Search

sailuh / kaiaulu

An R package for mining software repositories

http://itm0.shidler.hawaii.edu/kaiaulu

Mozilla Public License 2.0


OpenHub API Interfacing for Project Search #317

Open beydlern opened 1 month ago

beydlern commented 1 month ago

1. Purpose

OpenHub is a website that indexes open-source projects along with their project information (e.g. lines of code, contributors, etc.). The purpose of this task is to extend R/config.R with a collection of functions that interface with OpenHub's API, Ohloh, to help locate open-source projects for analysis.

2. Process

Create a collection of functions in R/config.R, where each function accesses one endpoint to grab items of project information (such as the number of lines of code). Create a notebook to demonstrate how to use these R/config.R Ohloh API interfacing functions to request information on an open-source project on OpenHub.

Checklist for Extractable Project Information

[x] name: The name of the project.
[x] id: The project's unique ID on OpenHub.
[x] primary_language: The primary code language used by the project.
[x] activity: The project's activity level (Very Low, Low, Moderate, High, and Very High).
[x] html_url: The project's URL on OpenHub.
[x] mailing_list: The project's mailing list URL (may be "N/A" if none is found; checking the project's html_url to verify is advised).
[x] min_month: OpenHub's first recorded year and month of the project's data (typically the date of the project's first commit, YYYY-MM format).
[x] twelve_month_contributor_count: The number of contributors who made at least one commit to the project source code in the past twelve months.
[x] total_contributor_count: The total number of contributors who made at least one commit to the project source code since the project's inception.
[x] twelve_month_commit_count: The total number of commits to the project source code in the past twelve months.
[x] total_commit_count: The total number of commits to the project source code since the project's inception.
[x] total_code_lines: The most recent total count of all source code lines.
[x] code_languages: A language breakdown with percentages for each substantial contributing language in the project's source code (as determined by OpenHub; languages that contribute less are grouped and renamed as "Other").

Example Endpoint Pathing

This specific comment in this issue thread details the endpoint pathing process to look for a specific project's analysis collection under an organization's portfolio projects, specified by project name (project names are unique in OpenHub).

3. Task List

[x] Apply for an API key for Ohloh API.
[x] Understand how to form a request to Ohloh API.
[x] Understand the response XML format after an HTTP GET request.
[x] Create the interfacing functions to extract the extractable project information and search for projects; Implement these functions in R/config.R.
[x] Create a notebook, vignettes/openhub_project_search.Rmd, to demonstrate how to use the new R/config.R functions that interface with Ohloh API to extract useful information about a project and search through OpenHub's database of projects to search for a project based on a set of filters for analysis.

Function Signatures

[x] openhub_api_organizations()
[x] openhub_api_portfolio_projects()
[x] openhub_api_projects()
[x] openhub_api_analyses()
[x] openhub_parse_organizations()
[x] openhub_parse_portfolio_projects()
[x] openhub_parse_projects()
[x] openhub_parse_analyses()
[x] openhub_api_iterate_pages()

carlosparadis commented 1 month ago

Hi @beydlern,

This is what I originally sent by e-mail a while back:

Add module to interface with OpenHub API to facilitate locating open source projects for studies. API details here. May complement with extracting information from GitHub hosted projects.

Here's an example of our project listed there: https://openhub.net/p/kaiaulu you can also take a look at others from our project config files like OpenSSL, etc.

This is a task I know less about, so part of the issue is to assess the viability of what we want as part of the task itself. For instance, there are a few things @rnkazman would like to know when considering a project (Rick feel free to chime in):

LOC on current date (so we know the size of the project)
Commits per month over the last year (or any time range available): So we know the project is still alive
Contributors per month: Good to contrast with LOC to know if this is a one person project
Language (so we know the language of the project)

What you want to do for Kaiaulu is try to create one function per endpoint to begin with. For example, if you look at R/github.R in Kaiaulu, you will see that even the docs of each function tell you what endpoint it is accessing. So, start by documenting in your specification whether you can get the information above (which is displayed on the interface of OpenHub), and afterwards any other information I did not consider (or you can point me to a PDF or page with all endpoints).

Remember github.R and jira.R both use APIs (I'd recommend github as you are using it as part of this project and can relate), so much of the code you may need is already there for you to use as an example. Reusing code logic will also automatically help you ensure consistency. mail.R is not an API (what @daomcgill is working on), so I'd recommend against using that as a reference.

Depending on your findings, we may also simply add a few more endpoints to github.R to collect some of this information. However, OpenHub is preferred because they can extract info beyond GitHub itself. Let me know if you have questions.

beydlern commented 1 month ago

@carlosparadis Before I can take an in depth look at the XML-formatted data, which is the response format after I make a project request, I must register for an API key. What should I put under the Application Name, Redirect URI, and Description sections of this API key request application?

carlosparadis commented 1 month ago

I'm not sure what it wants as redirect uri, but you can put app name as ics 496 kaiaulu. Description can be capstone class project.

beydlern commented 1 month ago

@carlosparadis

. . . if you can get the information above (which is displayed on the interface of OpenHub), and afterwards any other information I did not consider (or you can point me to a PDF or page with all endpoints).

From my understanding, the Ohloh API allows users with a valid API key to request XML-formatted data for a project through an HTTP GET request. The XML for a specific project contains an analysis section that holds general information about the project, such as the total LOC, the main language, and the number of contributors who made at least one commit in the last year. This analysis section comes with its own unique ID, id, that may be used to locate its children, the size_facts and activity_facts. The size_facts statistics provide monthly running totals (each month is given explicitly as YYYY-MM-DD) of LOC, commits, and developer effort, expressed as the cumulative total months of effort by all contributors on the project (man_months). The activity_facts statistics provide the changes to LOC, commits, and contributors per month (also given as YYYY-MM-DD).

LOC on current date (so we know the size of the project)

Commits per month over the last year (or any time range available): So we know the project is still alive

Contributors per month: Good to contrast with LOC to know if this is a one person project

Language (so we know the language of the project)

To my knowledge, as long as the OpenHub website has computed and stored these statistics in an analysis, these requests are possible with the Ohloh API. However, the analysis may be slightly out of date, because OpenHub needs time to compute it, and the latest month it has computed may be older than the current month. As stated in analysis.md in Ohloh's API documentation: max_month: The last month for which monthly historical statistics are available for this project. Depending on when this analysis was prepared, max_month usually refers to the current month, but it may be slightly older.
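The staleness described above could be checked by comparing max_month with the current date. Below is a minimal sketch, written in Python purely for illustration (the actual Kaiaulu functions are in R and are not shown in this thread); the helper name months_behind and the example dates are hypothetical.

```python
from datetime import date


def months_behind(max_month, today):
    """Whole months between an analysis max_month (YYYY-MM-DD) and today."""
    year, month, _day = (int(part) for part in max_month.split("-"))
    return (today.year - year) * 12 + (today.month - month)


# Example inputs only: a max_month two months older than the request date.
lag = months_behind("2024-08-01", date(2024, 10, 20))
print(lag)  # 2
```

A result of 0 means the analysis covers the current month; anything larger suggests the summary statistics lag behind the repository.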

carlosparadis commented 1 month ago

https://github.com/blackducksoftware/ohloh_api/blob/main/README.md#xml-response-format

I see, nice finding. So every file in the reference folder basically contains the format of the XML you will get if you go after that endpoint, is that right?

carlosparadis commented 1 month ago

I took a quick look at the wiki and I don't see an example file. Could you try retrieving the analysis XML for kaiaulu so we could take a look? It seems some of the XML responses are summary statistics coming out of this file too, so this may be all we need.

Unfortunately they delete old files:

An individual Analysis never changes. When a Project’s source code is modified, a completely new Analysis is generated for that Project. Eventually, old analyses are deleted from the database. Therefore, you should always obtain the ID of the best current analysis from the project record before requesting an analysis.

So in that sense, Ohloh API will never serve to comprehensively analyze a project history, but it does conveniently offer summary statistics.

I also want to remind you that our goal here is to survey "the sea of projects" for the criteria we want, rather than use Ohloh to analyze them on our behalf. For example, rather than analysis.md, what we may need is something like this:

https://openhub.net/orgs/apache/projects

https://github.com/blackducksoftware/ohloh_api/blob/main/reference/portfolio_projects.md

That gives us a list of projects.

Another thing that would be useful is knowing which type of issue tracker a given project uses: https://openhub.net/p/apache/links you can imagine if that was returned in the XML, we could parse the URL from issue tracker for certain words to find if it is bugzilla, JIRA, or GitHub, and then report that in a table for the user.

I do not know if OpenHub will let us search all the projects they index, or if we can only search at most per organization.

Could you check what else in OpenHub could give us a bird's-eye view of all the projects? We could still create a two-step pipeline to first obtain the names of the projects via one endpoint, and then make more API calls for the analysis to obtain the detailed information, although this would be less than ideal.

beydlern commented 1 month ago

@carlosparadis I was able to take a look at kaiaulu's project information. Here is the XML file data for the project:

<response>
  <status>success</status>
  <result>
<project>
  <id>760420</id>
  <name>kaiaulu</name>
  <url>https://openhub.net/p/kaiaulu.xml</url>
  <html_url>https://openhub.net/p/kaiaulu</html_url>
  <created_at>2021-09-27T02:32:26Z</created_at>
  <updated_at>2024-10-14T05:19:13Z</updated_at>
  <description>A data model for Software Engineering data analysis</description>
  <homepage_url>http://itm0.shidler.hawaii.edu/kaiaulu</homepage_url>
  <download_url>https://github.com/sailuh/kaiaulu</download_url>
  <url_name>kaiaulu</url_name>
  <vanity_url>kaiaulu</vanity_url>
  <medium_logo_url>https://s3.amazonaws.com/cloud.ohloh.net/attachments/94361/logo_med.png</medium_logo_url>
  <small_logo_url>https://s3.amazonaws.com/cloud.ohloh.net/attachments/94361/logo_small.png</small_logo_url>
  <user_count>0</user_count>
  <average_rating/>
  <rating_count>0</rating_count>
  <review_count>0</review_count>
  <analysis_id>207699501</analysis_id>
  <tags>
    <tag>code_analysis</tag>
    <tag>codemanagement</tag>
    <tag>mining-software-repositories</tag>
    <tag>socialnetwork</tag>
    <tag>softwareengineering</tag>
    <tag>static_analysis</tag>
  </tags>
<analysis>
  <id>207699501</id>
  <url>https://openhub.net/p/kaiaulu/analyses/207699501.xml</url>
  <project_id>760420</project_id>
  <updated_at>2024-10-14T05:19:13Z</updated_at>
  <oldest_code_set_time>2024-10-13T17:03:09Z</oldest_code_set_time>
  <min_month>2020-05-01</min_month>
  <max_month>2024-10-01</max_month>
  <twelve_month_contributor_count>6</twelve_month_contributor_count>
  <total_contributor_count>14</total_contributor_count>
  <twelve_month_commit_count>18</twelve_month_commit_count>
  <total_commit_count>186</total_commit_count>
  <total_code_lines>5085</total_code_lines>
  <factoids>
    <factoid type="FactoidCommentsVeryHigh">
Very well-commented source code    </factoid>
    <factoid type="FactoidAgeOld">
Well-established codebase    </factoid>
    <factoid type="FactoidTeamSizeAverage">
Average size development team    </factoid>
    <factoid type="FactoidActivityDecreasing">
Decreasing Y-O-Y development activity    </factoid>
  </factoids>
  <languages graph_url="https://openhub.net/p/kaiaulu/analyses/207699501/languages.png">
    <language percentage="100" color="198CE7" id="65">
R    </language>
  </languages>
  <main_language_id>65</main_language_id>
  <main_language_name>R</main_language_name>
</analysis>
  <similar_projects>
    <project>
      <id>360</id>
      <name>FindBugs</name>
      <vanity_url>findbugs</vanity_url>
    </project>
    <project>
      <id>712198</id>
      <name>Prospector (Python)</name>
      <vanity_url>landscapeio-prospector</vanity_url>
    </project>
    <project>
      <id>733309</id>
      <name>SpotBugs</name>
      <vanity_url>spotbugs</vanity_url>
    </project>
    <project>
      <id>1865</id>
      <name>GNU cflow</name>
      <vanity_url>cflow</vanity_url>
    </project>
  </similar_projects>
  <licenses>
  </licenses>
  <project_activity_index>
    <value>30</value>
    <description>Very Low</description>
  </project_activity_index>
</project>
  </result>
</response>
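To illustrate how the checklist fields map onto this response, here is a minimal parsing sketch. It is written in Python for illustration only and is not the R/config.R implementation; the function name parse_project is hypothetical, and SAMPLE_XML is a trimmed copy of the response above that keeps only the fields the sketch reads.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the kaiaulu response above (only the fields used here).
SAMPLE_XML = """<response>
  <status>success</status>
  <result>
    <project>
      <id>760420</id>
      <name>kaiaulu</name>
      <html_url>https://openhub.net/p/kaiaulu</html_url>
      <analysis>
        <min_month>2020-05-01</min_month>
        <twelve_month_contributor_count>6</twelve_month_contributor_count>
        <total_contributor_count>14</total_contributor_count>
        <twelve_month_commit_count>18</twelve_month_commit_count>
        <total_commit_count>186</total_commit_count>
        <total_code_lines>5085</total_code_lines>
        <languages>
          <language percentage="100" color="198CE7" id="65">
R    </language>
        </languages>
        <main_language_name>R</main_language_name>
      </analysis>
      <project_activity_index>
        <value>30</value>
        <description>Very Low</description>
      </project_activity_index>
    </project>
  </result>
</response>"""

INT_FIELDS = [
    "twelve_month_contributor_count",
    "total_contributor_count",
    "twelve_month_commit_count",
    "total_commit_count",
    "total_code_lines",
]


def parse_project(xml_text):
    """Flatten one project response into a dict of checklist fields."""
    root = ET.fromstring(xml_text)
    if root.findtext("status") != "success":
        raise ValueError("OpenHub request did not succeed")
    project = root.find("result/project")
    analysis = project.find("analysis")
    record = {
        "name": project.findtext("name"),
        "id": project.findtext("id"),
        "html_url": project.findtext("html_url"),
        "activity": project.findtext("project_activity_index/description"),
        "primary_language": analysis.findtext("main_language_name"),
        "min_month": analysis.findtext("min_month")[:7],  # keep YYYY-MM only
    }
    for field in INT_FIELDS:
        record[field] = int(analysis.findtext(field))
    # Language names are padded with whitespace in the response, so strip them.
    record["code_languages"] = {
        lang.text.strip(): int(lang.get("percentage"))
        for lang in analysis.findall("languages/language")
    }
    return record


record = parse_project(SAMPLE_XML)
print(record["total_code_lines"], record["code_languages"])
```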

On the topic of searching for projects, we are able to write a query to request a set of projects filtered by some specification: https://www.openhub.net/projects.xml?api_key={api_key}&page={number}&sort={keyword}&query={keyword} If the query returns multiple projects, we get a filtered collection response in XML format, from which we can read the name and id of each project. We can then request each project's analysis to get general information on it, but like you said, this is less than ideal. However, this allows us to search through ALL projects that OpenHub indexes.

For the issue trackers inquiry, if the project page has the links section, the project's XML response will also contain a links section. For example, the page, https://openhub.net/p/apache/links can be shown in the project's XML file:

<links>
  <link>
    <title>Current Release Docs</title>
    <url>http://httpd.apache.org/docs/current/</url>
    <category>Documentation</category>
  </link>
  <link>
    <title>Next release "coming soon" docs</title>
    <url>http://httpd.apache.org/docs/trunk/</url>
    <category>Documentation</category>
  </link>
  <link>
    <title>Apache Bugzilla</title>
    <url>https://issues.apache.org/bugzilla/</url>
    <category>Issue Trackers</category>
  </link>
  <link>
    <title>Bugzilla Search</title>
    <url>https://issues.apache.org/bugzilla/query.cgi</url>
    <category>Issue Trackers</category>
  </link>
</links>

We could parse through these links to see what issue trackers the project is using.
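A minimal sketch of that link parsing, in Python for illustration only. The function name detect_issue_trackers and the keyword table are assumptions (the thread only names Bugzilla, JIRA, and GitHub as trackers of interest); LINKS_XML is a shortened copy of the links section above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <links> section shown above.
LINKS_XML = """<links>
  <link>
    <title>Current Release Docs</title>
    <url>http://httpd.apache.org/docs/current/</url>
    <category>Documentation</category>
  </link>
  <link>
    <title>Apache Bugzilla</title>
    <url>https://issues.apache.org/bugzilla/</url>
    <category>Issue Trackers</category>
  </link>
</links>"""

# Assumed keyword-to-label table; extend it as other trackers come up.
TRACKER_KEYWORDS = {"bugzilla": "Bugzilla", "jira": "JIRA", "github.com": "GitHub"}


def detect_issue_trackers(links_xml):
    """Return the sorted tracker labels whose keyword appears in a tracker URL."""
    root = ET.fromstring(links_xml)
    found = set()
    for link in root.findall("link"):
        if link.findtext("category") != "Issue Trackers":
            continue  # skip documentation and other link categories
        url = (link.findtext("url") or "").lower()
        for keyword, label in TRACKER_KEYWORDS.items():
            if keyword in url:
                found.add(label)
    return sorted(found)


print(detect_issue_trackers(LINKS_XML))  # ['Bugzilla']
```

Matching on the URL rather than the title is a design choice here: titles are free text, while tracker URLs tend to contain the product name.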

carlosparadis commented 1 month ago

This is going in the right direction, thank you for the additional information!

In regards to what you said:

On the topic of searching for projects, we are able to write a query to request a set of projects filtered through some specification.

I looked at the URL and saw:

query - Results will be filtered by the provided string. Only items that contain the query string in their names or descriptions will be returned. Filtering is case insensitive. Only alphanumeric characters are accepted. All non-alphanumeric characters will be replaced by spaces. Filtering is not available on all API methods, and the searched text depends on the type of object requested. Check the reference documentation for specifics.

Could you check exactly what we can query for? It seems we can query by language across all of OpenHub; if so, this is already a great start. It is not the end of the world to do follow-up checks on other API endpoints if that is the only way forward. However, this then raises the question: let's say that our query returns 300 or so projects. For every project, in order for us to find whether or not the project uses JIRA, and also the other information I mentioned above (n contributors, LOC, etc.), how many API calls will that require per project?

Also, what was the limit again of API calls? And what was the time period? (per day it resets?)

beydlern commented 1 month ago

@carlosparadis

For project queries, the project reference documentation states:

query If supplied, only Projects matching the query string will be returned. A Project matches if its name, description, or any of its tags contain the query string.

I believe that we can query for anything, to be specific, the query string acts as a search pattern and the Ohloh API searches through every tag to check if the query string is contained.

When the Ohloh API returns the XML data for a list of projects (if the query returns projects), it returns a maximum of ten per page (through personal testing) and it also lists the total number of items (projects) available.

An example of querying a list of projects with the query string bugzilla (the result information, i.e. the list of projects, is not shown here as it's too long):

<status>success</status>
<items_returned>10</items_returned>
<items_available>80</items_available>
<first_item_position>0</first_item_position>

According to the documentation:

page - In most cases, the Ohloh API returns at most 25 items per request. Pass this parameter to request subsequent items beyond the first page. This parameter is one-based, with a default value of 1. If you pass a value outside the range of available pages, you will receive the first page.

The next sets of ten projects are listed on the following pages. To get the number of pages we need to iterate through for the rest of the projects' XML data, we divide the items_available value by the (non-zero) items_returned value and take the ceiling.
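The page arithmetic can be sketched as follows, in Python for illustration only; page_count is a hypothetical helper name, and the inputs are the items_available and items_returned values from the bugzilla query above.

```python
import math


def page_count(items_available, items_returned):
    """Number of pages needed, given the counts reported on the first page."""
    if items_returned <= 0:
        return 0  # nothing was returned, so there is nothing to page through
    return math.ceil(items_available / items_returned)


pages = page_count(80, 10)
print(pages)  # 8
# The page parameter is one-based, so the pages to request are 1..pages.
print(list(range(1, pages + 1)))
```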

Let's say that our query returns 300 or so projects. For every project, in order for us to find if the project is or not jira, and also the other information i said above (n contributors, LOC, etc), how many API calls will that require per project?

In this case, with ten projects listed per page, it would take 330 API calls: 30 calls to page through the project list (300 projects at ten per page is 30 pages), plus one call per project, using its analysis_id to grab its corresponding analysis and extract the project information (LOC, number of contributors, etc.).

Also, what was the limit again of API calls? And what was the time period? (per day it resets?)

The number of API calls a user can make per API key is 1000, and this resets every 24 hours.
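The call budget above can be worked out with a short sketch, in Python for illustration only. The helper name api_calls_needed is hypothetical; the per-page size of ten and the limit of 1000 calls per key per 24 hours are the figures reported in this thread.

```python
import math

DAILY_LIMIT = 1000  # API calls per key per 24 hours, as reported above


def api_calls_needed(n_projects, per_page=10, calls_per_project=1):
    """Listing calls (one per page) plus follow-up calls for each project."""
    listing_calls = math.ceil(n_projects / per_page)
    return listing_calls + n_projects * calls_per_project


needed = api_calls_needed(300)
print(needed)  # 330
print(needed <= DAILY_LIMIT)  # True
```

With one follow-up call per project, the listing calls add roughly 10% on top of the per-project calls, so the per-project calls dominate the daily budget.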

carlosparadis commented 1 month ago

When you say we could query anything, would we be able to create, for example, a query that asks for LOC >= 50k? And would we be able to add "and" conditions, e.g. LOC > 50k & n.contributors >= 20? I'm equally curious whether we can also add language to the query. If you could put the longer version of the file, in a single page, on the shared drive and email me the URL, it would help me understand a bit further. I am still slightly confused about the query for all projects.

beydlern commented 1 month ago

@carlosparadis My mistake, I wasn't clear about what I meant by "anything". I meant that when we search for projects, the query string can be "any string pattern", and it is searched for in the properties of each project. The number of properties we can search through is limited, unless I also make an API call for each project to open its analysis child and search through its properties, which is where the LOC and number of contributors are found. However, this would quickly add up in the number of API calls.

There is no extra functionality with the query collection request parameter, so there is no Boolean logic nor mathematical relationships. To clarify, when searching through projects, the query command takes a query string, which is just an alphanumeric string:

query If supplied, only Projects matching the query string will be returned. A Project matches if its name, description, or any of its tags contain the query string.
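Since the query parameter offers no Boolean or numeric conditions, filters such as LOC >= 50k and contributors >= 20 would have to be applied client-side after the analyses are downloaded. A minimal sketch, in Python for illustration only: filter_projects is a hypothetical helper, and the two records reuse values from the kaiaulu and Apache Tomcat analysis responses quoted elsewhere in this thread.

```python
def filter_projects(records, min_code_lines=0, min_contributors=0, language=None):
    """Return names of projects that satisfy every condition (logical AND)."""
    selected = []
    for record in records:
        if record["total_code_lines"] < min_code_lines:
            continue
        if record["total_contributor_count"] < min_contributors:
            continue
        if language is not None and record["main_language_name"].lower() != language.lower():
            continue
        selected.append(record["name"])
    return selected


# Values copied from the analysis responses quoted in this thread.
ANALYSES = [
    {"name": "kaiaulu", "total_code_lines": 5085,
     "total_contributor_count": 14, "main_language_name": "R"},
    {"name": "Apache Tomcat", "total_code_lines": 474323,
     "total_contributor_count": 181, "main_language_name": "Java"},
]

print(filter_projects(ANALYSES, min_code_lines=50000, min_contributors=20))
# ['Apache Tomcat']
```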

carlosparadis commented 1 month ago

@beydlern I looked at the XML you sent, thanks! Did you query the name property? I ask because I see only "bugzilla" named projects in it. Would you be able to send me an example where the query is for a project that is written in Java?

I think at the very least you should start with the organization one: https://github.com/blackducksoftware/ohloh_api/blob/main/reference/organization.md

And try on Apache Software Foundation.

The analysis endpoint also seems promising.

For the pagination, you can take a look at the GitHub and JIRA downloaders; I believe both implement similar logic. Might as well reuse it for consistency. If you could send me an example of both XML responses, that would be great.

Just remember what type of information we are after in our search, and consider how we can get there via endpoints.

beydlern commented 4 weeks ago

@carlosparadis

query If supplied, only Projects matching the query string will be returned. A Project matches if its name, description, or any of its tags contain the query string.

It looks like we can't query the name tag/property specifically, or search for a pattern in any one tag specifically. External code (in config.R) may be needed to complement the "bugzilla" query search by looking at each <name> tag.

Would you be able to send me something that the query is a project that is written in java? ... And try on Apache Software Foundation.

Example: To get a project with its primary language as Java, starting with a given organization, "Apache Software Foundation", I request the organization's XML data, viewing its portfolio projects, to get the <detailed_page_url> field: https://openhub.net/orgs/apache.xml?api_key={api_key}&view=portfolio_projects

...
<detailed_page_url>/orgs/apache/projects</detailed_page_url>
...

Another request for this new page url's XML data will give us a paginated list of portfolio projects belonging to the organization (a sample with one project returned): https://openhub.net/orgs/apache/projects.xml?api_key={api_key}

<response>
  <status>success</status>
  <items_returned>20</items_returned>
  <items_available>320</items_available>
  <first_item_position>0</first_item_position>
  <result>
    <portfolio_projects>
      <project>
        <name>Apache Tomcat</name>
        <activity>High </activity>
        <primary_language>java</primary_language>
        <i_use_this>1684</i_use_this>
        <community_rating>4.2</community_rating>
        <twelve_mo_activity_and_year_on_year_change>
          <commits>1059</commits>
          <change_in_commits>-35</change_in_commits>
          <percentage_change_in_commits>3</percentage_change_in_commits>
          <contributors>24</contributors>
          <change_in_contributors>-14</change_in_contributors>
          <percentage_change_in_committers>36</percentage_change_in_committers>
        </twelve_mo_activity_and_year_on_year_change>
      </project>
      ...
    </portfolio_projects>
  </result>
</response>
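Because the portfolio projects response cannot be queried or sorted server-side, the language filter has to run on the downloaded XML. A minimal sketch, in Python for illustration only: projects_by_language is a hypothetical helper, PORTFOLIO_XML is a shortened copy of the response above, and its second project is invented to show a non-matching entry.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response above; the second project is hypothetical.
PORTFOLIO_XML = """<response>
  <status>success</status>
  <result>
    <portfolio_projects>
      <project>
        <name>Apache Tomcat</name>
        <activity>High </activity>
        <primary_language>java</primary_language>
      </project>
      <project>
        <name>Hypothetical C Project</name>
        <activity>Low </activity>
        <primary_language>c</primary_language>
      </project>
    </portfolio_projects>
  </result>
</response>"""


def projects_by_language(xml_text, language):
    """Names of portfolio projects whose primary_language matches, ignoring case."""
    root = ET.fromstring(xml_text)
    wanted = language.strip().lower()
    names = []
    for project in root.findall("result/portfolio_projects/project"):
        primary = (project.findtext("primary_language") or "").strip().lower()
        if primary == wanted:
            names.append(project.findtext("name").strip())
    return names


print(projects_by_language(PORTFOLIO_XML, "Java"))  # ['Apache Tomcat']
```

Lower-casing both sides handles the "Java" versus "java" spelling difference between endpoints.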

The portfolio projects entity doesn't allow collection request commands, such as queries or sorting, so external code is needed to read each project and find the desired information. In this example, external code (in config.R) cycles through each page to find a project that contains the string "Java" or "java" in its <primary_language> tag (there is no external code yet, this is just an example; the pagination logic needed for this is found in the GitHub and JIRA downloaders). For further analysis, we will copy the selected project's <name> tag too, then go to the global projects paginated list in XML format and query for the name of the project (querying for the name will also return every project whose name, tags, or description contain either the word "Apache" or "Tomcat", which is why there were 38 projects returned): https://openhub.net/p.xml?api_key={api_key}&query=Apache%20Tomcat

<response>
  <status>success</status>
  <items_returned>10</items_returned>
  <items_available>38</items_available>
  <first_item_position>0</first_item_position>
  <result>
<project>
  <id>3562</id>
  <name>Apache Tomcat</name>
  <url>https://openhub.net/p/tomcat.xml</url>
  <html_url>https://openhub.net/p/tomcat</html_url>
  <created_at>2006-11-12T20:40:37Z</created_at>
  <updated_at>2024-10-20T08:17:44Z</updated_at>
  <description>The Apache Tomcat software is an open source implementation of the Java Servlet, JavaServer Pages, Java Expression Language and Java WebSocket technologies.</description>
  <homepage_url>http://tomcat.apache.org/</homepage_url>
  <download_url>http://tomcat.apache.org/download-60.cgi</download_url>
  <url_name>tomcat</url_name>
  <vanity_url>tomcat</vanity_url>
  <medium_logo_url>https://s3.amazonaws.com/cloud.ohloh.net/attachments/831/tomcat_med.png</medium_logo_url>
  <small_logo_url>https://s3.amazonaws.com/cloud.ohloh.net/attachments/831/tomcat_small.png</small_logo_url>
  <user_count>1684</user_count>
  <average_rating>4.23101</average_rating>
  <rating_count>316</rating_count>
  <review_count>4</review_count>
  <analysis_id>208382336</analysis_id>
  <tags>
    ...
  </tags>
  <similar_projects>
    ...
  </similar_projects>
  <licenses>
    ...
  </licenses>
  <project_activity_index>
    ...
  </project_activity_index>
  <links>
    ...
  </links>
</project>
...
  </result>
</response>
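Since the query also matches descriptions and tags, the result list has to be narrowed to the exact project name before its id is used. A minimal sketch, in Python for illustration only: find_project_id is a hypothetical helper, SEARCH_XML is a shortened copy of the response above, and its second project is invented to stand in for the other partial matches.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response above; the second project is hypothetical.
SEARCH_XML = """<response>
  <status>success</status>
  <result>
    <project>
      <id>3562</id>
      <name>Apache Tomcat</name>
      <analysis_id>208382336</analysis_id>
    </project>
    <project>
      <id>0</id>
      <name>Hypothetical Tomcat Plugin</name>
    </project>
  </result>
</response>"""


def find_project_id(xml_text, name):
    """Return the id of the project whose <name> equals name, or None."""
    root = ET.fromstring(xml_text)
    for project in root.findall("result/project"):
        if project.findtext("name") == name:
            return project.findtext("id")
    return None  # not on this page; the caller may need to fetch the next one


print(find_project_id(SEARCH_XML, "Apache Tomcat"))  # 3562
```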

Every project name is unique in OpenHub, so once we find a matching <name> tag, we may take its <id>, its unique project id, to find its latest analysis https://openhub.net/p/3562/analyses/latest.xml?api_key={api_key}:

<response>
  <status>success</status>
  <result>
<analysis>
  <id>208382336</id>
  <url>https://openhub.net/p/tomcat/analyses/208382336.xml</url>
  <project_id>3562</project_id>
  <updated_at>2024-10-20T08:17:44Z</updated_at>
  <oldest_code_set_time>2024-10-19T15:16:40Z</oldest_code_set_time>
  <min_month>2006-03-01</min_month>
  <max_month>2024-10-01</max_month>
  <twelve_month_contributor_count>24</twelve_month_contributor_count>
  <total_contributor_count>181</total_contributor_count>
  <twelve_month_commit_count>1059</twelve_month_commit_count>
  <total_commit_count>26695</total_commit_count>
  <total_code_lines>474323</total_code_lines>
  <factoids>
    <factoid type="FactoidAgeVeryOld">
Mature, well-established codebase    </factoid>
    <factoid type="FactoidTeamSizeLarge">
Large, active development team    </factoid>
    <factoid type="FactoidCommentsAverage">
Average number of code comments    </factoid>
    <factoid type="FactoidActivityStable">
Stable Y-O-Y development activity    </factoid>
  </factoids>
  <languages graph_url="https://openhub.net/p/tomcat/analyses/208382336/languages.png">
    <language percentage="82" color="9A63AD" id="5">
Java    </language>
    <language percentage="9" color="555555" id="3">
XML    </language>
    <language percentage="7" color="556677" id="35">
XML Schema    </language>
    <language percentage="2" color="000000" id="">
10 Other    </language>
  </languages>
  <main_language_id>5</main_language_id>
  <main_language_name>Java</main_language_name>
</analysis>
  </result>
</response>

This path from endpoint to endpoint allows us to get all the relevant information on a project in at least 4 API calls. This example focused on finding a project in a specified organization whose primary language is Java. How does this approach look?
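The four requests in this path can be summarized as URL builders. This is a minimal sketch in Python for illustration only, not the R/config.R functions; the helper names are hypothetical, while the URL shapes are the ones used in the example above. No request is sent.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://openhub.net"


def _with_params(path, api_key, **params):
    """Append api_key plus any extra query parameters to an endpoint path."""
    query = urlencode({"api_key": api_key, **params}, quote_via=quote)
    return f"{BASE_URL}{path}?{query}"


def organization_url(org, api_key):
    # Step 1: organization record, viewing its portfolio projects.
    return _with_params(f"/orgs/{org}.xml", api_key, view="portfolio_projects")


def portfolio_projects_url(org, api_key):
    # Step 2: paginated list of the organization's portfolio projects.
    return _with_params(f"/orgs/{org}/projects.xml", api_key)


def project_search_url(name, api_key):
    # Step 3: global project search by name, to obtain the project id.
    return _with_params("/p.xml", api_key, query=name)


def latest_analysis_url(project_id, api_key):
    # Step 4: latest analysis for the project id found in step 3.
    return _with_params(f"/p/{project_id}/analyses/latest.xml", api_key)


for url in (
    organization_url("apache", "KEY"),
    portfolio_projects_url("apache", "KEY"),
    project_search_url("Apache Tomcat", "KEY"),
    latest_analysis_url(3562, "KEY"),
):
    print(url)
```

Here "KEY" is a placeholder for a real API key.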

carlosparadis commented 3 weeks ago

It is not clear to me which endpoints you are actually using in your responses above. Could you edit the message to make that clearer (pointing to the .md documentation), and then post a comment to let me know you edited it?

beydlern commented 3 weeks ago

@carlosparadis I edited the message for clarity. Each URL has turned into a link that points to its respective .md documentation page in Ohloh API.

carlosparadis commented 3 weeks ago

Thanks this makes a lot more sense to me now.

The portfolio projects entity doesn't allow collection request commands, such as queries or sorting, so external code may be necessary to read each project to find the desired information, and in this case external code is needed: External code (in config.R) is needed (there is no external code yet, this is just an example, and the pagination logic needed for this is found in the GitHub and JIRA downloaders) in this example to cycle through each page to find a project that contains the string "Java" or "java" in a project's <primary_language> tag. For further analysis, we will copy the selected project's <name> tag too, where we will then go to the global projects paginated list in XML format and query for the name of the project (querying for the name of the project will also return every project if that project has any tag or description that has either the word "Apache" or "Tomcat", which is why there were 38 projects returned):

Is there no way to go from the project name from the portfolio, straight into its analysis instead? Having to do a global search for the project seems redundant.

Also:

For further analysis, we will copy the selected project's <name> tag too, where we will then go to the global projects paginated list in XML format and query for the name of the project (querying for the name of the project will also return every project if that project has any tag or description that has either the word "Apache" or "Tomcat", which is why there were 38 projects returned): https://openhub.net/p.xml?api_key={api_key}&query=Apache%20Tomcat

Do we need to perform this global search? Can't we use the name of the project from the organization search on the project.md endpoint to retrieve it? https://github.com/blackducksoftware/ohloh_api/blob/main/reference/project.md

I guess one thing I am still confused about is what in the .md says what you can or cannot query. Is it just doing a ctrl+f on the entire output against whatever you query? I don't understand yet how you are specifying a tag.

All things considered above, you can go ahead and start the function for the organization search, and include a parameter where you can specify the language we are looking for. The notebook can exemplify Apache, since that is often studied. Remember to update the specification too on the first issue.

Do let me know about the two questions in this comment so we can sort out the final path here, but at least that lets you get started on the code. I'd recommend using R/github.R as a reference on how to do the pagination. Try to reuse the code as much as possible so it stays consistent with everything else.

Thanks!

beydlern commented 3 weeks ago

@carlosparadis

Is there no way to go from the project name from the portfolio, straight into its analysis instead? Having to do a global search for the project seems redundant.

To get the latest analysis collection for a project (the latest analysis is the current best analysis collection for a single project), you need its project_id tag (called id in the project collection), which is only found in the project collection. A good unique key that portfolio_projects and project both have is the name tag, which allows me to jump from the portfolio_projects endpoint to the project endpoint, get the project_id tag, and access the latest analysis collection for that project: https://www.openhub.net/projects/{project_id}/analyses/latest.xml.

Do we need to perform this global search? Can't we use the name of the project from the organization search on the project.md endpoint to retrieve it? https://github.com/blackducksoftware/ohloh_api/blob/main/reference/project.md

To access a specific project's collection, we need to specify its project_id: https://www.openhub.net/projects/{project_id}.xml

So we must perform this global search, but the term "global" may be misleading, because the number of API calls needed on the global project list is almost always 1 (from personal experimentation, each request to the global project list returns ten projects). Querying by the project's name almost always puts the correct project on the first page of the returned list, which gives us its project_id tag.
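A minimal sketch of the endpoint chain described above, written in Python purely for illustration (it is not the R/config.R implementation). The function names are hypothetical, and both the `page` parameter and appending `api_key` to the project and analysis URLs are assumptions modeled on the search URL quoted earlier in this thread.

```python
from urllib.parse import quote, urlencode

BASE = "https://www.openhub.net"


def project_search_url(api_key, name, page=1):
    """URL of the global project list, narrowed by the `query` parameter."""
    # quote_via=quote encodes spaces as %20, matching the URL in the thread.
    params = urlencode(
        {"api_key": api_key, "query": name, "page": page}, quote_via=quote
    )
    return f"{BASE}/p.xml?{params}"


def project_url(api_key, project_id):
    """URL of a single project's collection (assumes api_key is required)."""
    params = urlencode({"api_key": api_key})
    return f"{BASE}/projects/{project_id}.xml?{params}"


def latest_analysis_url(api_key, project_id):
    """URL of the latest analysis collection for a project."""
    params = urlencode({"api_key": api_key})
    return f"{BASE}/projects/{project_id}/analyses/latest.xml?{params}"


# Step 1: search by name. Step 2: read project_id from the response.
# Step 3: request the latest analysis with that project_id.
print(project_search_url("YOUR_API_KEY", "Apache Tomcat"))
```

The point of the chain is that the name is the only key shared by portfolio_projects and project, so the search step exists solely to recover project_id.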

One thing I am still confused about is what the .md says you can and cannot query. Is it just doing a ctrl+f on the entire output against whatever you query? I don't understand yet how you are specifying a tag.

From the project collection's supported query request, it states:

query If supplied, only Projects matching the query string will be returned. A Project matches if its name, description, or any of its tags contain the query string.

Based on this and my experience with the query parameter, it seems to just do a ctrl+f on the entire output. I'm not specifying a particular tag to query; instead, I plan to write external code that paginates and "actually queries" for the tag I am looking for. Their query parameter is helpful for narrowing down the list of matches, and my external code will then search this narrowed list for the tag and information I am interested in.
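The "external code" step can be sketched as follows, in Python for illustration only. The response layout (`project` elements with `id` and `name` children) is an assumption, the sample XML and its ids are made up, and `find_project_id` is a hypothetical name rather than a Kaiaulu function.

```python
import xml.etree.ElementTree as ET


def find_project_id(xml_text, project_name):
    """Return the id of the project whose name matches exactly, else None.

    The server-side query matches name, description, or tags, so the
    narrowed list still has to be filtered on the name locally.
    """
    root = ET.fromstring(xml_text)
    wanted = project_name.strip().lower()
    for project in root.iter("project"):
        name = project.findtext("name", default="")
        if name.strip().lower() == wanted:
            return project.findtext("id")
    return None


# Hypothetical response: both projects match the query "Apache Tomcat",
# but only one has that exact name.
SAMPLE = """<response>
  <result>
    <project><id>1</id><name>Apache Tomcat Maven Plugin</name></project>
    <project><id>2</id><name>Apache Tomcat</name></project>
  </result>
</response>"""

print(find_project_id(SAMPLE, "Apache Tomcat"))
```

Matching on the exact name (case-insensitively here, which is a design choice) avoids picking up projects that merely mention the name in a tag or description.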

I'll get started on the code!

carlosparadis commented 3 weeks ago

Sounds good, thank you for confirming the strange ctrl+f mechanism it uses. I would 100% document this on the function that deals with this endpoint. Your last comment also has much of what I think should be on the notebook, as you explain why you are calling the functions in that order.

Please remember to update your issue specification with the function signatures, and a bit of the summarized rationale of your comment above (I would also put a reference to this particular comment, since that was the one that did the trick to help me make sense).

Remember, the questions I asked here are likely some of the questions someone reading the notebook will have, so you can use them as guidance on what to put in the Notebook (if it addresses all my questions, then it is off to a good start!). It should also help you consider what should go in the function documentation.

In one of the calls we can consider what else OpenHub offers, but for now at least we can search per organization, as I expect Apache will be heavily utilized because its projects often use JIRA as their issue tracker (which in turn means bugs are documented).

Nice work!

beydlern commented 2 weeks ago

@carlosparadis Did you want me to create a configuration file or edit an existing one (e.g. kaiaulu.yml) to add in my configuration variables from the notebook, openhub_project_search.Rmd (the code language to search for, language, and the organization name, organization_name)?

If a configuration file is to be used, appropriate getter functions may be needed. Let me know if this is something I should also do, and on which issue I should do it.

carlosparadis commented 2 weeks ago

I think the first thing is discussing with me what the config should look like. The configs I think already have an openhub section (or maybe not)? If so we need to revise that first. All the config file formats should be updated accordingly, which, this time around, will only affect the get() functions being added or edited, rather than the notebooks, right?

I am hoping the m1 merge for that will get done tomorrow. Sorry it has taken so long.

crepesAlot commented 2 weeks ago

@beydlern mentioned me chiming in for the formatting so here is my suggestion. The current format in the config file is:

project:
  website: https://thrift.apache.org
  openhub: https://www.openhub.net/p/thrift

Suggested new format:

project_url: https://thrift.apache.org

api:
  openhub:
    # URL of project on OpenHub 
    openhub_url: https://www.openhub.net/p/thrift
    # Name of the organization
    organization_name: temp_name
    # What language to filter for
    language: java

Moved the project url to be separate from the openhub section as that isn't specific only to openhub.
Made a broader api section so that it would be easier if any future api might need specifications in the config files.

These are my suggestions, not sure if Nick would want to use them as is, or if he'll be making further modifications to suit the purpose of the new fields.

carlosparadis commented 2 weeks ago

This comment is a good learning opportunity: Not everything that is a parameter in the Notebook goes to the config file. We have to account for what is the "granularity" of the config (which is a project), and also whether said information can be obtained automatically or typed manually (which is more human time consuming).

The openhub_url for example is consistent to the project granularity, and it is something you have to specify as a starting point. The language, however, is something we could ask Kaiaulu to get from OpenHub API, right? I would suspect that the organization name is something we also can infer.

Some endpoints that we care a lot about are inconsistent to the granularity of a project configuration file, for example, the organization being Apache. This means the parameter will be hardcoded in the notebook (in a code block right below the config loads so it is easier for people to find). Maybe in the future, if Kaiaulu supports various organization level analysis we could consider a section for config, or an entire new config. There is a trade-off in this decision on the exec scripts, since they generally should take as input the config file rather than additional parameters. Which is to say, I expect this notebook to not have an exec/.

In this case, at least, this somewhat makes sense: unlike a downloader, which we would want to leave running on a server to download data, it strikes me as a bit odd that someone would use the OpenHub API that way. In Kaiaulu's case, the OpenHub API is there to help us select the project, but nothing beyond that.

Let me know if this makes sense.
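To illustrate the split between project-level config and notebook-level parameters, here is a minimal Python sketch (not Kaiaulu's R getters). The nested layout follows the format suggested earlier in this thread, which has not been confirmed as final, and `get_openhub_url` is a hypothetical name.

```python
# Parsed project configuration, as a plain dict for illustration.
# Only openhub_url is kept, since it matches the project granularity.
config = {
    "project_url": "https://thrift.apache.org",
    "api": {
        "openhub": {
            "openhub_url": "https://www.openhub.net/p/thrift",
        },
    },
}


def get_openhub_url(config):
    """Getter that hides the config layout from notebooks."""
    return config["api"]["openhub"]["openhub_url"]


# Organization-level parameters do not match the project granularity,
# so they are hardcoded in the notebook right below the config load.
organization_name = "Apache"
language = "java"

print(get_openhub_url(config), organization_name, language)
```

If the config layout changes later, only the getter body needs editing, while the notebook code that calls it stays the same.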

crepesAlot commented 2 weeks ago

That makes sense. I take it this would be similar to how Dao previously removed the month_year parameter for her mailing list from the config.

carlosparadis commented 2 weeks ago

Precisely. A user could also reasonably calculate the time range programmatically from @daomcgill's filenames. That said, Kaiaulu does need an analysis config at some point, since one project can have multiple analyses, and I have seen people having to duplicate the project config several times because of that.

For example, sometimes it is interesting to analyze a project over multiple releases, or the same project using different parameters in the tool. The project is the same; it is the analysis that differs. But, all in all, resolving the config format (your m1), the storage (your m1), the creation of said folders (your m2), and how the tables relate (@RavenMarQ m2) was needed to even begin thinking about what that would look like. I expect the Spring group will have to "refactor" that analysis config.

The good news is that they will only have to edit get functions instead of all notebooks thanks to your efforts ;)

beydlern commented 2 weeks ago

@carlosparadis The openhub_project_search.Rmd notebook has been updated, and it takes inspiration from our comment thread (mainly this comment). Because the language and organization_name parameters in my notebook don't match the granularity of the configuration files (project-level), I'm leaving these hardcoded in the notebook, right?

carlosparadis commented 2 weeks ago

Yes. Try to write the notebook as a story you would tell in a blog post. You are someone about to do an analysis, and you want to find which projects to consider. You choose Apache, because its projects often use JIRA, which contains bug data. Now you want to find the questions Rick is interested in (did you e-mail him to ask?).

Then you explain how using each function of the API you can answer them.

beydlern commented 1 week ago

Rick's response:

I see that you already have some criteria that you have considered:

· LOC on current date (so we know the size of the project)
· Commits per month over the last year (or any time range available): So we know the project is still alive
· Contributors per month: Good to contrast with LOC to know if this is a one person project
· Language (so we know the language of the project)

In addition I would like to know:

· the languages used, and in what percentages of the total LOC
· the age of the project
· whether they have a mailing list (or lists) and how active these lists are

Some "wish list" items, that are probably not available on openhub, are things like whether they label their commits with issue IDs, whether they have some interesting issue types such as "bug", "feature", "security bug", "refactoring"

beydlern commented 1 week ago

@carlosparadis I added the checklist for Rick's desired extractable information for the projects. I don't see an endpoint that would inform me about the mailing list(s) for the projects. In his wishlist, he asks if it can be known if a project labels their commits with issue IDs, could this be confirmed if the project uses the JIRA issue tracker?

carlosparadis commented 1 week ago

@beydlern extract the project links, and note that if a mailing list is not found that way, the user should still check the website to be certain. Here's an example from the Apache Spark project, which lists its mailing lists as a URL: https://openhub.net/p/apache-spark.

The way to check the issue ids requires you to parse the project's code git log. Then you can use this on the resulting table: http://itm0.shidler.hawaii.edu/kaiaulu/reference/commit_message_id_coverage.html

See this notebook for example usage: http://itm0.shidler.hawaii.edu/kaiaulu/articles/bug_count.html#identifying-issue-ids-in-commit-messages

This also uses the issue ID regex written in the project configuration file. The user will need to work out that regex manually from the git log, if any convention can be found, and then specify it in the Kaiaulu config so Kaiaulu can calculate the metric. There is no other way to automate this, since the conventions vary across projects, if they are used at all.
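As a rough illustration of what such a coverage metric measures, here is a short Python sketch. It is not Kaiaulu's commit_message_id_coverage implementation; the function name, the commit messages, and the `THRIFT-\d+` regex are hypothetical examples of a convention a user might find in a git log.

```python
import re


def issue_id_coverage(commit_messages, issue_id_regex):
    """Fraction of commit messages containing a match for the regex."""
    if not commit_messages:
        return 0.0
    pattern = re.compile(issue_id_regex)
    matched = sum(1 for message in commit_messages if pattern.search(message))
    return matched / len(commit_messages)


# Made-up commit messages; two of the four carry an issue ID.
messages = [
    "THRIFT-1234 Fix buffer overflow",
    "Update README",
    "THRIFT-42: add test",
    "typo",
]

print(issue_id_coverage(messages, r"THRIFT-\d+"))
```

A low fraction suggests either that the project does not label commits with issue IDs or that the regex does not match the project's convention, which is why the regex has to be checked against the git log by hand.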

beydlern commented 1 week ago

@carlosparadis I updated the checklist in my issue specification comment: I added the code languages with their percentages, the age of the project, and the mailing list information to the notebook and the implementation in config.R. Just to confirm, would checking whether commits have issue ids be in the scope of this issue or should a note be made in the notebook informing the user to parse the project's code git log and use the commit_message_id_coverage function from metric.R?

carlosparadis commented 1 week ago

Let's consider functionality outside OpenHub to be the optional M3 in the interest of scope (and sanity!), so pointer it is. : )

beydlern commented 1 week ago

@carlosparadis May you review my notebook to let me know if there's additional functionality I should consider or if I'm missing something crucial?

carlosparadis commented 1 week ago

I will be honest and say that I need to review your M1 first (I am hoping tomorrow's holiday will be the buffer I need to at least catch up on this). In the meantime, however, it seems from your specification that you implemented everything Rick wanted.

One thing that would help me a lot in giving you a next step: how much of the OpenHub API did you cover, and what did you leave out? I recall suggesting you start from the project API given the higher interest in JIRA. If that is done, what else is available, so we can decide whether anything else would be worthwhile for the Kaiaulu userbase?

Other than that, I believe helping @RavenMarQ by distributing the notebook effort for the /exec and the database schema would be a good idea so the group as a whole gets there (there may be too many notebooks in Kaiaulu for a single person to finish, so the task was set as a "the farther you go the better").

Alternatively, there is also the optional M3 which you have an idea of what it looks like in scope from what @daomcgill has been doing. The final considerations of M2 of OpenHub, however, and any other useful feature I would say take priority, and would conclude your M2!

beydlern commented 6 days ago

@carlosparadis I covered the organization-collection, portfolio_projects, project, and analysis endpoints. Other endpoints that may be of interest:

factoid: A Factoid is a short, high-level bullet point delivering a simple observation about a specific project. (Factoids are also stored in the analysis endpoint).
activity_fact: Pre-computed collection of statistics about a project's source code. It summarizes changes to lines of code, commits, and contributors in a single month.
size_fact: Pre-computed collection of statistics about a project's source code. It provides monthly running totals of lines of code, commits, and developer effort.

These 3 (technically 2 because factoids are also in the analysis endpoint) are the only other relevant endpoints that I have not covered. Let me know if any of these endpoints contain interesting data. In my opinion, activity_fact and size_fact are too specific, and the analysis endpoint provides enough relevant high-level statistics (e.g. twelve_month_commit_count and twelve_month_contributor_count). The factoid endpoint's data already exists in the analysis endpoint, but I included it here for the sake of completeness as I'd like to know if you want the factoids to be displayed for each project.

If you consider my M1 and M2 good to go after this, assisting Raven sounds good.

carlosparadis commented 6 days ago

Thank you, this is exactly what I needed. I took a look at the docs, and the only one that seems potentially useful is:

https://github.com/blackducksoftware/ohloh_api/blob/main/reference/language.md

Could you confirm analysis contains this information? What I am looking for is the language % of a project. Just knowing one is major is not sufficient to consider for analysis. Say, if a project is 51% java, that's half of a system we are not really observing. Also, may want to confirm this language endpoint contains what I think it does.

carlosparadis commented 6 days ago

As for the next step: OK to proceed helping with M2. Make sure your contributions are your own commits and code reviews.

beydlern commented 6 days ago

@carlosparadis

Could you confirm analysis contains this information? What I am looking for is the language % of a project. Just knowing one is major is not sufficient to consider for analysis. Say, if a project is 51% java, that's half of a system we are not really observing. Also, may want to confirm this language endpoint contains what I think it does.

The language endpoint contains the language information for all projects across OpenHub, it looks like nothing specific to a single project. If you're looking for the language percentage breakdown for a project, it is contained in the analysis endpoint labelled language.

However, I already added this to the notebook, and it is present in the issue specification checklist section as:

code_languages: A language breakdown with percentages for each substantial (as determined by OpenHub, less contributing languages are grouped and renamed as "Other") contributing language in the project's source code.
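To show how such a breakdown could feed project selection, here is a small Python sketch. The function names, the 75% threshold, and the sample percentages are all hypothetical (the 51% figure echoes the example raised earlier in this thread); the breakdown is passed in as a plain dict rather than parsed from the analysis response.

```python
def language_share(code_languages, language):
    """Percentage of the code base written in `language` (0.0 if absent)."""
    wanted = language.lower()
    for name, percentage in code_languages.items():
        if name.lower() == wanted:
            return percentage
    return 0.0


def covers_enough(code_languages, language, min_share=75.0):
    """True if `language` makes up at least `min_share` percent."""
    return language_share(code_languages, language) >= min_share


# Made-up breakdown: Java is the primary language but barely a majority.
breakdown = {"Java": 51.0, "XML": 30.0, "Other": 19.0}

print(language_share(breakdown, "java"), covers_enough(breakdown, "java"))
```

Here Java is the primary language, yet the check fails because nearly half of the system would go unobserved by a Java-only analysis.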

carlosparadis commented 6 days ago

Sounds good, then I would say pending the revision, that's it for this issue! Please go ahead and move to help on M2 execs for Raven, and the data schema tables interconnection : ) Nice work!