postman-open-technologies / openapi-github-search

OpenAPI GitHub Search (GSoC 2023)
Apache License 2.0

List the challenges of using the GitHub search API to search for OpenAPI definitions #1

Open MikeRalphson opened 1 year ago

MikeRalphson commented 1 year ago

Think about:

Resources:

money8203 commented 1 year ago

Some potential challenges of using the GitHub search API to search for OpenAPI definitions could include:

bshreyasharma007 commented 1 year ago

List of challenges which one may encounter while using the GitHub search API to search for OpenAPI definitions

Scope and Size of the Data: GitHub is a vast repository of code, and the search API can return a large number of results, making it challenging to find the specific OpenAPI definitions you are looking for. There is a limit to the number of repositories a query will search through. The REST API will find up to 4,000 repositories that match your filters and return results from those repositories. One cannot use queries that are longer than 256 characters (not including operators or qualifiers) or have more than five AND, OR, or NOT operators.
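The query-length and operator limits above can be checked before a request is sent. The following is a minimal sketch, not GitHub's own validation: the helper name `check_query_limits` is hypothetical, and the way it separates qualifiers from search terms (anything shaped like `key:value`) is an assumption about how the 256-character limit is counted.

```python
import re

MAX_QUERY_CHARS = 256       # limit quoted above, excluding operators/qualifiers
MAX_BOOLEAN_OPERATORS = 5   # limit quoted above for AND / OR / NOT

_OPERATORS = {"AND", "OR", "NOT"}
# Assumption: a qualifier looks like key:value, e.g. filename:openapi.yaml
_QUALIFIER = re.compile(r"^-?[A-Za-z_]+:\S+$")


def check_query_limits(query: str) -> list:
    """Return human-readable limit violations; an empty list means none found."""
    tokens = query.split()
    operators = [t for t in tokens if t in _OPERATORS]
    terms = [t for t in tokens if t not in _OPERATORS and not _QUALIFIER.match(t)]

    problems = []
    term_chars = len(" ".join(terms))
    if term_chars > MAX_QUERY_CHARS:
        problems.append(f"search terms use {term_chars} characters (limit {MAX_QUERY_CHARS})")
    if len(operators) > MAX_BOOLEAN_OPERATORS:
        problems.append(f"{len(operators)} boolean operators (limit {MAX_BOOLEAN_OPERATORS})")
    return problems
```

A caller could run this on each candidate query and split any query that reports a problem into several smaller ones.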

There is also a limit on how long any individual query can run. If a query exceeds the time limit, the API returns the matches that were already found prior to the timeout, and the response has the incomplete_results property set to true.

Reaching a timeout does not necessarily mean that search results are incomplete. More results might have been found, but also might not.
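As a rough illustration of the timeout behaviour, the sketch below inspects an already-parsed search response. The field names `total_count`, `incomplete_results`, and `items` are the ones the search API returns; the helper name `summarize_search_response` is hypothetical.

```python
def summarize_search_response(payload: dict) -> dict:
    """Summarize a parsed search response, flagging a possible timeout."""
    items = payload.get("items", [])
    return {
        "total_count": payload.get("total_count", 0),
        "returned": len(items),
        # True means the query hit the time limit; results may or may not be partial.
        "timed_out": bool(payload.get("incomplete_results", False)),
    }
```

When `timed_out` is true, one option is to retry with a narrower query, since a timed-out response does not say whether anything was actually missed.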

The GitHub search documentation states that authentication is not required for searching public repositories. However, searching private repositories requires authentication using an access token or OAuth. You need to authenticate successfully and have access to the repositories in your search queries; otherwise, you'll see a 422 Unprocessable Entity error with a "Validation Failed" message.
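A minimal sketch of how the authenticated and unauthenticated cases might differ on the client side. The helper name `build_search_headers` is hypothetical, and where the token comes from is left to the caller.

```python
from typing import Optional


def build_search_headers(token: Optional[str] = None) -> dict:
    """Build request headers; the Authorization header is added only with a token."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # Personal access tokens and OAuth tokens are both sent as bearer tokens.
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

Calling `build_search_headers()` with no argument gives headers suitable for public, unauthenticated searches.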

Q) What is searched and what is not? The GitHub search API covers specific data types such as code, repositories, issues, pull requests, and commits. The following are searched:

Code: Text match metadata is available for the file content and file path fields when you pass the text-match media type.

Only the default branch is considered when searching for code in repositories. Only files smaller than 384 KB are searchable when searching for code.

Commits: Find commits via various criteria on the default branch (usually main).

Get text match metadata for the message field when you provide the text-match media type.

Issues and pull requests: When searching for issues, text match metadata is only available for the issue title, issue body, and issue comment body fields.

Repositories: The API can search for repositories based on the name, description, and other attributes.

Commits: The API can search for commits based on various criteria, such as author, committer, date, and more.
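To make the text-match metadata mentioned above concrete, here is a small sketch that pulls the highlighted fragments out of one result item. It assumes the request was sent with the `application/vnd.github.text-match+json` media type, in which case each item carries a `text_matches` list; the helper name `matched_fragments` is hypothetical.

```python
TEXT_MATCH_ACCEPT = "application/vnd.github.text-match+json"


def matched_fragments(item: dict) -> list:
    """Collect (property, fragment) pairs from one search result item."""
    return [
        (match.get("property"), match.get("fragment"))
        for match in item.get("text_matches", [])
    ]
```

For a code search, the `property` value indicates whether the match was in the file content or the file path.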

You must always include at least one search term when searching for code.

Rate limits: unauthenticated search requests are limited to 10 per minute, while authenticated requests have a higher limit of up to 30 per minute.
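One way to stay inside these limits is to pace requests and to honour the rate-limit response headers. This is a sketch under assumptions: the function names are hypothetical, and the header keys are assumed to be lower-cased already (`x-ratelimit-remaining`, `x-ratelimit-reset`).

```python
import time
from typing import Optional

# Search limits quoted above, keyed by whether the caller is authenticated.
SEARCH_REQUESTS_PER_MINUTE = {True: 30, False: 10}


def min_seconds_between_requests(authenticated: bool) -> float:
    """Spacing that keeps a steady stream of requests under the per-minute limit."""
    return 60.0 / SEARCH_REQUESTS_PER_MINUTE[authenticated]


def seconds_until_reset(headers: dict, now: Optional[float] = None) -> float:
    """How long to wait when the remaining quota in the response headers is zero."""
    remaining = int(headers.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = float(headers.get("x-ratelimit-reset", "0"))  # epoch seconds
    current = time.time() if now is None else now
    return max(0.0, reset_at - current)
```

A client loop could sleep for `min_seconds_between_requests(...)` between calls and fall back to `seconds_until_reset(...)` after a rate-limited response.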

Q) Are any languages better or worse suited to implementing a GitHub search?

GitHub search is exposed through the GitHub REST API, which can be used from many languages, so any programming language that can make HTTP requests and parse JSON should work.

Q) What search terms would you use? (Can you find OpenAPI definitions manually first?) I would try keywords related to OpenAPI such as "openapi", "swagger", "openapi.yaml", "openapi.json", "swagger.yaml", "swagger.json", etc. Additionally, I would try to narrow the search to specific repositories or organizations that are likely to contain OpenAPI definitions.

We can try to find keywords by searching manually in the GitHub search bar; some repositories may have an "openapi" tag.

When we search manually, we find that the keyword alone does not suffice; we may also have to consider the description shown with each search result. Still, it can be a starting point for optimizing use of the GitHub API to search for OpenAPI definitions.
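Because a keyword match alone does not prove a file is an OpenAPI definition, a follow-up check on the file contents can help. The sketch below is a heuristic, not a validator: it relies on the fact that OpenAPI 3.x documents have a top-level `openapi` field and Swagger 2.0 documents have a top-level `swagger` field. The function name `looks_like_openapi` is hypothetical, and the YAML branch uses a simple regular expression rather than a real YAML parser.

```python
import json
import re

# Matches an unindented "openapi:" or "swagger:" key at the start of any line.
_YAML_TOP_LEVEL_KEY = re.compile(r"^(openapi|swagger)\s*:\s*\S+", re.MULTILINE)


def looks_like_openapi(text: str) -> bool:
    """Heuristically decide whether file contents are an OpenAPI/Swagger document."""
    try:
        doc = json.loads(text)
    except ValueError:
        # Not JSON: fall back to a line-based check for YAML files.
        return bool(_YAML_TOP_LEVEL_KEY.search(text))
    return isinstance(doc, dict) and ("openapi" in doc or "swagger" in doc)
```

This filters out files that merely mention "openapi" in prose, at the cost of missing unusual layouts such as quoted YAML keys.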

Rudrak3 commented 1 year ago

When using the GitHub search API to search for OpenAPI definitions, there are several challenges one may encounter. Firstly, GitHub is a vast repository of code, and the search API can return a large number of results, making it difficult to find the specific OpenAPI definitions you are looking for. Additionally, there is a limit to the number of repositories a query will search through, which is set at 4,000 repositories. The queries also cannot exceed 256 characters in length (excluding operators and qualifiers) and cannot have more than five AND, OR, or NOT operators.

Another challenge is the limit on how long an individual query can run. If a timeout is reached, the API returns the matches that were already found prior to the timeout, and the response has the incomplete_results property set to true. However, reaching a timeout does not necessarily mean that search results are incomplete.

Furthermore, authentication is required for searching private repositories using the GitHub API. Without successful authentication and access to the repositories in your search queries, the API returns a 422 Unprocessable Entity error with a "Validation Failed" message.

Regarding what can be searched, the GitHub API can search for specific data types, such as code, repositories, issues, pull requests, and commits. For code searches, only the default branch is considered, and only files smaller than 384 KB are searchable. Commits can be searched based on various criteria, such as author, committer, date, and more, and text match metadata is available for the message field. When searching for issues and pull requests, text match metadata is only available for the title, body, and comment body fields.

In terms of programming languages, the GitHub API can be used from many programming languages, but if we opt for a language with better community/developer support, like JavaScript or Python, the libraries used to connect to the GitHub API may be easier to use and offer more features.

To search for OpenAPI definitions, keywords such as "openapi", "swagger", "openapi.yaml", "openapi.json", "swagger.yaml", and "swagger.json" could be used. It may also be helpful to narrow down the search to specific repositories or organizations that are likely to contain OpenAPI definitions.

Before settling on likely keywords, we may have to search manually to find which keywords are actually helpful for locating OpenAPI definitions.
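The keyword ideas above could be turned into concrete code-search requests along these lines. This is only a sketch: the helper name `build_code_search_urls` and the organization `example-org` are hypothetical, and the choice of one query per candidate filename is an assumption made to keep each query short.

```python
from urllib.parse import urlencode

SEARCH_CODE_URL = "https://api.github.com/search/code"
CANDIDATE_FILENAMES = ["openapi.yaml", "openapi.json", "swagger.yaml", "swagger.json"]


def build_code_search_urls(term: str = "openapi", org: str = "", per_page: int = 100) -> list:
    """Build one code-search URL per candidate filename, optionally scoped to an org."""
    urls = []
    for name in CANDIDATE_FILENAMES:
        parts = [term, f"filename:{name}"]
        if org:
            parts.append(f"org:{org}")  # narrows the search to one organization
        query = urlencode({"q": " ".join(parts), "per_page": per_page})
        urls.append(f"{SEARCH_CODE_URL}?{query}")
    return urls
```

For example, `build_code_search_urls(org="example-org")` yields four URLs, each restricted to a single filename within that organization.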

ishaan812 commented 1 year ago

Some of the challenges I could think of are:

kaushikc44 commented 1 year ago

Is this required to do for the proposal? Do let me know thanks.

MikeRalphson commented 1 year ago

> Is this required to do for the proposal? Do let me know thanks.

@kaushikc44 one of the "good first issues" being completed is a prerequisite for applying to GSoC.

kaushikc44 commented 1 year ago

Got it, thanks. Is any one good first issue enough, or both of them?

MikeRalphson commented 1 year ago

@kaushikc44 just one of the issues is fine.

kaushikc44 commented 1 year ago
  1. Scope and size of the data. The GitHub search API only covers public repositories unless the caller authenticates and has access, which means OpenAPI definitions stored in private repositories are generally not visible. Additionally, the data can be large, which makes it challenging to fetch and query.

However, a larger data set would be beneficial if we follow my proposal of using LangChain, which could be trained in a well-defined manner. We can also avoid fetching the data in real time by storing it in our own DB and updating it only when the repositories change. This would improve our process.

  2. Authentication requirements. From the GitHub API docs, it is clear that unauthenticated users can only make 10 requests per minute. To authenticate, users must have a valid GitHub account and generate a token, which can be daunting for new users. This could be avoided using our AI model: with just a few simple English sentences, they could get a token generated automatically, removing that barrier for new users.

  3. What is searched and what is not:

    1. Queries cannot be longer than 256 characters or contain more than 5 operators.

    2. The scope is limited to 4,000 repositories, so users have to be very specific or run multiple queries to get more data.

    3. Types of search that are available:
      1. "Text-match", where we provide text contained within a repository.
      2. "Text-code", where we search for code. It is similar to "text-match", but we can provide the extension of the code, such as javascript or py, to restrict the search to those files. It has some restrictions: only files below 384 KB can be searched, and the search automatically considers only the default branch (for example, master), so it helps if the user knows the specific file's name.
      3. "Commits": commits can also be searched using the GitHub API.
      4. "Issues and pull requests" can also be searched, including historical issues. To find issues or pull requests, the query needs the tag: issue or pull.
      5. "Labels": we can even search for labels. However, if several labels have the same name, the most recent one is preferred.
      6. Repositories: the most popular results with a matching name are returned.
      7. Topics: the most popular results are returned.
      8. Users: the example in the docs restricts results to users with more than 42 repositories, and results can be sorted in descending order so that the users with the most repositories appear first.

    4. Types of search that are not available:
      1. Binary or audio/image files: as they are stored as binary data, converting an image or audio file and searching the binary data is a harder, more extensive process.
      2. File names/folder names: many repositories have the same file or folder name, for instance "src", a very common folder name in JavaScript projects, so searching by these alone will be highly difficult.

  4. What are the functional limitations placed on GitHub API searches? GitHub has certain API request limitations: authenticated users can make only 30 requests per minute, whereas unauthenticated users can make 10 requests per minute. (I have listed more issues under "What is searched and what is not" for the different types of queries.)

  5. Are any languages better or worse suited to implementing a GitHub search? While the GitHub search API can be used with any programming language that can make HTTP requests, some languages may be better suited to implementing a search. For example, Python has many lightweight libraries such as FastAPI, along with tools for working with APIs and handling JSON data. Also, Express.js can be used if users prefer to use only JavaScript.

  6. What search terms would you use? (Can you find OpenAPI definitions manually first?) To search for OpenAPI definitions, we could try various keywords that start with OpenAPI, along with file extensions. It is also helpful to search for repositories with README files or any file tagged with API. However, not all API definitions are named or structured consistently.

kaushikc44 commented 1 year ago

@MikeRalphson I have also included a couple of my own suggestions beyond the questions above. I hope that's alright and welcome.

MikeRalphson commented 1 year ago

@kaushikc44 absolutely, keep up the good work!

nfonjeannoel commented 1 year ago

Scope and size of the data: One challenge of using the GitHub search API to search for OpenAPI definitions is the size of the data. GitHub hosts millions of repositories, and not all repositories contain OpenAPI definitions. Therefore, it can be time-consuming to search through all repositories to find the relevant OpenAPI definitions.

Authentication requirements: To use the GitHub search API to its fullest, authentication is required. For authenticated requests, you can make up to 30 requests per minute. For unauthenticated requests, the rate limit allows you to make up to 10 requests per minute.

What is searched and what is not? The GitHub search API only searches files that are indexed by the GitHub search engine. Therefore, not all files within a repository may be indexed, and OpenAPI definitions that are not indexed can be missed. Additionally, the search API does not search the contents of binary files, such as images or PDFs. Also, only up to 4,000 repositories are searched.

Functional limitations placed on GitHub API searches: The GitHub API has functional limitations that can limit search capabilities. For example, the search API only returns up to 1,000 results per search. This limit may be too low if the search query is too broad. Additionally, the API may not be able to handle complex search queries or search parameters.
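The 1,000-result cap interacts with pagination: at the maximum page size of 100, at most ten pages are reachable for any one query. A small sketch of that arithmetic follows, with the hypothetical helper name `pages_to_fetch`.

```python
MAX_RESULTS = 1000   # cap on results returned for a single search query
MAX_PER_PAGE = 100   # largest page size the API accepts


def pages_to_fetch(total_count: int, per_page: int = MAX_PER_PAGE) -> range:
    """Page numbers worth requesting, given the total_count from the first response."""
    per_page = max(1, min(per_page, MAX_PER_PAGE))
    reachable = min(max(total_count, 0), MAX_RESULTS)
    last_page = -(-reachable // per_page)  # ceiling division
    return range(1, last_page + 1)
```

If `total_count` is well above 1,000, the query is too broad to page through completely, and splitting it into narrower queries is one way to reach the remaining results.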

Are any languages better or worse suited to implementing a GitHub search? The GitHub search API is a RESTful API that can be accessed using any programming language that can make HTTP requests. However, some languages may have better support for making HTTP requests and handling JSON responses, which can make it easier to work with the GitHub API. Personally, I find Python better as it is easy for me to use. JavaScript is also a good option.

What search terms would you use? (Can you find OpenAPI definitions manually first?) To find OpenAPI definitions, some search terms that can be used include "openapi", "swagger", "api.yaml", "api.json", "openapi.yaml", and "openapi.json".

rohan-kulkarni-25 commented 1 year ago

- the scope and size of the data

I guess the size of the data will depend on the way we search for it. If we search all repositories of a particular user, the data size will be large. But if we search a specific repository, the size of the data will be small, which will help us in searching. We can do this by asking the user to provide the index number of the repository. The user can get that number by sending a request to the API, whose response contains the list of all repositories.

Authentication requirements

Authentication is compulsory. The major task will be to get the token key of the user and utilize it. We need to do some research on this point regarding how we can make it as easy as possible for the end user. Because when I tried the GitHub API for the first time the most difficult part was to understand the auth.

What is searched and what is not?

We can provide multiple searches which are provided by default from GitHub and also add more of our own which we need while building some projects.

Are any languages better or worse suited to implementing a GitHub search? I think building the project in JavaScript will be better, as most developers use JavaScript to work with the API, so there will not be compatibility issues.

Nebulaa11 commented 1 year ago

The GitHub search API can be a powerful tool to search for OpenAPI (formerly known as Swagger) files, but it comes with some challenges:

Limited search results: The search API has a limit on the number of search results it returns. If your search query is too broad, you may not get all the OpenAPI files you are looking for.

Incomplete indexing: GitHub may not index all the OpenAPI files on the platform, especially if they are hosted in private repositories or are not labeled with relevant tags or keywords.

Lack of filtering options: The search API has limited filtering options, which can make it difficult to narrow down search results to specific criteria such as version, license, or date modified.

Rate limiting: GitHub limits the number of API requests you can make in a given time period, which can slow down your search or even cause it to fail if you exceed the limit.

Authentication requirements: Access to private repositories or certain files may require authentication, which can make it difficult to search for OpenAPI files across multiple repositories.

Search syntax: GitHub's search syntax can be complex, and it may take some time to learn how to use it effectively to find OpenAPI files.

Overall, while the GitHub search API can be a useful tool for finding OpenAPI files, it is important to be aware of its limitations and to use it in conjunction with other search methods to ensure you are finding all the files you need.

Regarding the questions above:

Scope and size of the data: The scope and size of the data on GitHub can be vast. GitHub is one of the largest code-sharing platforms in the world, with millions of repositories containing source code, documentation, and other files. The scope and size of the data can make it challenging to find the specific OpenAPI files you need.

Authentication requirements: Access to private repositories or certain files may require authentication, which can make it challenging to search for OpenAPI files across multiple repositories.

What is searched and what is not: The GitHub search API searches through code, commit messages, and other text-based content in public repositories. However, it does not search through binary files or files in private repositories unless you have access to them.

Functional limitations: GitHub API searches have some functional limitations, such as rate limiting and incomplete indexing. These limitations can affect the accuracy and completeness of your search results.

Language suitability: The language you choose to implement the GitHub search API depends on your specific needs and the tools you are familiar with. Some languages, such as Python and JavaScript, have GitHub API libraries and may be easier to work with.

Search terms: To search for OpenAPI definitions, you can use specific keywords such as "openapi", "swagger", or "swagger.yml" to narrow down your search results. You can also search for repositories that are labeled with these keywords to find OpenAPI files more easily. Manually searching for OpenAPI definitions can also be helpful in finding the specific files you need.

Vineet-Sharma1927 commented 1 month ago

1. Scope and Size of the Data
- Vast Repository of Code: GitHub hosts millions of repositories, making the search results highly varied.
- Diverse File Formats: OpenAPI definitions can exist in multiple formats (YAML, JSON), which complicates precise searching.
- Data Overload: The extensive data pool can make it difficult to filter out non-relevant results when searching for specific OpenAPI files.

2. Authentication Requirements
- Rate Limiting: While unauthenticated requests are possible, they are severely limited to 10 requests per minute.
- Authenticated Access: Using OAuth or personal access tokens increases the request limit to 30 per minute, providing more flexibility and speed in large-scale searches.

3. What is Searched and What is Not?
- Indexed Data: GitHub's search API searches file contents, names, repositories, and descriptions, but does not index larger files (over 384 KB).
- Metadata Search Only: It doesn't crawl through non-text file formats or search inside binary files, which could hinder searching OpenAPI definitions stored in uncommon formats.

4. Functional Limitations of GitHub API Searches
- Pagination of Results: Search results are paginated, meaning retrieving a large number of results takes multiple calls.
- Limited Queries: Queries are constrained by API limits, with no more than 1,000 results returned for any search.
- Inconsistent Results: Depending on how the OpenAPI definitions are labeled or stored, the results may vary widely in relevance.

5. Language Suitability
- Language-Specific Tools: Languages with good HTTP client libraries like Python (with requests or httpx) or JavaScript (with Axios or fetch) are better suited to handling the API’s interaction and authentication.
- Better for Text Parsing: Languages like Python excel in parsing the JSON responses from GitHub API due to their robust libraries (json, yaml).

6. Search Terms and Manual Discovery
- Suggested Search Terms:
  - "openapi" OR "swagger" filename:*.yml OR filename:*.yaml OR filename:*.json
  - "openapi" in:file path:/api/ OR path:/docs/
- Manual Search: Manually exploring well-known repositories in the API space can help you spot common directory structures where OpenAPI definitions reside (e.g., docs/, api/).