snyk-labs / pysnyk

A Python client for the Snyk API.
https://snyk.docs.apiary.io/
MIT License

[BUG]: get_rest_pages not breaking loop when next = self or data.len() = 0 #212

Closed IanYoung-BO closed 8 months ago

IanYoung-BO commented 8 months ago

Is there an existing issue for this?

Description of the bug

When looping through my orgs to retrieve issues, the client appears to reach the last page and then keeps requesting it, adding nothing each time.

Steps To Reproduce

This block of Python code caused an endless loop.

import snyk
import pandas as pd

# Create a Snyk client
client = snyk.SnykClient(snyk_token, version=rest_version, url=rest_url)

# Get all organizations
# orgs = pd.DataFrame(client.get_rest_pages("/orgs"))
orgs = pd.json_normalize(client.get_rest_pages("/orgs"))

# Reindex the DataFrame by organization id
orgs.set_index("id", drop=False, inplace=True)

# Display the number of organizations
print(f"{orgs.shape[0]} organizations found")

issues = None
for org in orgs.itertuples():
    # Get all issues for the organization
    org_issues = pd.json_normalize(client.get_rest_pages(f"/orgs/{org.Index}/issues"))

    # Display the number of issues found for this organization
    print(f"{org_issues.shape[0]} issues found for {org.Index}")

    # If the issues DataFrame has not been created yet, start with this org's issues
    if issues is None:
        issues = org_issues
    else:
        issues = pd.concat([issues, org_issues], ignore_index=True)

Setting the logger to DEBUG shows get_rest_pages requesting the same next link over and over:

DEBUG:GET_REST_PAGES: Added another 0 items to the response
DEBUG:GET_REST_PAGES: Another link exists: /rest/orgs/{org_id}/issues?limit=10&starting_after=eyJzIjoiIn0%3D&version=2023-12-21~beta
DEBUG:GET: https://api.snyk.io/rest/orgs/{org_id}/issues?limit=10&starting_after=eyJzIjoiIn0%3D&version=2023-12-21~beta
DEBUG:Starting new HTTPS connection (1): api.snyk.io:443
DEBUG:https://api.snyk.io:443 "GET /rest/orgs/{org_id}/issues?limit=10&starting_after=eyJzIjoiIn0%3D&version=2023-12-21~beta HTTP/1.1" 200 402
DEBUG:GET_REST_PAGES: Added another 0 items to the response
DEBUG:GET_REST_PAGES: Another link exists: /rest/orgs/{org_id}/issues?limit=10&starting_after=eyJzIjoiIn0%3D&version=2023-12-21~beta
DEBUG:GET: https://api.snyk.io/rest/orgs/{org_id}/issues?limit=10&starting_after=eyJzIjoiIn0%3D&version=2023-12-21~beta
DEBUG:Starting new HTTPS connection (1): api.snyk.io:443
DEBUG:https://api.snyk.io:443 "GET /rest/orgs/{org_id}/issues?limit=10&starting_after=eyJzIjoiIn0%3D&version=2023-12-21~beta HTTP/1.1" 200 402
DEBUG:GET_REST_PAGES: Added another 0 items to the response
DEBUG:GET_REST_PAGES: Another link exists: /rest/orgs/{org_id}/issues?limit=10&starting_after=eyJzIjoiIn0%3D&version=2023-12-21~beta
DEBUG:GET: https://api.snyk.io/rest/orgs/{org_id}/issues?limit=10&starting_after=eyJzIjoiIn0%3D&version=2023-12-21~beta
DEBUG:Starting new HTTPS connection (1): api.snyk.io:443
DEBUG:https://api.snyk.io:443 "GET /rest/orgs/{org_id}/issues?limit=10&starting_after=eyJzIjoiIn0%3D&version=2023-12-21~beta HTTP/1.1" 200 402
DEBUG:GET_REST_PAGES: Added another 0 items to the response
DEBUG:GET_REST_PAGES: Another link exists: /rest/orgs/{org_id}/issues?limit=10&starting_after=eyJzIjoiIn0%3D&version=2023-12-21~beta
DEBUG:GET: https://api.snyk.io/rest/orgs/{org_id}/issues?limit=10&starting_after=eyJzIjoiIn0%3D&version=2023-12-21~beta
DEBUG:Starting new HTTPS connection (1): api.snyk.io:443
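
Notably, the next link in this capture never changes, and its starting_after cursor decodes to an empty cursor, which would suggest the API keeps returning a degenerate pagination link rather than omitting next on the last page. A quick check in plain Python:

import base64
from urllib.parse import unquote

# Cursor copied verbatim from the repeated "next" link in the log above
cursor = unquote("eyJzIjoiIn0%3D")        # -> "eyJzIjoiIn0="
print(base64.urlsafe_b64decode(cursor))   # -> b'{"s":""}'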

Additional Information

Before I started testing the pysnyk client, I was making requests directly with the requests library. I ran across the same issue there and was able to resolve it by adding guard conditions to my pagination loop.

import json
from datetime import datetime
from urllib.parse import urljoin

import pandas as pd
import requests

# rest_url, rest_version, api_root, headers, and data_path are defined at module level

def make_api_request(endpoint):
    """
    This function makes a GET request to the specified endpoint.
    It handles pagination by following the 'next' link in the response until it no longer exists.
    The response data is returned as a pandas DataFrame and saved to a file for debugging purposes.
    """
    # Determine if the url contains existing parameters
    if "?" in endpoint:
        full_url = urljoin(rest_url, endpoint + "&version=" + rest_version)
    else:
        full_url = urljoin(rest_url, endpoint + "?version=" + rest_version)

    # Make the API request
    response = requests.get(full_url, headers=headers)

    # Validate the response before touching the body
    if response.status_code != 200:
        raise Exception(f'API request failed with status {response.status_code}: {response.text}')

    # Write data to file for debugging
    with open(data_path + endpoint.replace("/", "-") + "-" + datetime.now().strftime("%Y%m%d-%H%M%S") + "-0.json", "w") as f:
        f.write(json.dumps(response.json()))

    # Create a DataFrame from the normalized response data
    df = pd.json_normalize(response.json()["data"])

    # If there are links in the response body
    if response.json()["links"]:
        if "self" in response.json()["links"]:
            self_url = api_root + response.json()["links"]["self"]

        # Parse response body links to determine if there are more pages
        if "next" in response.json()["links"]:
            next_url = api_root + response.json()["links"]["next"]

            iteration_index = 1
            while next_url:
                next_response = requests.get(next_url, headers=headers)

                # If the data is missing or empty, break out of the loop (redundant check)
                if "data" in next_response.json():
                    if len(next_response.json()["data"]) == 0:
                        break
                else:
                    break

                # Write data to file for debugging
                with open(data_path + endpoint.replace("/", "-") + "-" + datetime.now().strftime("%Y%m%d-%H%M%S") + f"-{iteration_index}.json", "w") as f:
                    f.write(json.dumps(next_response.json()))

                # Create a new DataFrame from the normalized next response data
                next_response_df = pd.json_normalize(next_response.json()["data"])

                # Merge the new DataFrame with the existing DataFrame
                df = pd.concat([df, next_response_df], ignore_index=True)

                if "next" in next_response.json()["links"]:
                    # Make sure the next url is unique; if the API returns the
                    # same link again (next == self), stop rather than loop forever
                    if next_response.json()["links"]["next"] == next_url.split(api_root)[1]:
                        break
                    else:
                        next_url = api_root + next_response.json()["links"]["next"]
                else:
                    # If there is no next url, break out of the loop
                    break

                # Increment the iteration index
                iteration_index += 1

            # Release the last page's DataFrame
            next_response_df = None

    # Reindex the DataFrame and return it
    df.reset_index(drop=True, inplace=True)
    return df
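
As an illustration, a call might look like this (org_id here is a placeholder for a real organization id):

# Hypothetical usage, assuming the module-level config above is in place
org_id = "your-org-id"
issues_df = make_api_request(f"/orgs/{org_id}/issues?limit=100")
print(f"{issues_df.shape[0]} issues found")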

Now, I'm converting everything to Pandas DataFrames before returning, but the same idea applies. I'm checking for the following before requesting another page:

- the response body contains a non-empty "data" array
- a "next" link exists in the response "links"
- the "next" link differs from the URL that was just requested

This has rectified the issue for me, and one or more of these checks should fix the get_rest_pages function here.
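
A minimal sketch of how those guards might fit into a REST pagination loop (illustrative only, not the actual pysnyk implementation; fetch_page is a hypothetical callable standing in for whatever performs the authenticated GET):

def get_all_pages(fetch_page, first_path):
    """Follow 'next' links, guarding against empty pages and repeated links.

    fetch_page takes a path and returns the parsed JSON body,
    e.g. {"data": [...], "links": {"next": "..."}}.
    """
    data = []
    path = first_path
    while path:
        body = fetch_page(path)
        page = body.get("data", [])
        # Guard 1: an empty page means there is nothing left to collect
        if not page:
            break
        data.extend(page)
        next_path = body.get("links", {}).get("next")
        # Guards 2 and 3: stop when there is no next link, or when the
        # API hands back the same link it just served (next == self)
        if not next_path or next_path == path:
            break
        path = next_path
    return data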

IanYoung-BO commented 8 months ago

I do want to note that I am using version 2023-12-21~beta of the REST API, primarily because of the lack of issues reporting in the release versions. I realize that this issue could be on the API side, since the endpoint has not been fully released yet and may not be handling pagination properly in its responses. However, adding some simple error handling would improve the resiliency of this client and enable more users to work with beta endpoints.

garethr commented 8 months ago

Thanks, I've fixed up a formatting issue and merged this in https://github.com/snyk-labs/pysnyk/commit/47ce10a784d01477f95435c9c30a4127ed61c292. Much appreciated.