ossf / scorecard

OpenSSF Scorecard - Security health metrics for Open Source
https://scorecard.dev
Apache License 2.0

Availability and freshness of data available on Google Storage #978

Closed fridex closed 2 years ago

fridex commented 2 years ago

Is your feature request related to a problem? Please describe.

Hi everyone,

I'm trying to consume scorecards from Google Cloud Storage at gs://ossf-scorecards/. The latest entry I can see is gs://ossf-scorecards/05-17-2021.json, judging by the date in the object key. That date also matches the Date field of the scorecard entries in gs://ossf-scorecards/latest.json.

I would like to ask how often scorecards are refreshed and made available to the community.

Thanks in advance for any response.

naveensrinivasan commented 2 years ago

Scorecard has moved to BigQuery: https://github.com/ossf/scorecard#public-data. Please use that. Thanks.

fridex commented 2 years ago

If I understand it correctly, authentication is required to obtain the dataset. Is there a way to access the dataset without authentication? Or, are there any plans to provide the dataset in an open way? Thanks for your response.

naveensrinivasan commented 2 years ago

AFAIK this is open and does not require authentication. @azeemshaikh38, can you please chime in?

fridex commented 2 years ago

I've tried the example from the README file:

$ bq query --nouse_legacy_sql 'SELECT partition_id FROM
openssf.scorecardcron.INFORMATION_SCHEMA.PARTITIONS WHERE table_name="scorecard"
ORDER BY partition_id DESC
LIMIT 1'
ERROR: (bq) You do not currently have an active account selected.
Please run:

  $ gcloud auth login

to obtain new credentials.

If you have already logged in with a different account:

    $ gcloud config set account ACCOUNT

to select an already authenticated account to use.

I also tried with the Python client:

from google.cloud import bigquery

client = bigquery.Client()
query_job = client.query("""SELECT partition_id FROM openssf.scorecardcron.INFORMATION_SCHEMA.PARTITIONS WHERE table_name="scorecard" ORDER BY partition_id DESC LIMIT 1""")
results = query_job.result()  

with the following exception raised:

Traceback (most recent call last):
  File "./t.py", line 5, in <module>
    client = bigquery.Client()
  File "/home/fpokorny/.local/share/virtualenvs/prescriptions-refresh-job-equYSKS1/lib/python3.8/site-packages/google/cloud/bigquery/client.py", line 215, in __init__
    super(Client, self).__init__(
  File "/home/fpokorny/.local/share/virtualenvs/prescriptions-refresh-job-equYSKS1/lib/python3.8/site-packages/google/cloud/client.py", line 316, in __init__
    _ClientProjectMixin.__init__(self, project=project, credentials=credentials)
  File "/home/fpokorny/.local/share/virtualenvs/prescriptions-refresh-job-equYSKS1/lib/python3.8/site-packages/google/cloud/client.py", line 264, in __init__
    project = self._determine_default(project)
  File "/home/fpokorny/.local/share/virtualenvs/prescriptions-refresh-job-equYSKS1/lib/python3.8/site-packages/google/cloud/client.py", line 283, in _determine_default
    return _determine_default_project(project)
  File "/home/fpokorny/.local/share/virtualenvs/prescriptions-refresh-job-equYSKS1/lib/python3.8/site-packages/google/cloud/_helpers.py", line 152, in _determine_default_project
    _, project = google.auth.default()
  File "/home/fpokorny/.local/share/virtualenvs/prescriptions-refresh-job-equYSKS1/lib/python3.8/site-packages/google/auth/_default.py", line 486, in default
    raise exceptions.DefaultCredentialsError(_HELP_MESSAGE)
google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started

Thanks for any pointers.

azeemshaikh38 commented 2 years ago

You don't need authentication to access the data itself. You need it so that, if you exceed the free tier limit, Google Cloud knows which project/account to bill. See https://cloud.google.com/bigquery/public-data.

The first terabyte of data processed per month is free, so you can start querying public datasets without enabling billing.

You need to log into a Google Cloud project/account before running the above queries.
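
For anyone hitting the same DefaultCredentialsError, here is a minimal sketch of the README query run through the Python client with a billing project passed explicitly. It assumes Application Default Credentials are already set up (for example via `gcloud auth application-default login`). The function name `latest_partition` and the `billing_project` parameter are hypothetical names chosen for illustration, not part of Scorecard.

```python
# The query from the README, unchanged apart from line breaks.
LATEST_PARTITION_QUERY = """
SELECT partition_id
FROM openssf.scorecardcron.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = "scorecard"
ORDER BY partition_id DESC
LIMIT 1
"""


def latest_partition(billing_project: str) -> str:
    """Return the newest partition_id of the public scorecard table.

    `billing_project` is your own Google Cloud project (hypothetical
    parameter name); it is only used for billing and quota, the data
    itself lives in the public `openssf` project.
    """
    # Imported lazily so this module still loads without the client library.
    from google.cloud import bigquery

    # Credentials are picked up from Application Default Credentials.
    client = bigquery.Client(project=billing_project)
    rows = client.query(LATEST_PARTITION_QUERY).result()
    # The query is limited to one row, so take the first (and only) one.
    return next(iter(rows))["partition_id"]
```

Calling `latest_partition("my-billing-project")` with your own project ID in place of the placeholder should return the most recent partition ID, provided the account has BigQuery access in that project.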

fridex commented 2 years ago

@naveensrinivasan @azeemshaikh38 Thanks for the pointers and help. Querying indeed works, though it would be slightly easier for us to use GS in our deployment.

We use scorecard in project Thoth - the Python cloud resolver. Examples for flask or tensorflow. Please let us know if you want to combine efforts in some way.

azeemshaikh38 commented 2 years ago

Very happy to know that project Thoth is using Scorecard. If you prefer GS, you can export the BQ data into GS using the instructions here: https://cloud.google.com/bigquery/docs/exporting-data. The APIs also allow you to set up a job that regularly does this export for you.
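
As a rough sketch of such an export with the Python client, assuming you own a bucket in a location compatible with the dataset and have write access to it: the table name is taken from the README query earlier in this thread, while `destination_uri`, `export_scorecard`, the `scorecard/` object prefix, and the date-based key layout (modelled on the old `05-17-2021.json` keys) are hypothetical choices for illustration.

```python
from datetime import date

# Project, dataset and table as they appear in the README query above.
SOURCE_TABLE = "openssf.scorecardcron.scorecard"


def destination_uri(bucket: str, day: date) -> str:
    """Build the target object pattern (hypothetical key layout)."""
    # The wildcard lets BigQuery split a large export across several objects.
    return f"gs://{bucket}/scorecard/{day:%m-%d-%Y}/part-*.json"


def export_scorecard(billing_project: str, bucket: str, day: date) -> str:
    """Export the scorecard table to your own bucket as newline-delimited JSON."""
    # Imported lazily so this module still loads without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    job_config = bigquery.ExtractJobConfig()
    job_config.destination_format = (
        bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
    )
    uri = destination_uri(bucket, day)
    extract_job = client.extract_table(SOURCE_TABLE, uri, job_config=job_config)
    extract_job.result()  # Block until the export job finishes.
    return uri
```

Running `export_scorecard` from whatever scheduler you already use (a cron job, for instance) would be one way to get the regular export mentioned above. This has not been tested against the public dataset.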

Happy to collaborate more. We have a bi-weekly Scorecard sync on the OpenSSF calendar: https://calendar.google.com/calendar/u/0?cid=czYzdm9lZmhwNWk5cGZsdGI1cTY3bmdwZXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ. We can discuss possible collaboration during that meeting if you choose to join any of the upcoming ones.