google / osv.dev

Open source vulnerability DB and triage service.
https://osv.dev
Apache License 2.0

Provide a BigQuery Public Dataset of OSV data #889

Open andrewpollock opened 1 year ago

andrewpollock commented 1 year ago

An interesting conversation today spawned the idea of providing OSV data via a BigQuery Public Dataset.

Apparently there's a path from Cloud Datastore to BigQuery. Something to explore further in the future.

oliverchang commented 1 year ago

I initially thought https://cloud.google.com/bigquery/docs/loading-data-cloud-datastore may be an easy way to achieve this, but our underlying Datastore entity format does not match the OSV schema exactly, so it may not be easily usable.

MikeMoore63 commented 7 months ago

Having it available both as a day-partitioned time series and as just the latest snapshot would be brilliant.
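For reference, a day-partitioned table is straightforward to set up with the BigQuery Python client. A minimal sketch, assuming hypothetical project/dataset/table names and a `snapshot_date` column to partition on (only a couple of OSV fields shown for brevity):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical names; only a couple of OSV fields shown for brevity.
table = bigquery.Table(
    "my-project.osv.vulnerabilities",
    schema=[
        bigquery.SchemaField("id", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("modified", "TIMESTAMP"),
        # Date of the daily export this row came from.
        bigquery.SchemaField("snapshot_date", "DATE"),
    ],
)
# Partition by export date so each day's snapshot lands in its own
# partition; "latest" is then just the most recent partition.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="snapshot_date",
)
client.create_table(table)
```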

github-actions[bot] commented 4 months ago

This issue has not had any activity for 60 days and will be automatically closed in two weeks.

katzj commented 1 week ago

I took a bit of a look, and another easy path would be loading from GCS... would there be appetite to add another exporter (or to extend the existing exporter) to generate NDJSON files? Those could then be loaded easily into BigQuery with a load job from GCS.
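For what it's worth, the NDJSON format itself is trivial to produce: one complete JSON object per line. A minimal sketch of such an exporter, with hypothetical input/output paths, assuming the per-vulnerability OSV JSON files are already on disk:

```python
import gzip
import json
from pathlib import Path

def export_ndjson(vuln_dir: Path, out_path: Path) -> None:
    """Concatenate per-vulnerability OSV JSON files into one NDJSON file.

    NDJSON is one complete JSON object per line, which is the format
    BigQuery's GCS load jobs expect. Paths here are hypothetical.
    """
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for path in sorted(vuln_dir.glob("*.json")):
            record = json.loads(path.read_text(encoding="utf-8"))
            # json.dumps with no indent guarantees a single line per record.
            out.write(json.dumps(record) + "\n")

export_ndjson(Path("exports/PyPI"), Path("exports/pypi.ndjson.gz"))
```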

andrewpollock commented 6 days ago

Interesting. So if I'm reading https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json and http://jsonlines.org/ correctly, this is a matter of emitting a single (or perhaps per-ecosystem) file with all of the records concatenated, optionally compressed? This doesn't seem like it would be particularly difficult to modify the exporter to do.
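Assuming the exporter produces such a file in GCS, the load itself is a one-shot job with the BigQuery Python client. A sketch, with hypothetical bucket and table names:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Bucket and table names are hypothetical.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # or pass an explicit schema derived from the OSV schema
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
load_job = client.load_table_from_uri(
    "gs://osv-exports/pypi.ndjson.gz",  # BigQuery handles gzipped JSON
    "my-project.osv.vulnerabilities_latest",
    job_config=job_config,
)
load_job.result()  # Wait for completion; raises on error.
```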

MikeMoore63 commented 6 days ago

I have been doing something like this for my own use case for several years already, and have had to deal with two issues over that time. First, BigQuery does not let you easily change a column's type: in schema 1.58 the credits[].type field changed to an array (credits[].type[]), so I had to coerce the single pre-1.58 entries into arrays, bringing the older records forward to the newer schema.
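A sketch of the kind of coercion described above, applied to each record before loading; the field names and the schema change follow MikeMoore63's description, and the helper name is hypothetical:

```python
def normalize_credit_types(record: dict) -> dict:
    """Coerce scalar credits[].type values into single-element lists.

    BigQuery cannot change a column from STRING to REPEATED STRING in
    place, so older records are rewritten to the newer array shape
    before loading.
    """
    for credit in record.get("credits", []):
        type_value = credit.get("type")
        if isinstance(type_value, str):
            credit["type"] = [type_value]
    return record
```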

The other issue has been the "database_specific" and "ecosystem_specific" fields, which I serialise to JSON and load as strings, since it is unclear whether their contents would remain valid for BigQuery. BigQuery constrains the characters allowed in column names (https://cloud.google.com/bigquery/docs/schemas#column_names); the OSV schema itself complies, but names governed outside the scope of OSV could be a challenge. Otherwise this approach has worked for me with no issues. I just wanted to make sure the constraints on field names were clear.
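And a sketch of the JSON-string workaround for those free-form fields, again applied per record before loading; the helper name is hypothetical:

```python
import json

def stringify_specific_fields(record: dict) -> dict:
    """Serialise free-form OSV fields to JSON strings before loading.

    "database_specific" and "ecosystem_specific" have no fixed schema,
    and their keys are not guaranteed to satisfy BigQuery's column-name
    rules, so storing them as opaque JSON strings sidesteps both
    problems. They can still be queried in BigQuery with JSON functions
    such as JSON_VALUE.
    """
    for key in ("database_specific", "ecosystem_specific"):
        if key in record:
            record[key] = json.dumps(record[key])
        # These fields also appear inside each "affected" entry.
        for affected in record.get("affected", []):
            if key in affected:
                affected[key] = json.dumps(affected[key])
    return record
```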