Open andrewpollock opened 1 year ago
I initially thought https://cloud.google.com/bigquery/docs/loading-data-cloud-datastore may be an easy way to achieve this, but our underlying Datastore entity format does not match the OSV schema exactly, so it may not be easily usable.
Having it as a time series day partitioned as well as just latest would be brilliant.
This issue has not had any activity for 60 days and will be automatically closed in two weeks
I took a little bit of a look and another easy path would be loading from gcs... would there be appetite to add another exporter (or to the existing exporter) that generates ndJSON files? Those could then be easily loaded with a bigquery loader from gcs
Interesting. So if I'm reading https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json and http://jsonlines.org/ correctly, this is a matter of emitting a single (or perhaps, per-ecosystem) file with all of the records concatenated and optionally compressed? This doesn't seem like it would be particularly difficult to modify the exporter to do.
So I am doing something like this in my use case and have been for several years already and have had to deal with 2 areas over this time period. Big query you cannot easily change a column type so in schema 1.58 the credits[].type field changed to an array so to credits[].type[] so I had to force the single entries pre 1.58 to an array. So bought the array forward to the newer schema.
The other area has been the "database_specific" and "ecosystem_specific" fields which I serialise to json and load as a string as unclear if this would remain valid for big queruy. Big query constrains characters allowed in column names https://cloud.google.com/bigquery/docs/schemas#column_names and the osv schema does match but I would assume governing names out of scope of osv is a possible challenge. This approach has worked for me otherwise with no issues. Just wante to make sure the constraints on field names was clear.
An interesting conversation today spawned the idea of providing OSV data via a BigQuery Public Dataset.
Apparently there's a path from Cloud Data Store to BigQuery. Something to explore further in the future.