Figure out how to publish the results

ossf / package-feeds

Feed parsing for language package manager updates

Apache License 2.0

Figure out how to publish the results #25

Open dlorenc opened 3 years ago

dlorenc commented 3 years ago

Right now things are in individual GCS objects, formatted as JSON. This is easy to look at and browse, but probably not the best for querying.

We could load these into BigQuery or some other database, or publish SQLite dumps or something similar. Whatever is useful to people!
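As a rough illustration of the SQLite option, here is a minimal Python sketch that parses one JSON document per package release and loads it into a queryable table. The field names (`name`, `version`, `type`, `created_date`), the `packages` table, and the sample records are all assumptions made up for this example, not the project's confirmed schema or real data.

```python
import json
import sqlite3

# Made-up records standing in for the per-object JSON files.
# Field names are assumptions, not the project's confirmed schema.
SAMPLE_OBJECTS = [
    '{"name": "example-a", "version": "1.0.0", "type": "npm", "created_date": "2021-01-05T10:00:00Z"}',
    '{"name": "example-b", "version": "2.1.0", "type": "pypi", "created_date": "2021-01-05T11:30:00Z"}',
    '{"name": "example-b", "version": "2.0.0", "type": "pypi", "created_date": "2020-11-11T09:00:00Z"}',
]


def load_into_sqlite(json_docs, db_path=":memory:"):
    """Parse one JSON document per release and insert it into a SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS packages (
            name TEXT NOT NULL,
            version TEXT NOT NULL,
            type TEXT NOT NULL,
            created_date TEXT NOT NULL,
            PRIMARY KEY (type, name, version)
        )
        """
    )
    rows = []
    for doc in json_docs:
        record = json.loads(doc)
        rows.append(
            (record["name"], record["version"], record["type"], record["created_date"])
        )
    # The connection context manager commits on success and rolls back on error.
    with conn:
        conn.executemany("INSERT OR REPLACE INTO packages VALUES (?, ?, ?, ?)", rows)
    return conn


def releases_per_feed(conn):
    """Example query that is awkward to run against individual JSON objects."""
    return conn.execute(
        "SELECT type, COUNT(*) FROM packages GROUP BY type ORDER BY type"
    ).fetchall()


if __name__ == "__main__":
    conn = load_into_sqlite(SAMPLE_OBJECTS)
    print(releases_per_feed(conn))  # [('npm', 1), ('pypi', 2)]
```

Passing a file path instead of `:memory:` as `db_path` would produce a database file that could be published as a dump.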

Chime in here if you have ideas for things to do with this data, and can think of formats we can publish in that would be useful for you.

g-k commented 3 years ago

BigQuery would be great since the PyPI dataset is in there. JSON available via CDN would be awesome too. Are there any rules or licensing around use of the data (e.g. how would you like to be attributed)?

And more generally, thanks for working on this! I was talking to Jordan and Ashish from GATech towards the end of last year and did some similar work on the Mozilla Dependency Observatory. So it's great to see you all really running with this, since it's long overdue.

dlorenc commented 3 years ago

Let me check on the data licensing! I think we're planning on one of these two: https://cdla.dev/

I'd probably lean toward the permissive one. Would that work for you?

g-k commented 3 years ago

> Let me check on the data licensing! I think we're planning on one of these two: https://cdla.dev/
>
> I'd probably lean toward the permissive one. Would that work for you?

I think so, I'll confirm with legal internally. Thank you!

g-k commented 3 years ago

To follow up: yes, either CDLA license will work for our initial internal use cases. I'll need to check back with legal if we make the data public or start modifying it, since that carries additional considerations.