jlewi / foyle

AI For Software Operations
https://foyle.io
Apache License 2.0

Feature: publish data to Hugging Face (private datasets repository) #85

Open Josephrp opened 3 months ago

Josephrp commented 3 months ago

Issue

The data created is not secured.

Solution

Upload logs with training splits to Hugging Face with public=false, plus tags to identify the datasets.
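
A rough sketch of what that could look like, assuming the logs are already available as a list of JSON-serializable records. The record shape, the `foyle-logs` tag, and all function names here are hypothetical, not anything Foyle ships. It writes train/test JSONL files plus a dataset card whose YAML header carries the tags, then pushes the folder to a private dataset repo through `huggingface_hub`.

```python
import json
import random
from pathlib import Path


def split_records(records, test_fraction=0.1, seed=0):
    """Shuffle records deterministically and split them into train/test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}


def write_dataset_folder(records, out_dir, tags):
    """Write one JSONL file per split and a README.md dataset card listing the tags."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in split_records(records).items():
        with open(out / f"{name}.jsonl", "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
    # Tags live in the YAML front matter of the dataset card.
    card = "---\ntags:\n" + "".join(f"- {t}\n" for t in tags) + "---\n"
    (out / "README.md").write_text(card, encoding="utf-8")
    return out


def push_private(folder, repo_id):
    """Create a private dataset repo if it does not exist and upload the folder."""
    # Imported lazily: needs `pip install huggingface_hub` and a logged-in token.
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", private=True, exist_ok=True)
    api.upload_folder(folder_path=str(folder), repo_id=repo_id, repo_type="dataset")
```

Usage would be something like `push_private(write_dataset_folder(records, "out", ["foyle-logs"]), "your-user/your-dataset")`, where the repo id is a placeholder.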

Comment

Hi there, cool repo. I think it's actually the start of a tool chain that will probably become mainstream in many, if not most, workflows. I'm really looking forward to deploying this and seeing how it could be used to serve validation pipelines for contributions to common libraries in our open source community.

jlewi commented 3 months ago

@Josephrp Can you explain what you mean by "the data created is not secured"? By default the data is stored on your local filesystem, so if you wanted to push that to Hugging Face you could.

If you put your logs somewhere else (e.g. Datadog, Cloud Logging, etc.) then you'd inherit that system's security features. For example, with Cloud Logging you can use IAM to restrict who has access.

Can you describe your use case for HF private datasets?

Josephrp commented 3 months ago

My idea: say you have ten users. Instead of each having, say, 200 datapoints on their own, if they choose to make their published data public (using the template provided), it becomes much easier to discover and aggregate using tags. That means you can get a 2000-datapoint dataset with one line of code, which helps downstream, for example for further data processing or finetuning.
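
To make the aggregation idea concrete, here is a minimal sketch. It assumes every contributor publishes a public dataset with a shared tag (`foyle-logs` is a made-up example) and a `train` split; the function names are hypothetical. Discovery goes through `HfApi.list_datasets` and loading through `datasets.load_dataset`, both imported lazily so the merge step can be tried offline with a fake loader.

```python
def aggregate(repo_ids, load_rows):
    """Merge rows from every repo into one list, recording where each row came from."""
    combined = []
    for repo_id in repo_ids:
        for row in load_rows(repo_id):
            combined.append({**row, "source_repo": repo_id})
    return combined


def find_repos_by_tag(tag):
    """Return the ids of Hub datasets that carry the given tag."""
    # Needs `pip install huggingface_hub`; only public repos are visible to others.
    from huggingface_hub import HfApi

    return [info.id for info in HfApi().list_datasets(filter=tag)]


def load_train_rows(repo_id):
    """Load the train split of one Hub dataset; iterating it yields dict rows."""
    # Needs `pip install datasets`.
    from datasets import load_dataset

    return load_dataset(repo_id, split="train")
```

With those helpers, the "one line" would be roughly `rows = aggregate(find_repos_by_tag("foyle-logs"), load_train_rows)`.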