instructlab / taxonomy

Taxonomy tree that will allow you to create models tuned with your data
Apache License 2.0

Proposal for automated checks of knowledge submission #658

Closed katesoule closed 1 week ago

katesoule commented 6 months ago

In order to release knowledge contributions to the open source community, IBM legal is asking us to propose an automated solution for our generation pipeline that covers:

On the provided knowledge source from the contributor:

On the synthetic data created from the knowledge source

lehors commented 6 months ago

What's HAP? I proposed a PR template for the taxonomy repo (see taxonomy PR 27) and was considering adding a question along the lines of:

Can you confirm that your contribution is original, not based on data from another system, and not encumbered by any copyright or other IP restrictions?

But I'm sure a lawyer would be able to get the right words in. Obviously this isn't bulletproof, but it might be worth adding?

katesoule commented 6 months ago

HAP is Hate, Abuse and Profanity. We will want this actually tested when people are submitting long documents that can't be fully read by the reviewers, and not just have people check a box.

katesoule commented 6 months ago

Legal is coming up with the right wording, it will most likely be added through a CLA.

lehors commented 6 months ago

HAP is Hate, Abuse and Profanity. We will want this actually tested when people are submitting long documents that can't be fully read by the reviewers, and not just have people check a box.

Thanks. Speaking of which, shouldn't we set some scope for what kind of contributions would be welcome? For one thing, I assume that, for now at least, we are limiting contributions to the English language, right? I think we should make that clear in the documentation so that people are aware.

lehors commented 6 months ago

Legal is coming up with the right wording, it will most likely be added through a CLA.

CLAs are a huge deterrent to contributions...

darrellreimer commented 6 months ago

@katesoule I had been thinking we would be doing these checks on the server side, not in the CLI, so this would be transparent to users of LMDK. We (the ones building the models and taking knowledge in) would run the HAP, PII, copyright, and similar checks, and provide a reference implementation to show how and what we're doing.

The fact that IBM legal is recommending we do this isn't really visible to anyone; it's just a reasonable practice for us.
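
As a rough illustration of the server-side approach described above, here is a minimal sketch of a check runner that could be applied to both a submitted knowledge document and the synthetic data generated from it. Nothing in it comes from an actual InstructLab implementation: the names `Finding`, `check_pii`, `check_hap`, and `run_checks` are hypothetical, the email regex and one-word blocklist are placeholders for real PII and HAP detectors, and a copyright check is left out because the thread does not say how one would work.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Placeholder detectors (assumptions): a real pipeline would use vetted
# PII and HAP models or lexicons instead of these toy patterns.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BLOCKLIST = {"badword"}


@dataclass
class Finding:
    check: str   # which check raised the finding, e.g. "pii" or "hap"
    detail: str  # the offending text span


def check_pii(text: str) -> list[Finding]:
    """Flag email-like strings as a stand-in for a PII detector."""
    return [Finding("pii", m.group(0)) for m in EMAIL_RE.finditer(text)]


def check_hap(text: str) -> list[Finding]:
    """Flag blocklisted words as a stand-in for a HAP classifier."""
    words = re.findall(r"[a-z']+", text.lower())
    return [Finding("hap", w) for w in words if w in BLOCKLIST]


# Registry of checks; more checks can be appended without touching run_checks.
CHECKS: list[Callable[[str], list[Finding]]] = [check_pii, check_hap]


def run_checks(text: str) -> list[Finding]:
    """Run every registered check over the text and collect all findings."""
    findings: list[Finding] = []
    for check in CHECKS:
        findings.extend(check(text))
    return findings
```

Because the runner only takes text and returns findings, it would sit on the intake side and stay invisible to CLI users; a submission with an empty findings list would pass, and anything else could be routed to a reviewer.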

xukai92 commented 6 months ago

Seconding @darrellreimer; this seems like too much for the LMDK/CLI side.

hickeyma commented 5 months ago

@katesoule Following the feedback from @darrellreimer and @xukai92, can this issue be closed as not applicable to CLI?

tuhinsharma121 commented 5 months ago

yeah. second to @darrellreimer

anik120 commented 5 months ago

Putting this on hold since it's unclear if there's anything that needs to be done in the CLI. My initial read is "no".

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 31 days.

juliadenham commented 1 week ago

Closing as we now work with Red Hat legal.