responsible-ai-collaborative / aiid

The AI Incident Database seeks to identify, define, and catalog artificial intelligence incidents.
https://incidentdatabase.ai

Mass publication of CSET taxonomy annotations #2926

Open mh2171 opened 1 week ago

mh2171 commented 1 week ago

Hi there,

CSET would like to publish a large batch of incident annotations from the CSETv1 taxonomy at once. This involves changing one field ("Publish") from 'no' to 'yes' for ~250 incident annotations. Could you help with this big update?

Here are the criteria for publication: we want to publish only annotations that have reached a minimum level of annotation. There is one field we can use to filter: 1.3. Annotation Status. There are 6 possible responses to this field:

  1. Annotation in progress
  2. Initial annotation complete
  3. In peer review
  4. Peer review complete
  5. In quality control
  6. Complete and final

We'd like annotations with status 2 and above to be published. Annotations with a missing value or level 1 should not be published.
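
The status filter described above can be sketched as follows. This is only an illustration: it assumes the field value is stored as the full label (for example "4. Peer review complete"), and the helper names `status_level` and `meets_publication_criteria` are hypothetical.

```python
import re
from typing import Optional

MIN_PUBLISH_LEVEL = 2  # statuses 2-6 qualify; status 1 and missing values do not


def status_level(status: Optional[str]) -> Optional[int]:
    """Return the leading number of a label such as '4. Peer review complete'."""
    if not status:
        return None
    match = re.match(r"\s*([1-6])\.", status)
    return int(match.group(1)) if match else None


def meets_publication_criteria(status: Optional[str]) -> bool:
    """True when the annotation status is 2 or above; False for 1 or missing."""
    level = status_level(status)
    return level is not None and level >= MIN_PUBLISH_LEVEL
```

Parsing the leading number, rather than comparing whole strings, keeps the check working if the label wording changes slightly.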

Preference should be given to reconciled incident annotations. This means that where available, only the CSETv1 Taxonomy Classifications should be published, and not the corresponding individual annotations (those belonging to CSETv1_Annotator-1, CSETv1_Annotator-2, and CSETv1_Annotator-3).

When the annotations have not been reconciled, meaning there is no CSETv1 annotation for a given incident number, or when its annotation status does not meet our criteria, the CSETv1_Annotator-1, CSETv1_Annotator-2, and CSETv1_Annotator-3 annotations linked to the incident should be published instead, provided they have status 2 or above in field 1.3.

Thanks! Mia

kepae commented 1 week ago

Hey Mia, this is great. Thanks for all of the details!

@pdcp1 do you have cycles for this? Because we're likely doing some string comparisons to get which classifications to publish, we should first query and group the CSETv1 annotations + _Annotator annotations by the above status, then write the migration that flips them to "publish".
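
A minimal sketch of that query-and-group step, assuming the classifications have already been loaded as plain dictionaries. The keys `namespace`, `annotation_status`, and `publish` are hypothetical stand-ins for however the real documents store these values.

```python
from collections import Counter


def count_by_namespace_and_status(classifications):
    """Count classifications per (namespace, annotation status, publish flag)."""
    counts = Counter()
    for c in classifications:
        key = (
            c["namespace"],
            c.get("annotation_status") or "(missing)",  # make missing values visible
            c.get("publish", "no"),
        )
        counts[key] += 1
    return counts


# Hypothetical sample records; the real documents may be shaped differently.
sample = [
    {"namespace": "CSETv1", "annotation_status": "4. Peer review complete", "publish": "no"},
    {"namespace": "CSETv1", "annotation_status": "4. Peer review complete", "publish": "no"},
    {"namespace": "CSETv1_Annotator-2", "annotation_status": "1. Annotation in progress", "publish": "no"},
    {"namespace": "CSETv1_Annotator-2", "annotation_status": None, "publish": "no"},
]

for (namespace, status, publish), quantity in sorted(count_by_namespace_and_status(sample).items()):
    print(f"{namespace}\t{status}\t{publish}\t{quantity}")
```

Grouping first gives a table that can be checked before any migration flips the publish field.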

pdcp1 commented 1 week ago

@mh2171 Could you please confirm whether these numbers make sense to you? If I understood correctly, these are the items for which we should set "publish: yes". Is that correct, or do we have to publish any other documents?

| Annotation Status | Namespace | Publish | Quantity |
| --- | --- | --- | --- |
| 1. Annotation in progress | CSETv1 | no | 0 |
| 2. Initial annotation complete | CSETv1 | no | 0 |
| 3. In peer review | CSETv1 | no | 7 |
| 4. Peer review complete | CSETv1 | no | 119 |
| 5. In quality control | CSETv1 | no | 3 |
| 6. Complete and final | CSETv1 | no | 17 |

kepae commented 1 week ago

Thanks @pdcp1. This helps.

Question for @mh2171 -- would you like us to publish the CSETv1-Annotator- data under that namespace, where appropriate? Or, should we move those annotations to the standard CSETv1 namespace and publish them there?

I can understand wanting to keep them officially in the Annotator collections since they are intermediate, but we have some logic that hides these classifications from the site in some places. On the other hand, moving them officially under CSETv1 would unify them with the collection for data analysis and download.

mh2171 commented 1 week ago

@pdcp1: Yes, all of these should be published. Technically, we would not want those with status "1. Annotation in progress" to be published, but since there are 0 for the CSETv1 namespace it doesn't make a difference here.

@kepae: Generally it would be nice to keep the _Annotator annotations separated, seeing as they're not technically done. But I can see that that would be more complicated for you. For some incidents there are going to be 2 _Annotator annotations; I'm not sure how many incidents that applies to. How would you handle moving them under the CSETv1 namespace then?

pdcp1 commented 1 week ago

@mh2171 Here are the more detailed quantities for each namespace.

CSETv1

| Annotation Status | Namespace | Publish | Quantity |
| --- | --- | --- | --- |
| 1. Annotation in progress | CSETv1 | no | 0 |
| 2. Initial annotation complete | CSETv1 | no | 0 |
| 3. In peer review | CSETv1 | no | 7 |
| 4. Peer review complete | CSETv1 | no | 119 |
| 5. In quality control | CSETv1 | no | 3 |
| 6. Complete and final | CSETv1 | no | 17 |
| Total | | | 146 |

CSETv1_Annotator-1

| Annotation Status | Namespace | Publish | Quantity |
| --- | --- | --- | --- |
| 1. Annotation in progress | CSETv1_Annotator-1 | no | 0 |
| 2. Initial annotation complete | CSETv1_Annotator-1 | no | 212 |
| 3. In peer review | CSETv1_Annotator-1 | no | 12 |
| 4. Peer review complete | CSETv1_Annotator-1 | no | 72 |
| 5. In quality control | CSETv1_Annotator-1 | no | 0 |
| 6. Complete and final | CSETv1_Annotator-1 | no | 1 |
| Total | | | 297 |

CSETv1_Annotator-2

| Annotation Status | Namespace | Publish | Quantity |
| --- | --- | --- | --- |
| 1. Annotation in progress | CSETv1_Annotator-2 | no | 8 |
| 2. Initial annotation complete | CSETv1_Annotator-2 | no | 51 |
| 3. In peer review | CSETv1_Annotator-2 | no | 31 |
| 4. Peer review complete | CSETv1_Annotator-2 | no | 16 |
| 5. In quality control | CSETv1_Annotator-2 | no | 0 |
| 6. Complete and final | CSETv1_Annotator-2 | no | 0 |
| Total | | | 106 |

CSETv1_Annotator-3

| Annotation Status | Namespace | Publish | Quantity |
| --- | --- | --- | --- |
| 1. Annotation in progress | CSETv1_Annotator-3 | no | 7 |
| 2. Initial annotation complete | CSETv1_Annotator-3 | no | 50 |
| 3. In peer review | CSETv1_Annotator-3 | no | 2 |
| 4. Peer review complete | CSETv1_Annotator-3 | no | 21 |
| 5. In quality control | CSETv1_Annotator-3 | no | 0 |
| 6. Complete and final | CSETv1_Annotator-3 | no | 0 |
| Total | | | 80 |

mh2171 commented 6 days ago

So as a primary rule, for every namespace the following holds: do not publish when Annotation Status is "1. Annotation in progress" or missing (there don't seem to be any with a missing annotation status). Publish otherwise.

As a secondary rule, for namespaces with _Annotator the following holds: do not publish when, for the given incident number, there exists a CSETv1 annotation that is publishable under rule #1. Otherwise, publish all corresponding _Annotator annotations that meet the first rule.

I expect there are going to be incidents for which there is no CSETv1 annotation and multiple _Annotator annotations. This is why simply moving them to the CSETv1 namespace is not as straightforward, because we would need to decide which _Annotator annotation takes priority. I'm guessing there are ~25 of those, but I am not sure.
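
The two rules above could be sketched roughly as follows. This assumes each classification is a dictionary with hypothetical keys `incident_id`, `namespace`, and `annotation_status`; the real storage format may differ, and the function names are illustrative only.

```python
import re
from collections import defaultdict

RECONCILED = "CSETv1"
ANNOTATOR_PREFIX = "CSETv1_Annotator"


def passes_rule_1(c):
    """Rule 1: publishable only when the status is 2-6 (not 1, not missing)."""
    match = re.match(r"\s*([1-6])\.", c.get("annotation_status") or "")
    return match is not None and int(match.group(1)) >= 2


def select_for_publication(classifications):
    """Return the classifications to publish under rules 1 and 2."""
    by_incident = defaultdict(list)
    for c in classifications:
        by_incident[c["incident_id"]].append(c)

    selected = []
    for items in by_incident.values():
        reconciled = [c for c in items if c["namespace"] == RECONCILED and passes_rule_1(c)]
        if reconciled:
            # Rule 2: a publishable CSETv1 annotation suppresses the _Annotator ones.
            selected.extend(reconciled)
        else:
            selected.extend(
                c for c in items
                if c["namespace"].startswith(ANNOTATOR_PREFIX) and passes_rule_1(c)
            )
    return selected
```

Incidents with several qualifying _Annotator annotations and no publishable CSETv1 annotation get all of them selected here, which is exactly the case that needs a priority decision before any move to the CSETv1 namespace.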

kepae commented 6 days ago

Thanks @mh2171. There's no technical problem migrating particular annotations to the final namespace of CSETv1. As you point out, the problem is where there are multiple -Annotator annotations that compete with one another. :-)

My only real product concern is that publishing the -Annotator classifications "as-is" alongside the final CSETv1 taxonomy, GMF taxonomy, and others may create unwanted confusion, or make the -Annotator annotations appear "final" on incident pages and elsewhere. For example, Incident 22 already has both CSETv0 and v1 classifications applied. I believe showing the -Annotator data would be redundant and confusing to users of the CSETv1 and overall incident data. That said, I can see the case for wanting users to have the data available in the database downloads.

How about for now, we

  1. Publish all CSETv1 annotations with "Annotation Status" 2-6, without delay;
  2. Separately, reconcile which particular -Annotator classifications we want to make available, especially where they contradict;
  3. And determine whether to migrate/copy them to CSETv1 as final or find a good way to make them available as -Annotator taxonomy data. For example, we could simply hide the -Annotator taxonomies from all interfaces except data downloads.

mh2171 commented 4 days ago

I completely agree with what you're saying, and I think we are aiming for the same thing. I may have been unclear earlier. We only want to publish the _Annotator annotations where there isn't already a final CSETv1 annotation. For incidents that have a final CSETv1 annotation we don't want the _Annotator annotations out there because, as you say, they have not had a final review, are therefore of lower quality, and would lead to confusion when they contradict the CSETv1 annotation.

However, since there are ~100 annotations that only have _Annotator annotations (that don't have a finalized CSETv1 annotation), and this represents around 1/3 of all our annotations, it would seem like a big loss to not publish those. Of this batch, most will only have one _Annotator annotation which can be published as is. Some will have two _Annotator annotations. Ideally we'd like all of them out (provided annotation status is 2-6), but if you prefer to migrate them to the CSETv1 namespace first I'd need to review which of the duplicate _Annotator annotations is kept and which is dropped. I would need a list of the incident numbers for which this applies though (i.e. the incidents without a CSETv1 annotation and with multiple _Annotator annotations with annotation status 2-6).
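
The requested list could be produced along these lines. This is a sketch under assumptions: the keys `incident_id`, `namespace`, and `annotation_status` are hypothetical, and a CSETv1 annotation whose status is 1 or missing is treated as absent, following the rules stated earlier in the thread.

```python
import re
from collections import defaultdict


def status_ok(c):
    """True when the annotation status is 2-6."""
    match = re.match(r"\s*([1-6])\.", c.get("annotation_status") or "")
    return match is not None and int(match.group(1)) >= 2


def incidents_needing_review(classifications):
    """Map incident id -> _Annotator namespaces, for incidents that have no
    qualifying CSETv1 annotation but several qualifying _Annotator ones."""
    reconciled = set()
    annotators = defaultdict(list)
    for c in classifications:
        if not status_ok(c):
            continue
        if c["namespace"] == "CSETv1":
            reconciled.add(c["incident_id"])
        elif c["namespace"].startswith("CSETv1_Annotator"):
            annotators[c["incident_id"]].append(c["namespace"])
    return {
        incident_id: sorted(namespaces)
        for incident_id, namespaces in sorted(annotators.items())
        if incident_id not in reconciled and len(namespaces) > 1
    }
```

The result lists, per incident number, which _Annotator namespaces compete, so a reviewer can decide which one to keep.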

mh2171 commented 4 days ago

And I very much like option 3: hiding the -Annotator taxonomies from all interfaces except data downloads!