hil-se / fairness


Funding #10

Open azhe825 opened 2 years ago

azhe825 commented 2 years ago

https://www.nsf.gov/pubs/2022/nsf22526/nsf22526.htm?WT.mc_ev=click&WT.mc_id=USNSF_33&utm_medium=email&utm_source=govdelivery

azhe825 commented 2 years ago

Review 1 Rating:

Fair

Review: Summary

In the context of the five review elements, please evaluate the strengths and weaknesses of the proposal with respect to intellectual merit.

Proposal summary: This application proposes to develop a machine learning model (MLM) to detect human bias in decision-making, its degree and direction, and which human decisions are biased (incorrectly labeled data points). Prototype software will be developed that implements the MLM and provides feedback to humans on potential bias in their decisions.

Strengths:
• Development of an MLM that automatically detects bias in human decision-making is a worthy goal.
• It is possible to test, with absolute certainty, the performance of the MLM when using synthetic data because the ground truth of the injected bias is known.

Weaknesses:
• This proposal often reads as circular. After reading it several times, I can see where results of the work might be useful.
• This is a problematic statement: “Bias exists in human decisions if-and-only-if the machine learning model trained on these decisions is measured as biased in terms of the fairness metric.” There are no circumstances where human decisions won’t have bias. The proposal should have included discussion about thresholds of usefulness of the fairness metric for applied use.
• The proposal hinges on the assumption of “correct” labels and the ability to know or detect ground truth of these correct labels.
• Section 7.1.3 states that a challenge of using real-world data is keeping the test set unbiased. Indeed, the likelihood that real-world data are unbiased is vanishingly small, if ever.
• In the same section the proposal suggests that with great enough effort, domain experts can create unbiased labeling of data. However, in section 5.3 Human Decision Fairness, the proposal discusses inter-annotator agreements to estimate ground truth using a plurality of annotators to obtain reliable labeling of data. This is not the same as unbiased data.
• Section 7.1.3 Real World Biased Datasets is a failure for this proposal, as it makes the assumption that the PI knows the ground truth of bias in decision-making for members of the admission and search committees by virtue of membership on those committees. The sizes of the data sets for student admission and faculty search data are not disclosed.
• Task 3 states that “Future bias can be avoided or alleviated”. “Alleviated” is problematic and so is the word “eliminate” from page 5. Ultimately, the algorithm will be compared to estimates of ground truth, not ground truth.
• Task 6 is under-described and vague. It amounts to “we will develop software and find someone with data to test it”. That isn’t really a plan. Identifying actual real-world partners and describing the software development approach would have strengthened the proposal. Children’s Institute and the T-CRS scale are mentioned later but the description is still vague.
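To make two of the points above concrete (a known ground truth for injected bias, and agreement between annotators not implying unbiased labels), here is a minimal sketch. It is not the proposal's method: the proposal's fairness metric is not given in this thread, so the sketch substitutes a simple difference in positive-label rates between two groups, and every function name, the dataset, and the 0.1 threshold are hypothetical choices made only for illustration.

```python
def make_clean(n):
    """Hypothetical synthetic data: (group, label) pairs in which both
    groups have the same positive rate, so the ground truth is unbiased."""
    return [(i % 2, (i // 2) % 2) for i in range(n)]


def inject_bias(records, flip_every):
    """Flip every `flip_every`-th positive label of group 1 to negative.
    Because we control the flips, the injected bias is known exactly."""
    out, seen = [], 0
    for group, label in records:
        if group == 1 and label == 1:
            seen += 1
            if seen % flip_every == 0:
                label = 0
        out.append((group, label))
    return out


def parity_difference(records):
    """Positive-label rate of group 0 minus that of group 1.
    This is a stand-in metric, assumed for this sketch only."""
    rates = []
    for g in (0, 1):
        labels = [label for group, label in records if group == g]
        rates.append(sum(labels) / len(labels))
    return rates[0] - rates[1]


def is_flagged(records, threshold):
    """Flag labels as biased only when the metric exceeds a threshold,
    instead of treating any nonzero value as bias."""
    return abs(parity_difference(records)) > threshold


def agreement(a, b):
    """Fraction of records on which two annotators assign the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


clean = make_clean(400)
annotator_a = inject_bias(clean, 2)  # two annotators sharing the same bias
annotator_b = inject_bias(clean, 2)

print("clean difference:", parity_difference(clean))
print("biased difference:", parity_difference(annotator_a))
print("flagged at 0.1:", is_flagged(annotator_a, 0.1))
print("annotator agreement:", agreement(annotator_a, annotator_b))
```

In this sketch the two annotators agree on every record while both sets of labels are still flagged, which is the distinction between reliable and unbiased labeling raised above. How a threshold should actually be chosen for an applied setting is left open here, as it is in the review.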

Human & Vertebrate Subjects (if applicable): • Appropriate

Budget: • Appropriate

In the context of the five review elements, please evaluate the strengths and weaknesses of the proposal with respect to broader impacts.

Strengths: • Lofty goals for six different use cases.

Weaknesses:
• The current proposal is basically asking a human to defer to an algorithm implemented in a software prototype, and there is no discussion of delivering transparency to end users who are making decisions as to how the system is making recommendations. In this, the software will be telling the end user “You are biased. Use me instead and trust that I am unbiased”. The proposal would have benefited from a user-engagement plan to understand what information should be delivered with the algorithm for transparency in how recommendations/determinations are generated.
• The proposal overlooks that some people don’t want to make fair decisions and are intentionally biased. This is a quibbling and trivial point but worth mentioning because there are current political and legal efforts nationally and internationally to intentionally skew outcomes for specific groups of people.

Postdoctoral Mentoring Plan (if applicable): • N/A

Please evaluate the strengths and weaknesses of the proposal with respect to any additional solicitation-specific review criteria, if applicable

How effectively does the proposal address transformative research responsive to the goals of the program (i.e., fundamental computer science research, incorporating multiple disciplinary perspectives as needed and appropriate to the scope of the work, and contributing to fairness)? • Somewhat contributes to fairness, light on multiple disciplinary perspectives

How effectively does the proposal embed innovations in real systems and address dissemination for broad and timely access to such advances? • Mentions embedding but somewhat vague as to how that will happen

How effectively does the proposal evaluate progress and outcomes? • Task 5 evaluation approach of comparing domain expert evaluations against machine learning algorithm performance is a traditional approach. However, what if the algorithm never outperforms existing baseline methods as is the stated requirement for success? • Task 6 not well described

Summary Statement

Responsiveness to the Solicitation: • Responsive

Justification for NSF Rating: • Proposal has some weaknesses but the proposed topic area is of interest and could be better developed for future proposals. Some areas of the proposal are difficult to read.

Review 2 Rating:

Good

Review: Summary

In the context of the five review elements, please evaluate the strengths and weaknesses of the proposal with respect to intellectual merit.

+Strength The proposed research is novel in that it aims to develop a strategy to utilize ML model biases to inform humans of biases, which will in turn help humans make decisions with less or no bias. The approach contrasts with the bulk of existing research that focuses directly on detecting and mitigating the bias in ML algorithms. The research team further proposes to develop cloud-based prototype software that implements the algorithm using real-world applications.

-Weakness The proposed algorithm will be validated on real-world data with bias. It is unclear how the research team will make sure that the real-world data annotated by the Amazon Mechanical Turk (AMT) crowdsourcing workers has bias as intended. It is also unclear how to make sure that the group of workers who annotate the data initially and the group who validate the proposed strategy are homogeneous. Further, while the proposed strategy relies on domain experts’ opinions to keep the real-world test set unbiased, who these experts will be and how to identify/recruit them are not described.

Evaluation Plan: Evaluation plan is not adequate due to the lack of details on the strategies to identify/recruit domain experts.

Collaboration Plan: No issues identified.

Data Management: No issues identified.

In the context of the five review elements, please evaluate the strengths and weaknesses of the proposal with respect to broader impacts.

+Strength The proposed strategy is validated using at least one real-world dataset, and it is plausible that the proposed strategy can work in other fields. The project, if successful, can transform the way AI/ML systems are used in any type of human decision making while mitigating bias and discrimination in human/ML-based decision making. With the proposed cloud-based prototype software, the research team could effectively disseminate the developed strategy not only in academia but to a wider audience.

-Weakness The proposal is silent about how the project effectively accommodates underrepresented students in their research.

Please evaluate the strengths and weaknesses of the proposal with respect to any additional solicitation-specific review criteria, if applicable

+Strength The research considers various fields for applications of the developed strategy.

-Weakness A need for input from domain experts is stated but specific involvement of such experts is not described in the proposal.

Summary Statement

The proposed research has a novel approach to mitigating the impact of bias generated by an ML-based system by utilizing the detected bias as an input to human decision making. If successful, this will bring a paradigm shift in the way that AI systems interact with human decision making. The potential contribution of the research to society is thus significant. The proposal could have been improved further if specific domain experts had already been identified and included in the research.

Human & Vertebrate Subjects (if applicable): N/A

Postdoctoral Mentoring Plan (if applicable): N/A

Budget: Looks appropriate.

Results from prior NSF support (if applicable): No issues identified.

Review 3 Rating:

Fair

Review: Summary

In the context of the five review elements, please evaluate the strengths and weaknesses of the proposal with respect to intellectual merit.

Strengths A key aim of the proposal is to enable AI systems to identify when humans might make biased decisions. The focus on annotator performance is an area that warrants examination.

The proposed project might build on the team's ongoing FairBalanceClass work and efforts to develop fairness metrics.

The proposal would bring together an interdisciplinary team with experience in ML and psychology. The expertise from the realm of human factors may be particularly helpful in increasing the likelihood of project success.

The team would likely have access to sufficient facilities to undertake the project.

Weaknesses It is difficult to evaluate the merits of the team's approach without more specific information about the datasets that will be studied. For example, the proposal describes some categories of "Unbiased Datasets" but it is not directly indicated which types might actually be used. The sources and the quality of the "real world biased datasets" are to some degree unknown based on what is stated in the proposal. Because of that, it is difficult to assess the likelihood of success in acquiring the data the team wants to analyze.

It is unclear from the proposal whether the team leadership has an established track record of collaboration and thus difficult to assess how well team members might work together. A fuller, more detailed collaboration plan might have helped to address that concern.

In the context of the five review elements, please evaluate the strengths and weaknesses of the proposal with respect to broader impacts.

Strengths If successful, project findings could have important implications for many sectors including finance and criminal justice. A reliable method for detecting potential human bias could have widespread applications.

The team will seek to develop a tool that could help at-risk school children, which if successful, would be an important contribution.

Weaknesses The proposal mentions that the team will seek to develop collaborations with industry but it is not clear how likely those collaborations are or with whom they would be.

Without more directly specifying which organizations the team might reach out to, besides the Children’s Institute, it is difficult to assess the extent to which the project might tangibly benefit particular communities.

There is rather little detail in the broader impacts sections regarding specific deliverables, such as publications or presentations, and whether and how this project might impact educational efforts.

Please evaluate the strengths and weaknesses of the proposal with respect to any additional solicitation-specific review criteria, if applicable

Strengths The DMP is largely adequate; it describes some of the main data types and storage systems.

The collaboration plan describes the individual roles of the team leadership.

Weaknesses The evaluation plan is fairly limited in detail; the team will rely on domain experts for some of the relevant tasks but the possible evaluation measures/metrics should be more directly specified. The proposal also indicates that success might be measured by whether industry partners are satisfied with fairness improvements but that is difficult to assess without more context, including who those partners might be.

If the team intends to collect "qualitative survey results and other human-originated data", then a fuller description of data privacy protections should have been included in the DMP.

(N/A - no postdoc mentoring plan)

Summary Statement

The approach described in the proposal has some promise as a means for mitigating human bias but key components of the team's research plan lack sufficient detail including which industry partners might benefit from the project's findings.

Review 4 Rating:

Multiple Rating: (Good/Fair)

Review: Summary

In the context of the five review elements, please evaluate the strengths and weaknesses of the proposal with respect to intellectual merit.

The proposed effort seeks to reduce bias in human decisions in human-centered systems. This application has 3 focus areas: detect bias in human decisions, inform the human decision maker of the degree and direction of bias, and correct biased decisions.

Strengths:

In the context of the five review elements, please evaluate the strengths and weaknesses of the proposal with respect to broader impacts.

Strengths:

Weaknesses:

Please evaluate the strengths and weaknesses of the proposal with respect to any additional solicitation-specific review criteria, if applicable

Transformative: Yes - detecting and reducing human bias in human-centered systems is very important

Embed Innovation in Real System: No thorough plan other than targeting application domains

Evaluation plan: The evaluation plan is reasonable

Summary Statement

The application seeks to address an important problem of understanding and mitigating bias in human-centered systems. The approach is sound and innovative. However, the proposal does not offer a reasonable plan or strategy for embedding the innovation into real systems. The broader impact also lacks specificity.

azhe825 commented 2 years ago

https://www.rit.edu/research/srs/staff/scott-miller