ulab-uiuc / research-town

A platform for AI researcher community
http://docs.auto-research.dev
Apache License 2.0

[FEAT]: Human Evaluation for Generation Quality #79

Closed chengzr01 closed 2 months ago

chengzr01 commented 3 months ago

Description

Insightful

Additional Information

No response

chengzr01 commented 3 months ago

Human Evaluation. We perform human evaluations in which participants assign a score for each criterion to ideas, papers, and reviews. Because the generated content is knowledge-intensive, it is crucial to select participants who are well-versed in the field and to provide them with content relevant to their area of expertise.

Participants and Procedure. We recruited 10 participants (5 identified as male, 5 identified as female) from R1 universities in the United States, each of whom has authored at least three papers. Each participant was provided with a 6-page guideline document that includes the task instructions and annotation examples. On average, each participant evaluated 3 sets of research papers and reviews, with each set comprising three sub-ideas from three different approaches. Participants were paid $20 per hour.

Induced Criteria. To align model-based evaluations with human preferences, we induce the criteria used for automatic evaluation from actual human judgments.

Criteria for Human Evaluation [TBD]

| Types | Criteria | Texts |
| --- | --- | --- |
| Idea | Novelty | How original and unique is the idea? Does it introduce a new perspective or significant advancement compared to existing methods? How does it align with or diverge from the innovations highlighted in the trend? |
| | Technical Depth | Assess the technical rigor of the idea. Does it include solid theoretical foundations, robust algorithms, and detailed methodologies? Is the technical depth in line with the state-of-the-art techniques noted in the trend? |
| | Impact and Significance | Evaluate the potential impact of the idea on the ML community and beyond. How significant is its contribution to advancing the field? Does it address high-impact problems or gaps identified in the trend? |
| | Feasibility and Practicality | Assess the feasibility of implementing the idea. Is it practically applicable in real-world scenarios? Does it consider efficiency and scalability, in line with the practical application focus of the trend? |
| | Theoretical Foundation and Conceptual Soundness | Evaluate the theoretical foundation and conceptual soundness of the idea. Are the underlying principles well-defined and logically consistent? Does the idea demonstrate a deep understanding of relevant theories and concepts? How does it contribute to advancing theoretical understanding within the field? |
| | Clarity and Presentation | Assess the clarity, organization, and presentation quality of the idea. Is the idea communicated effectively, adhering to high presentation standards seen in top-tier ML conferences? |
| | Potential for Real-world Applications | Evaluate the potential of the idea to be applied in real-world scenarios. How applicable is it in practical settings and industry contexts? Does it address real-world problems or challenges identified in the trend? |
| | Innovation Potential | Assess the potential of the idea to inspire further research and innovation within the ML community. Does it open up new avenues for research or provide a novel framework aligning with the emerging trends and future directions of the trend? |
| | Ethical Considerations | Consider the ethical implications and societal impact of the idea. Does it adhere to the growing emphasis on ethical AI and responsible ML practices as highlighted in the trend? |
| | Interdisciplinary Connections | Evaluate the potential for the idea to connect with and contribute to other disciplines beyond ML. Does it align with the trend of interdisciplinary research and collaboration, integrating with fields such as data science, neuroscience, or social sciences? |
| Paper | Title Appeal | Does the title grab attention and generate interest? Is it informative and reflective of the paper's content? |
| | Abstract Quality | How well does the abstract summarize the paper? Is it clear, concise, and informative? Does it effectively convey the significance and main contributions of the paper? |
| | Title and Abstract Consistency | How well do the title and abstract align with each other? Do they accurately represent the core idea and content of the paper? |
| | Literature Review and Background | Assess the thoroughness of the literature review and background provided. Are the context and relevance of the research well-established? Does it cover key works and current trends in the field? |
| | Methodology | Evaluate the soundness and appropriateness of the methodology used. Are the research design and methods clearly described and justified? Is the methodology robust and suitable for addressing the research questions? |
| | Results and Analysis | Assess the quality and clarity of the results presented. Are the results well-analyzed and interpreted? Do the findings support the claims made in the paper? |
| | Clarity and Presentation | Evaluate the clarity, organization, and presentation quality of the paper. |
| | Contribution to the Field | Evaluate the significance of the paper's contributions to the field. Does it advance knowledge or offer new insights? How does it compare to existing works in terms of impact? |
| | Ethical Considerations | Consider the ethical implications and societal impact of the work. Does it adhere to ethical guidelines and responsible research practices? Are potential negative consequences or biases addressed? |
| | Interdisciplinary Connections | Evaluate the potential for the work to connect with and contribute to other disciplines. Does it integrate knowledge from other fields or offer insights relevant to them? How well does it align with the trend of interdisciplinary research and collaboration? |
| Review | Summarization | Summarize the paper's motivation, key contributions, and achievements in a paragraph. Note any misunderstandings that need to be addressed in the author response. |
| | Strengths | Describe the strengths of the work. Typical criteria include soundness of the claims (theoretical grounding, empirical evaluation), significance and novelty of the contribution, and relevance to the community. |
| | Weaknesses | Explain the limitations of this work along the same axes as above. |
| | Correctness | Are the claims and methods correct? Is the empirical methodology correct? Explain if there is anything incorrect with the paper. Incorrect claims or methodology are the primary reason for rejection. Be as detailed, specific, and polite as possible. Thoroughly motivate your criticism so that authors will understand your point of view and potentially respond to you. |
| | Clarity | Is the paper well written? Rate the clarity of the exposition of the paper. Give examples of what parts of the paper need revision to improve clarity. |
| | Relation to prior work | Is it clearly discussed how this work differs from previous contributions? Explain whether the submission is written with due scholarship, relating the proposed work to the prior work in the literature. The related work section should not just list prior work, but explain how the proposed work differs from prior work that appeared in the literature. |
| | Reproducibility | Are there enough details to reproduce the major results of this work? Mark whether the work is reasonably reproducible. If it is not, lack of reproducibility should be listed among the weaknesses of the submission. |
| | Impacts and Implications | Have the authors adequately addressed the broader impact of their work, including potential negative ethical and societal implications of their work? |
| | Ethical Considerations | Does the submission raise potential ethical concerns? This includes methods, applications, or data that create or reinforce unfair bias or that have a primary purpose of harm or injury. |

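
Since each participant assigns a score per criterion for ideas, papers, and reviews, the collected annotations could be summarized roughly as follows. This is only a sketch: the record layout, the participant IDs, the sample scores, and the 1-10 scale are all assumptions for illustration, and `mean_score_per_criterion` is a hypothetical helper rather than anything in the research-town codebase.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical annotation records: (participant_id, content_type, criterion, score).
# The IDs, scores, and 1-10 scale are made up for illustration only.
annotations = [
    ("p01", "Idea", "Novelty", 7),
    ("p02", "Idea", "Novelty", 9),
    ("p01", "Paper", "Methodology", 6),
    ("p02", "Paper", "Methodology", 8),
    ("p03", "Review", "Correctness", 5),
]


def mean_score_per_criterion(records):
    """Group scores by (content type, criterion) and average each group."""
    grouped = defaultdict(list)
    for _participant, content_type, criterion, score in records:
        grouped[(content_type, criterion)].append(score)
    return {key: mean(scores) for key, scores in grouped.items()}


summary = mean_score_per_criterion(annotations)
for (content_type, criterion), average in sorted(summary.items()):
    print(f"{content_type:<8}{criterion:<14}{average:.1f}")
```

Averaging is just one possible summary; with only 10 participants, reporting the spread per criterion alongside the mean would probably be worthwhile too.
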
lwaekfjlk commented 3 months ago

thanks! very comprehensive.

lwaekfjlk commented 3 months ago

@ft2023 probably related to the paper writing part.

Monstertail commented 3 months ago
Fairness
    - Rating (1-10):
    - Comments:
    - Are the review scores distributed fairly?
    - Is there a balance in the scoring, without significant bias towards extremely high or low scores?
    - Do the scores reflect a reasonable and unbiased assessment of the paper?

@chengzr01 I added another point for review based on what you commented before. Now we have 10 dimensions.

  1. Review Summarization

    • Rating (1-10):
    • Comments:
    • Does the review accurately summarize the paper's motivation?
    • Are the key contributions and achievements clearly summarized?
    • Are there any misunderstandings that need to be addressed in the author's response?
  2. Strengths

    • Rating (1-10):
    • Comments:
    • Are the strengths of the work clearly described?
    • Are the claims sound, both theoretically and empirically?
    • Is the contribution significant and novel?
    • Is the work relevant to the community?
  3. Weaknesses

    • Rating (1-10):
    • Comments:
    • Are the limitations of the work clearly explained?
    • Are the weaknesses addressed along the same axes as the strengths?
    • Are the criticisms detailed, specific, and polite?
  4. Correctness

    • Rating (1-10):
    • Comments:
    • Are the claims and methods correct?
    • Is the empirical methodology sound?
    • Are there any incorrect claims or methods detailed thoroughly?
    • Is the criticism well-motivated and understandable?
  5. Clarity

    • Rating (1-10):
    • Comments:
    • Is the paper well-written?
    • Is the exposition of the paper clear?
    • What parts of the paper need revision to improve clarity?
  6. Relation to Prior Work

    • Rating (1-10):
    • Comments:
    • Is it clearly discussed how this work differs from previous contributions?
    • Does the submission show due scholarship, relating the proposed work to prior work?
    • Does the related work section explain how the proposed work differs from prior literature?
  7. Reproducibility

    • Rating (1-10):
    • Comments:
    • Are there enough details to reproduce the major results of this work?
    • Is the work reasonably reproducible?
    • If not, are the reproducibility issues listed among the weaknesses?
  8. Impacts and Implications

    • Rating (1-10):
    • Comments:
    • Have the authors adequately addressed the broader impact of their work?
    • Are potential negative ethical and societal implications considered?
  9. Ethical Considerations

    • Rating (1-10):
    • Comments:
    • Does the submission raise potential ethical concerns?
    • Are there methods, applications, or data that create or reinforce unfair bias?
    • Does the work have a primary purpose of harm or injury?
  10. Fairness

    • Rating (1-10):
    • Comments:
    • Are the review scores distributed fairly?
    • Is there a balance in the scoring, without significant bias towards extremely high or low scores?
    • Do the scores reflect a reasonable and unbiased assessment of the paper?
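
The ten dimensions above could be kept as data and rendered into the rating form, so that adding a dimension such as Fairness is a one-entry change. The sketch below is a guess at one possible layout, not the actual prompt template: `REVIEW_RUBRIC` and `render_form` are hypothetical names, and only two of the ten dimensions are filled in to keep it short.

```python
# Dimension names and guiding questions are copied from the list above.
# The dict layout and the function name are assumptions, not project code.
REVIEW_RUBRIC = {
    "Review Summarization": [
        "Does the review accurately summarize the paper's motivation?",
        "Are the key contributions and achievements clearly summarized?",
    ],
    "Fairness": [
        "Are the review scores distributed fairly?",
        "Do the scores reflect a reasonable and unbiased assessment of the paper?",
    ],
}


def render_form(rubric):
    """Render each dimension as a numbered block with rating and comment slots."""
    lines = []
    for index, (dimension, questions) in enumerate(rubric.items(), start=1):
        lines.append(f"{index}. {dimension}")
        lines.append("    - Rating (1-10):")
        lines.append("    - Comments:")
        lines.extend(f"    - {question}" for question in questions)
    return "\n".join(lines)


form = render_form(REVIEW_RUBRIC)
print(form)
```
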
Monstertail commented 3 months ago

Interesting fact: GPT-4o can notice the variance in review scores. See details here. @lwaekfjlk @chengzr01 @ft2023

Fairness

Rating (1-10): 6
Comments: The reviews are generally fair, but there is a notable variance in scoring and critique depth. A more balanced approach across all reviews would enhance fairness.
Monstertail commented 3 months ago

Prompt template is here. Ratings from GPT-4o for Mamba reviewers: Overall Score=67. Dimension Scores=[7, 6, 6, 7, 8, 6, 7, 6, 8, 6]

Review Summarization
    Rating (1-10): 7
    Comments: The reviews generally summarize the paper's motivation, contributions, and achievements well. Reviewers mention key improvements like adaptive SSM parameters, the parallel recurrent algorithm, and the simplified architecture. However, there are some areas, particularly in Review 1, where more detail on certain criticisms could improve the summary.

Strengths
    Rating (1-10): 6
    Comments: The strengths are highlighted, such as the novel improvements over existing SSMs and the efficiency of the proposed methods. However, the extent to which these strengths are detailed varies, with some reviews providing more specific insights than others.

Weaknesses
    Rating (1-10): 6
    Comments: The weaknesses are addressed, but the depth of critique varies. Review 1 provides a thorough list of concerns, while other reviews are less specific. There is room for more balanced and detailed critiques.

Correctness
    Rating (1-10): 7
    Comments: Most claims and methods are correctly identified and evaluated. Reviewers note the empirical soundness and highlight potential issues, but not all reviews scrutinize the methods to the same extent.

Clarity
    Rating (1-10): 8
    Comments: The paper is praised for its clarity and presentation. Most reviews find the exposition clear, though some suggest additional references or more detailed explanations.

Relation to Prior Work
    Rating (1-10): 6
    Comments: The relation to prior work is mentioned but could be more thoroughly discussed. Some reviews note missing references and comparisons to other models, which is crucial for situating the paper within existing research.

Reproducibility
    Rating (1-10): 7
    Comments: The paper provides enough details for reproducibility, but some reviews suggest additional implementation details or code to enhance this aspect further.

Impacts and Implications
    Rating (1-10): 6
    Comments: While the potential impact is noted, particularly in improving sequence modeling efficiency, more discussion on broader implications and potential limitations would strengthen this dimension.

Ethical Considerations
    Rating (1-10): 8
    Comments: Ethical considerations are addressed, with reviewers noting no significant ethical concerns. The assessments are balanced and fair, aligning well with ethical standards.

Fairness
    Rating (1-10): 6
    Comments: The reviews are generally fair, but there is a notable variance in scoring and critique depth. A more balanced approach across all reviews would enhance fairness.
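
The reported overall score of 67 equals the sum of the ten dimension ratings above. A minimal sketch of pulling the ratings out of a response in this format and totaling them might look like the following. Treating the overall score as a plain sum is an inference from the numbers, not something the thread confirms, and `extract_ratings` is a hypothetical helper.

```python
import re

# Matches lines such as "Rating (1-10): 7" in the model's response.
RATING_PATTERN = re.compile(r"Rating \(1-10\):\s*(\d+)")


def extract_ratings(evaluation_text):
    """Return every rating in order of appearance, rejecting out-of-range values."""
    ratings = [int(value) for value in RATING_PATTERN.findall(evaluation_text)]
    for rating in ratings:
        if not 1 <= rating <= 10:
            raise ValueError(f"rating out of range: {rating}")
    return ratings


# Ratings copied from the GPT-4o output above, with the comments omitted.
sample_response = """
Review Summarization
    Rating (1-10): 7
Strengths
    Rating (1-10): 6
Weaknesses
    Rating (1-10): 6
Correctness
    Rating (1-10): 7
Clarity
    Rating (1-10): 8
Relation to Prior Work
    Rating (1-10): 6
Reproducibility
    Rating (1-10): 7
Impacts and Implications
    Rating (1-10): 6
Ethical Considerations
    Rating (1-10): 8
Fairness
    Rating (1-10): 6
"""

dimension_scores = extract_ratings(sample_response)
overall_score = sum(dimension_scores)  # assumed aggregation: plain sum
print(f"Overall Score={overall_score}. Dimension Scores={dimension_scores}")
```
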