Closed: chengzr01 closed this issue 2 months ago
Human Evaluation We perform human evaluations that involve assigning a score for each criterion regarding ideas, papers, and reviews. As the generated content is knowledge-intensive, it is crucial to select participants who are well-versed in the field and to provide them with content that is relevant to their area of expertise.
Participants and Procedure We recruited 10 participants (5 identified as male, 5 identified as female), each of whom has authored at least three papers at an R1 university in the United States. Each participant was provided with a 6-page guideline document, which includes the task instructions and annotation examples. On average, each participant evaluated 3 sets of research papers and reviews, with each set comprising three sub-ideas from three different approaches. They were paid $20 per hour.
Induced Criteria To align model-based evaluations with human preferences, we induce the criteria used for automatic evaluations from actual human judgments.
Criteria for Human Evaluation [TBD]
| Types | Criteria | Texts |
|---|---|---|
| Idea | Novelty | How original and unique is the idea? Does it introduce a new perspective or significant advancement compared to existing methods? How does it align with or diverge from the innovations highlighted in the trend? |
| | Technical Depth | Assess the technical rigor of the idea. Does it include solid theoretical foundations, robust algorithms, and detailed methodologies? Is the technical depth in line with the state-of-the-art techniques noted in the trend? |
| | Impact and Significance | Evaluate the potential impact of the idea on the ML community and beyond. How significant is its contribution to advancing the field? Does it address high-impact problems or gaps identified in the trend? |
| | Feasibility and Practicality | Assess the feasibility of implementing the idea. Is it practically applicable in real-world scenarios? Does it consider efficiency and scalability, in line with the practical application focus of the trend? |
| | Theoretical Foundation and Conceptual Soundness | Evaluate the theoretical foundation and conceptual soundness of the idea. Are the underlying principles well-defined and logically consistent? Does the idea demonstrate a deep understanding of relevant theories and concepts? How does it contribute to advancing theoretical understanding within the field? |
| | Clarity and Presentation | Assess the clarity, organization, and presentation quality of the idea. Is the idea communicated effectively, adhering to high presentation standards seen in top-tier ML conferences? |
| | Potential for Real-world Applications | Evaluate the potential of the idea to be applied in real-world scenarios. How applicable is it in practical settings and industry contexts? Does it address real-world problems or challenges identified in the trend? |
| | Innovation Potential | Assess the potential of the idea to inspire further research and innovation within the ML community. Does it open up new avenues for research or provide a novel framework aligning with the emerging trends and future directions of the trend? |
| | Ethical Considerations | Consider the ethical implications and societal impact of the idea. Does it adhere to the growing emphasis on ethical AI and responsible ML practices as highlighted in the trend? |
| | Interdisciplinary Connections | Evaluate the potential for the idea to connect with and contribute to other disciplines beyond ML. Does it align with the trend of interdisciplinary research and collaboration, integrating with fields such as data science, neuroscience, or social sciences? |
| Paper | Title Appeal | Does the title grab attention and generate interest? Is it informative and reflective of the paper's content? |
| | Abstract Quality | How well does the abstract summarize the paper? Is it clear, concise, and informative? Does it effectively convey the significance and main contributions of the paper? |
| | Title and Abstract Consistency | How well do the title and abstract align with each other? Do they accurately represent the core idea and content of the paper? |
| | Literature Review and Background | Assess the thoroughness of the literature review and background provided. Are the context and relevance of the research well established? Does it cover key works and current trends in the field? |
| | Methodology | Evaluate the soundness and appropriateness of the methodology used. Are the research design and methods clearly described and justified? Is the methodology robust and suitable for addressing the research questions? |
| | Results and Analysis | Assess the quality and clarity of the results presented. Are the results well-analyzed and interpreted? Do the findings support the claims made in the paper? |
| | Clarity and Presentation | Evaluate the clarity, organization, and presentation quality of the paper. |
| | Contribution to the Field | Evaluate the significance of the paper's contributions to the field. Does it advance knowledge or offer new insights? How does it compare to existing works in terms of impact? |
| | Ethical Considerations | Consider the ethical implications and societal impact of the work. Does it adhere to ethical guidelines and responsible research practices? Are potential negative consequences or biases addressed? |
| | Interdisciplinary Connections | Evaluate the potential for the work to connect with and contribute to other disciplines. Does it integrate knowledge from other fields or offer insights relevant to them? How well does it align with the trend of interdisciplinary research and collaboration? |
| Review | Summarization | Summarize the paper's motivation, key contributions, and achievements in a paragraph. Note whether there are misunderstandings that need to be addressed in the author response. |
| | Strengths | Describe the strengths of the work. Typical criteria include soundness of the claims (theoretical grounding, empirical evaluation), significance and novelty of the contribution, and relevance to the community. |
| | Weaknesses | Explain the limitations of this work along the same axes as above. |
| | Correctness | Are the claims and methods correct? Is the empirical methodology correct? Explain if there is anything incorrect with the paper. Incorrect claims or methodology are the primary reason for rejection. Be as detailed, specific, and polite as possible. Thoroughly motivate your criticism so that authors will understand your point of view and potentially respond to you. |
| | Clarity | Is the paper well written? Rate the clarity of the exposition of the paper. Give examples of what parts of the paper need revision to improve clarity. |
| | Relation to prior work | Is it clearly discussed how this work differs from previous contributions? Explain whether the submission is written with the due scholarship, relating the proposed work with the prior work in the literature. The related work section should not just list prior work, but explain how the proposed work differs from prior work that appeared in the literature. |
| | Reproducibility | Are there enough details to reproduce the major results of this work? Mark whether the work is reasonably reproducible. If it is not, lack of reproducibility should be listed among the weaknesses of the submission. |
| | Impacts and Implications | Have the authors adequately addressed the broader impact of their work, including potential negative ethical and societal implications of their work? |
| | Ethical Considerations | Does the submission raise potential ethical concerns? This includes methods, applications, or data that create or reinforce unfair bias or that have a primary purpose of harm or injury. |
thanks! very comprehensive.
@ft2023 probably related to the paper writing part.
Fairness
- Rating (1-10):
- Comments:
- Are the review scores distributed fairly?
- Is there a balance in the scoring, without significant bias towards extremely high or low scores?
- Do the scores reflect a reasonable and unbiased assessment of the paper?
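The fairness questions above ask about the spread of review scores, which could be checked numerically before a rating is assigned. Below is a minimal sketch, assuming each review carries a single integer score on a 1-10 scale. The function name `score_spread`, the `low`/`high` cutoffs, and the sample scores are all hypothetical and are not taken from the prompt template or from any real review.

```python
from statistics import mean, pstdev


def score_spread(scores, low=3, high=8):
    """Summarize how a set of review scores is distributed.

    `low` and `high` are illustrative cutoffs for "extreme" scores,
    not values defined anywhere in this thread.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    # Count scores at or beyond either cutoff as extreme.
    extreme = [s for s in scores if s <= low or s >= high]
    return {
        "mean": mean(scores),
        "stdev": pstdev(scores),  # population standard deviation
        "extreme_fraction": len(extreme) / len(scores),
    }


# Hypothetical review scores for one paper (made up for illustration).
summary = score_spread([8, 8, 5, 8])
print(summary)
```

A large `stdev` or `extreme_fraction` would only flag a set of reviews for a closer look; it does not by itself show that the scoring is unfair.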
@chengzr01 I added another point for review based on what you commented before. Now we have 10 dimensions.
Review Summarization
Strengths
Weaknesses
Correctness
Clarity
Relation to Prior Work
Reproducibility
Impacts and Implications
Ethical Considerations
Fairness
Interesting fact, GPT-4o could notice the variance of review scores. See details here. @lwaekfjlk @chengzr01 @ft2023
Fairness
Rating (1-10): 6
Comments: The reviews are generally fair, but there is a notable variance in scoring and critique depth. A more balanced approach across all reviews would enhance fairness.
Prompt template is here. Ratings from GPT-4o for Mamba reviewers: Overall Score=67. Dimension Scores=[7, 6, 6, 7, 8, 6, 7, 6, 8, 6]
Review Summarization
Rating (1-10): 7
Comments: The reviews generally summarize the paper's motivation, contributions, and achievements well. Reviewers mention key improvements like adaptive SSM parameters, the parallel recurrent algorithm, and the simplified architecture. However, there are some areas, particularly in Review 1, where more detail on certain criticisms could improve the summary.
Strengths
Rating (1-10): 6
Comments: The strengths are highlighted, such as the novel improvements over existing SSMs and the efficiency of the proposed methods. However, the extent to which these strengths are detailed varies, with some reviews providing more specific insights than others.
Weaknesses
Rating (1-10): 6
Comments: The weaknesses are addressed, but the depth of critique varies. Review 1 provides a thorough list of concerns, while other reviews are less specific. There is room for more balanced and detailed critiques.
Correctness
Rating (1-10): 7
Comments: Most claims and methods are correctly identified and evaluated. Reviewers note the empirical soundness and highlight potential issues, but not all reviews scrutinize the methods to the same extent.
Clarity
Rating (1-10): 8
Comments: The paper is praised for its clarity and presentation. Most reviews find the exposition clear, though some suggest additional references or more detailed explanations.
Relation to Prior Work
Rating (1-10): 6
Comments: The relation to prior work is mentioned but could be more thoroughly discussed. Some reviews note missing references and comparisons to other models, which is crucial for situating the paper within existing research.
Reproducibility
Rating (1-10): 7
Comments: The paper provides enough details for reproducibility, but some reviews suggest additional implementation details or code to enhance this aspect further.
Impacts and Implications
Rating (1-10): 6
Comments: While the potential impact is noted, particularly in improving sequence modeling efficiency, more discussion on broader implications and potential limitations would strengthen this dimension.
Ethical Considerations
Rating (1-10): 8
Comments: Ethical considerations are addressed, with reviewers noting no significant ethical concerns. The assessments are balanced and fair, aligning well with ethical standards.
Fairness
Rating (1-10): 6
Comments: The reviews are generally fair, but there is a notable variance in scoring and critique depth. A more balanced approach across all reviews would enhance fairness.
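The ten ratings above add up to the reported overall score of 67. The sketch below shows one way to pull the ratings out of output in this format and total them. It assumes the overall score is a plain sum of the dimension ratings, which matches the posted numbers but is not confirmed by the prompt template. The names `RATING_RE`, `parse_ratings`, and `overall_score` are hypothetical.

```python
import re

# Matches lines such as "Rating (1-10): 7" in the model output.
RATING_RE = re.compile(r"Rating \(1-10\):\s*(\d+)")


def parse_ratings(text):
    """Return every dimension rating found in the text, in order."""
    return [int(m) for m in RATING_RE.findall(text)]


def overall_score(ratings):
    """Assumed aggregation: the sum of all dimension ratings."""
    return sum(ratings)


# Short excerpt in the same format as the GPT-4o output above.
sample = """Review Summarization
Rating (1-10): 7
Comments: ...
Strengths
Rating (1-10): 6
Comments: ...
"""
print(parse_ratings(sample))

# Dimension scores as posted in the thread.
posted = [7, 6, 6, 7, 8, 6, 7, 6, 8, 6]
print(overall_score(posted))
```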
Description
Insightful