ulab-uiuc / research-town

A platform for developers to simulate a research community
http://app.auto-research.dev
Apache License 2.0

[FEAT]: Get paper information with reviews from OpenReview efficiently #166

Open Monstertail opened 5 months ago

Monstertail commented 5 months ago

Description

To collect more data for the evaluation of reviews.

Additional Information

No response

lwaekfjlk commented 5 months ago

choose different domains

Monstertail commented 5 months ago
  1. For information on which reviews and papers to collect, you can refer here. If possible, collect more information for our future experiments, like authors, meta-reviewers, etc.
  2. Research domains to be collected: to be decided, but you can refer here first. Think about how to collect papers from OpenReview in different domains, like machine learning systems, GNNs, etc. (we need to group papers by their domains for microbench; see the sketch after this list). @Kunlun-Zhu
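
Something like this rough sketch could work for the grouping step (the DOMAIN_KEYWORDS map and the paper dict shape are assumptions, since the domain list is still to be decided):

    import collections

    # Hypothetical keyword-to-domain map; the actual taxonomy is still to be decided.
    DOMAIN_KEYWORDS = {
        "machine learning systems": ["inference", "kv cache", "serving"],
        "graph neural networks": ["gnn", "graph neural network"],
    }

    def group_by_domain(papers):
        """Group papers (dicts with a 'keywords' list) into domain buckets for microbench."""
        buckets = collections.defaultdict(list)
        for paper in papers:
            kws = " ".join(paper.get("keywords", [])).lower()
            for domain, markers in DOMAIN_KEYWORDS.items():
                if any(marker in kws for marker in markers):
                    buckets[domain].append(paper)
        return buckets
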
Kunlun-Zhu commented 5 months ago

A repo that might be useful: repo

Kunlun-Zhu commented 5 months ago

I also found that we can use the OpenReview API, with this example.
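
A minimal sketch of what that could look like with the openreview-py client (the baseurl, venue id, and invitation names are assumptions based on ICLR 2023's API v1 setup):

    import openreview

    # Anonymous read-only client (assuming the venue is on OpenReview API v1).
    client = openreview.Client(baseurl="https://api.openreview.net")

    # Page through all ICLR 2023 submissions.
    submissions = openreview.tools.iterget_notes(
        client, invitation="ICLR.cc/2023/Conference/-/Blind_Submission"
    )

    for note in submissions:
        # Reviews, meta-reviews, and decisions are replies in each submission's forum.
        replies = client.get_notes(forum=note.forum)
        reviews = [r for r in replies if r.invitation.endswith("Official_Review")]
        print(note.content["title"], len(reviews))
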

Monstertail commented 5 months ago

As for the collection format, you could refer to:

 "ChunkAttention: Efficient Attention on KV Cache with Chunking Sharing and Batching": {
    "paper_pk": null,
    "title": "ChunkAttention: Efficient Attention on KV Cache with Chunking Sharing and Batching",
    "abstract": "Self-attention is an essential component of GPT-style models and a significant cause of LLM inference latency for long sequences. In multi-tenant LLM inference servers, the compute and memory operation cost of self-attention can be amortized by making use of the probability that sequences from users may share long prompt prefixes. This paper introduces ChunkAttention, a unique self-attention kernel built on chunking, sharing the KV cache, and batching the attention computation. ChunkAttention recognizes matching prompt prefixes across several sequences and shares their KV cache in memory by chunking the KV cache and structuring it into the auxiliary prefix tree. To significantly improve the memory reuse of KV cache and consequently the speed of self-attention for long shared prompts, we design an efficient computation kernel on this new storage structure, where two-phased partitioning is implemented to reduce memory operations on shared KV cache during self-attention. Experiments show that ChunkAttention can speed up self-attention of long shared prompts 1.6-3 times, with lengths ranging from 1024 to 8192.",
    "authors": [],
    "keywords": [
      "large language model",
      "model inference",
      "self attention"
    ],
    "real_avg_scores": null,
    "real_all_scores": [
      5,
      5,
      5,
      3
    ],
    "real_contents": [],
    "real_rank": 0,
    "real_decision": "Reject",
    "sim_avg_scores": null,
    "sim_all_scores": [],
    "sim_contents": [],
    "sim_rank": 0,
    "sim_decision": null
  }

To point out, the keys below (I skip "real_decision": "Reject" here, since we do need that one):

    "real_rank": 0,
    "sim_avg_scores": null,
    "sim_all_scores": [],
    "sim_contents": [],
    "sim_rank": 0,
    "sim_decision": null

are ones I allocated in advance, but we don't really need them for the paper review collection. It would be great if we could get more information on papers, like authors, detailed review contents (which could be saved under the "real_contents": [] key), the meta-review, etc. You could add more keys. @Kunlun-Zhu
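
For instance, a rough sketch of how one entry could be filled from a submission and its forum replies (the review-form field names like "recommendation" and "summary_of_the_paper" are assumptions; check the actual venue's review form):

    # Sketch: build one schema entry from an API v1 submission note and its replies.
    def build_entry(note, replies):
        reviews = [r for r in replies if r.invitation.endswith("Official_Review")]
        decisions = [r for r in replies if r.invitation.endswith("Decision")]
        # Assumed rating format, e.g. "6: marginally above the acceptance threshold".
        scores = [int(r.content["recommendation"].split(":")[0]) for r in reviews]
        return {
            "paper_pk": None,
            "title": note.content["title"],
            "abstract": note.content["abstract"],
            "authors": note.content.get("authors", []),
            "keywords": note.content.get("keywords", []),
            "real_avg_scores": sum(scores) / len(scores) if scores else None,
            "real_all_scores": scores,
            "real_contents": [r.content.get("summary_of_the_paper", "") for r in reviews],
            "real_decision": decisions[0].content["decision"] if decisions else None,
        }
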

Kunlun-Zhu commented 5 months ago

I prepared a dataset for Jinwei; it needs a quality review.

lwaekfjlk commented 3 months ago

@Kunlun-Zhu can you provide some more guidance on how to get that data? I remember you mentioned that you used a package or environment to do that?

Kunlun-Zhu commented 3 months ago

The code to get the data is here: ICLR_2023_data

lwaekfjlk commented 3 months ago

thanks