princeton-nlp / LitSearch

A Retrieval Benchmark for Scientific Literature Search
MIT License

Inline-citation questions #1

Open wyzhhhh opened 2 weeks ago

wyzhhhh commented 2 weeks ago

How did the authors extract the inline-citation information from the S2ORC dataset?

anirudhajith commented 1 week ago

Hi @wyzhhhh,

The full S2ORC dataset releases include an "annotations" field alongside the paper data. This field lists the character indices that mark where each part of the paper (e.g., title, abstract, author names, individual paragraphs) appears in the paper's plaintext.

Here's an illustration of the S2ORC schema:

[Image: screenshot of the S2ORC schema]

We used the indices listed under the "bibref" annotations to locate inline citations in the plaintext. These annotations usually also included a "matched_paper_id" field, which we used to link an inline citation in a source paper to the cited target paper within the S2ORC dataset.
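For readers who want to try this, here is a minimal sketch of the idea, not the actual LitSearch code. It assumes each record has a `content` object holding `text` and `annotations`, that each "bibref" entry carries `start`/`end` character offsets, and that "matched_paper_id" sits under an `attributes` key. Those field names and the nesting are assumptions about the release layout, so please check them against your own copy of the data. The function name `extract_inline_citations` and the toy record are hypothetical.

```python
import json


def extract_inline_citations(record, context_chars=40):
    """Return the inline citation spans found in one S2ORC-style record.

    Assumed layout (verify against your data release):
      record["content"]["text"]                  -> paper plaintext
      record["content"]["annotations"]["bibref"] -> list of spans, possibly
                                                    stored as a JSON string
    """
    content = record["content"]
    text = content["text"]
    raw = content["annotations"].get("bibref")
    if raw is None:
        return []

    # Some releases store each annotation list as a JSON-encoded string.
    spans = json.loads(raw) if isinstance(raw, str) else raw

    citations = []
    for span in spans:
        start, end = int(span["start"]), int(span["end"])
        attrs = span.get("attributes") or {}
        citations.append({
            "marker": text[start:end],
            "context": text[max(0, start - context_chars):end + context_chars],
            # None when the citation was not matched to a paper in the corpus.
            "matched_paper_id": attrs.get("matched_paper_id"),
        })
    return citations


# Toy record built by hand for illustration; not real S2ORC data.
toy_text = "Dense retrieval [1] outperforms BM25 [2] on some benchmarks."
first = toy_text.index("[1]")
second = toy_text.index("[2]")
toy_record = {
    "content": {
        "text": toy_text,
        "annotations": {
            "bibref": json.dumps([
                {"start": first, "end": first + 3,
                 "attributes": {"matched_paper_id": 111}},
                {"start": second, "end": second + 3, "attributes": {}},
            ])
        },
    }
}

for citation in extract_inline_citations(toy_record):
    print(citation["marker"], citation["matched_paper_id"])
```

Citations without a "matched_paper_id" come back as `None` here, so you can filter them out if you only want links that resolve to a paper inside the corpus.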

I hope this answers your question. Let us know if you have any more!