openreview / openreview-js

Node.js client library for OpenReview's academic publishing API
MIT License

Abstract extraction - mindmodeling.org #55

Open xkopenreview opened 1 week ago

xkopenreview commented 1 week ago

mindmodeling.org appears to be permanently offline, and the majority of abstract extraction failures come from this domain, so it would be better to have a dedicated rule for handling paper HTML from this domain.

There are two possible approaches:

  1. Use the Web Archive. It's possible that the paper was never saved, so there is no cached copy of the mindmodeling.org page. It's also possible that the saved page is wrong. For example, the paper "Grounding Compositional Hypothesis Generation in Specific Instances" (LVuRVdG9hC) has the HTML URL https://mindmodeling.org/cogsci2018/papers/0271/index.html but the cached page https://web.archive.org/web/20180729174438/http://mindmodeling.org/cogsci2018/papers/0271/index.html is for a different paper.
  2. Use eScholarship. https://escholarship.org/uc/cognitivesciencesociety seems to contain all CogSci proceedings, which should include the papers from mindmodeling.org. Need to figure out the links/search.

Abstracts from both sources could have incorrect word breaks.
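For approach 1, one way to guard against a cached page that belongs to a different paper is to check that the expected title actually appears in the cached HTML. The following is only a minimal sketch: the function names waybackUrl, normalizeTitle and pageMentionsTitle are hypothetical and not part of openreview-js, and the comparison assumes ASCII titles. It ignores whitespace, hyphens and punctuation so that word-break differences do not cause false mismatches.

```javascript
// Sketch only: these helpers are hypothetical, not part of openreview-js.

// Build a Wayback Machine URL from the original URL and a snapshot timestamp
// (the 14-digit form seen in the cached link above).
function waybackUrl(originalUrl, timestamp) {
  return `https://web.archive.org/web/${timestamp}/${originalUrl}`;
}

// Reduce text to lowercase ASCII letters and digits so that differences in
// whitespace, hyphenation and punctuation do not affect the comparison.
function normalizeTitle(text) {
  return text.toLowerCase().replace(/[^a-z0-9]/g, '');
}

// Return true if the expected title appears somewhere in the page text.
function pageMentionsTitle(html, expectedTitle) {
  const wanted = normalizeTitle(expectedTitle);
  if (wanted.length === 0) return false; // nothing meaningful to match
  const text = html.replace(/<[^>]*>/g, ' '); // crude tag stripping
  return normalizeTitle(text).includes(wanted);
}
```

A cached page that fails this check would be treated as missing, so the extractor can fall back to approach 2 instead of returning the wrong abstract.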

xkopenreview commented 1 week ago

For approach 2, a URL like https://escholarship.org/api/pageData/uc/cognitivesciencesociety/46/0 returns the page data for one issue; the trailing 46/0 corresponds to the issue for that year.

Each content.issue.sections[int].articles[int] entry has id, title, authors and abstract, which can be used to look up the paper's abstract.

The id has the form qt{actual id}, and the actual id can be used to link to the paper's page at https://escholarship.org/uc/item/${actual id}
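The lookup described above could be sketched roughly as follows. The helper names (pageDataUrl, titleKey, findArticleByTitle, itemUrl) are hypothetical; the field names and URL patterns are the ones noted in this comment, and the response shape has not been verified beyond that. Fetching the JSON is left out so the sketch stays runnable offline.

```javascript
// Sketch only: helper names are hypothetical. Field names
// (content.issue.sections[].articles[] with id, title, abstract)
// follow the description in this thread.

const ESCHOLARSHIP_BASE = 'https://escholarship.org';

// issuePath is the trailing part of the pageData URL, e.g. '46/0'.
function pageDataUrl(issuePath) {
  return `${ESCHOLARSHIP_BASE}/api/pageData/uc/cognitivesciencesociety/${issuePath}`;
}

// Comparison key that ignores case, whitespace and punctuation (ASCII only).
function titleKey(text) {
  return text.toLowerCase().replace(/[^a-z0-9]/g, '');
}

// Search every section of an already-parsed pageData object for an
// article whose title matches; returns the article or null.
function findArticleByTitle(pageData, title) {
  const wanted = titleKey(title);
  const sections = pageData?.content?.issue?.sections ?? [];
  for (const section of sections) {
    for (const article of section.articles ?? []) {
      if (titleKey(article.title ?? '') === wanted) return article;
    }
  }
  return null;
}

// Strip the leading 'qt' from the article id and build the item page URL.
function itemUrl(articleId) {
  const actualId = articleId.replace(/^qt/, '');
  return `${ESCHOLARSHIP_BASE}/uc/item/${actualId}`;
}
```

Matching on a normalized title rather than an exact string is a design choice here, since titles stored in OpenReview and in eScholarship may differ in spacing or punctuation.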

xkopenreview commented 1 week ago

Currently, abstract extraction requires only the URL. This will need to change so that it takes at least the title and venue as well.
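As a rough illustration of that change, the extractor's input could become an object carrying url, title and venue, with a domain check deciding which lookup to use. This is a hypothetical sketch only: chooseAbstractStrategy, its parameter shape and the strategy names are made up for illustration and are not the library's current API.

```javascript
// Hypothetical sketch: neither this function nor its parameter shape
// exists in openreview-js today.

// Decide how to look up an abstract when more than the URL is available.
function chooseAbstractStrategy({ url, title, venue }) {
  const host = new URL(url).hostname.replace(/^www\./, '');
  if (host === 'mindmodeling.org') {
    // The original page is offline, so the URL alone is not enough;
    // title and venue are needed to locate the paper elsewhere.
    if (!title || !venue) {
      throw new Error('title and venue are required for mindmodeling.org papers');
    }
    return { strategy: 'escholarship', title, venue };
  }
  // Any other domain keeps the existing URL-only behaviour.
  return { strategy: 'fetch-url', url };
}
```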