thoppe / arXiv2git

Chrome extension that links arXiv papers to github repos
https://chrome.google.com/webstore/detail/arXiv2git
21 stars, 2 forks

Github links not found by extension, despite links being present #1

Open mori-c opened 5 years ago

mori-c commented 5 years ago

Problem


Situation

Here's an example of the extension returning the following statement:

[Screenshot: arXiv abstract page for 1903.12112, captured 2019-04-13]

arXiv2git extension by Travis Hoppe [No github links found]


Reference

I was hoping the extension would return something similar to this source:


Proof

The paper itself, however, references four GitHub links:

[Screenshot: arXiv PDF of 1903.12112, captured 2019-04-13]

thoppe commented 5 years ago

Oh my gosh, I didn't think anyone actually used this! The paper link doesn't show because I haven't run the scraper in (looks at last commit ... three years!). If you're interested, I could probably start this back up again. The arXiv people mentioned they would do something like this (and I thought arXiv sanity would help), but it still seems to be a problem.

Thoughts? Want to help?

mori-c commented 5 years ago

Hah, arXiv2git isn't a bad idea. I've never built a Chrome extension before, so if you're still open to it, and knowing that I'll need some guidance, I'll be happy to help. I'll look at your code shortly; what would the steps be to get this running, aside from scraping?

thoppe commented 5 years ago

The Chrome extension pulls the data from GitHub (I think; I'll have to check when I get home). Basically though, the data is served from the static file you see in the repo.


thoppe commented 5 years ago

Yup, it pulls the data directly from GitHub. See

https://github.com/thoppe/arXiv2git/blob/01d8fb7a97e91c0564e04ec0e03a126963b115bf/chrome_extension/content.js#L65

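The basic idea is to query GitHub for non-fork repos that mention arXiv in the description or README:

    # `date` is the creation-date qualifier for the chunk being queried
    q = ' '.join([
        "arxiv",
        "in:description,readme",
        "created:{date}".format(date=date),
        "fork:false",
    ])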

I chunked the search into one query per month to keep each result set a reasonable size. After that, each README was downloaded and parsed for arXiv links. In theory, we could rerun the pipeline (and maybe clean it up?) to get updated links. To do it properly, the service would need to run once a month.
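For anyone picking this up, here is a rough sketch of those two steps: building one query per month and pulling arXiv IDs out of a README. It is not the code in the repo. The helper names (`month_range`, `build_query`, `extract_arxiv_ids`) and the regex are made up for illustration, and it assumes the `created:` qualifier is given as a `START..END` date range. It does no network calls; fetching the search results and READMEs is left out.

```python
import calendar
import re
from datetime import date

# Hypothetical pattern: matches new-style arXiv IDs (e.g. 1903.12112) in
# abs/pdf URLs. Old-style IDs such as hep-th/9901001 are not handled.
ARXIV_URL = re.compile(
    r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE
)


def month_range(year, month):
    """Return a START..END date range covering one calendar month."""
    last_day = calendar.monthrange(year, month)[1]
    start = date(year, month, 1)
    end = date(year, month, last_day)
    return f"{start.isoformat()}..{end.isoformat()}"


def build_query(year, month):
    """Build the search string from the thread, restricted to one month."""
    return " ".join([
        "arxiv",
        "in:description,readme",
        "created:{date}".format(date=month_range(year, month)),
        "fork:false",
    ])


def extract_arxiv_ids(readme_text):
    """Return the unique arXiv IDs linked in a README, in order of appearance."""
    seen = []
    for match in ARXIV_URL.finditer(readme_text):
        arxiv_id = match.group(1)
        if arxiv_id not in seen:
            seen.append(arxiv_id)
    return seen
```

For example, `build_query(2019, 4)` gives `arxiv in:description,readme created:2019-04-01..2019-04-30 fork:false`, and a monthly job would loop over the months since the last run and feed each downloaded README through `extract_arxiv_ids`.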