sebastianruder / NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
https://nlpprogress.com/
MIT License

Automatic metadata fetching via API call #109

Open lopusz opened 5 years ago

lopusz commented 5 years ago

Dear Sebastian, dear NLP-progress Contributors,

Thank you for creating this database!

More of a question than issue here...

I believe I have an interesting idea for improving this resource, which I tested on my own list of papers on interpretable ML (https://github.com/lopusz/awesome-interpretable-machine-learning). I thought I might share it here as well.

The idea is to provide only the arXiv id (and the scores) in the YAML files and let a script generate the title, authors, year, and url from an arXiv API call. If a paper is not available on arXiv, the same can be done via the Semantic Scholar API or the DOI API.

This essentially reduces the amount of copy & paste and keeps the metadata consistent.
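To make the idea concrete, here is a minimal sketch of the metadata-extraction step. The Atom snippet below is a hand-made example of the kind of feed the arXiv API returns (a real response contains more fields), and the function name and dict layout are illustrative, not taken from the POC:

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Hand-made sample of the Atom feed returned by
# http://export.arxiv.org/api/query?id_list=<id>; real responses
# contain more fields (abstract, categories, links, ...).
SAMPLE_RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/1611.01734v3</id>
    <title>Deep Biaffine Attention for Neural Dependency Parsing</title>
    <published>2016-11-06T00:00:00Z</published>
    <author><name>Timothy Dozat</name></author>
    <author><name>Christopher D. Manning</name></author>
  </entry>
</feed>"""

def parse_arxiv_entry(atom_xml):
    """Pull title, authors, year, and url out of an arXiv Atom entry."""
    entry = ET.fromstring(atom_xml).find(ATOM_NS + "entry")
    return {
        # Titles in the feed can contain line breaks; normalize whitespace.
        "title": " ".join(entry.findtext(ATOM_NS + "title").split()),
        "authors": [a.findtext(ATOM_NS + "name")
                    for a in entry.findall(ATOM_NS + "author")],
        "year": int(entry.findtext(ATOM_NS + "published")[:4]),
        "url": entry.findtext(ATOM_NS + "id"),
    }
```

The generator script would fetch such a feed for each id and substitute the parsed fields into the YAML entry.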

I have created a little proof of concept here. https://github.com/lopusz/NLP-progress

A sample simplified YAML "template" (containing only the ids) is here: https://github.com/lopusz/NLP-progress/blob/devel/_data/dependency_parsing.yaml.template

It is processed by the gener_yaml Python script (requires the pyyaml package): https://github.com/lopusz/NLP-progress/blob/devel/_data/gener_yaml.py

which produces the full YAML for Jekyll: https://github.com/lopusz/NLP-progress/blob/devel/_data/dependency_parsing.yaml
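Roughly, the transformation turns a template entry holding only an id and a score into a full entry. The field names below are illustrative, not the exact POC schema:

```yaml
# Hypothetical template entry: contributors supply only the id and the score.
- arxiv_id: 1611.01734
  uas: 95.74

# After running the generator, the full entry might read:
- title: Deep Biaffine Attention for Neural Dependency Parsing
  authors: Timothy Dozat, Christopher D. Manning
  year: 2016
  url: https://arxiv.org/abs/1611.01734
  uas: 95.74
```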

The workflow can be traced in the Makefile: https://github.com/lopusz/NLP-progress/blob/devel/Makefile

Do you think it is useful?

In addition to improving consistency, it could ease the "yamlisation" of the other components. One could also think of easily including more metadata (e.g. a more accurate publication date for better timeline graphing?).

If you find it interesting we could think how to refactor this POC, so it best fits the NLP-progress workflow...

Best regards, Michał

sebastianruder commented 5 years ago

Hey, sorry about my late response. This sounds like a useful idea. I just have two concerns:

lopusz commented 5 years ago

Hi Sebastian,

this time I apologize for the slow response. I was offline for two weeks.

Concerning the first bullet: the script can deal not only with the arXiv API, but also with the DOI API and the Semantic Scholar API. Semantic Scholar in particular has a huge database; for dependency parsing, for example, I could easily fetch metadata for every paper via API. My feeling is that these three APIs will cover well over 95% of the listed resources. If all the data were in YAML format, one could easily write a short script checking the coverage.

Of course, for papers not covered by any API one can still fall back to entering all the details by hand, as is done now.
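The dispatch between the three sources could be as simple as inspecting the shape of the identifier. The endpoint URLs and ID patterns below are my assumptions for illustration, not the POC's actual code:

```python
import re

def resolve_api_url(paper_id):
    """Map a paper identifier to a metadata API that can resolve it.

    Endpoints are illustrative assumptions, not the POC's exact code.
    """
    if re.fullmatch(r"\d{4}\.\d{4,5}(v\d+)?", paper_id):
        # New-style arXiv id, e.g. 1611.01734
        return "http://export.arxiv.org/api/query?id_list=" + paper_id
    if paper_id.startswith("10."):
        # DOIs always begin with the "10." directory prefix
        return "https://doi.org/" + paper_id
    # Otherwise fall back to Semantic Scholar's paper lookup
    return "https://api.semanticscholar.org/v1/paper/" + paper_id
```

Anything the dispatcher cannot classify would be left for the manual fallback described above.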

lopusz commented 5 years ago

As far as the second bullet is concerned, that is essentially my question: if and how would you see integrating this into your maintenance workflow? The idea is that contributors enter either an arXiv ID, a DOI, or a Semantic Scholar ID, and the tooling does the rest. I believe it is worth considering. Your repo will keep growing, e.g. with the addition of new languages or tasks, so tools that improve data consistency would definitely increase the overall quality of NLP-progress. For example, this would mean no more title/link inconsistencies like in https://github.com/sebastianruder/NLP-progress/pull/95, or the arxiv.org/abs vs. arxiv.org/pdf inconsistencies that exist now.

Another benefit of having full metadata is that the tooling could easily generate a downloadable BibTeX file for every task/language, which could be helpful for many users.
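Once the metadata dicts exist, rendering BibTeX is a few lines. This is a minimal sketch assuming the dict layout from the fetching step; the citation-key scheme (last name + year) and the @misc entry type are my own choices for illustration:

```python
def to_bibtex(meta):
    """Render a metadata dict as a BibTeX @misc entry.

    Assumes keys: title, authors (list), year (int), url.
    Key scheme (first author's last name + year) is illustrative.
    """
    key = meta["authors"][0].split()[-1].lower() + str(meta["year"])
    return "\n".join([
        "@misc{%s," % key,
        "  title  = {%s}," % meta["title"],
        "  author = {%s}," % " and ".join(meta["authors"]),
        "  year   = {%d}," % meta["year"],
        "  url    = {%s}," % meta["url"],
        "}",
    ])
```

Concatenating these entries per task/language file would yield the downloadable .bib file.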

Best regards, Michał

sebastianruder commented 5 years ago

Hey Michał, I'm really sorry about my late reply. I meant to answer sooner, but somehow this slipped through the cracks. I really like the idea and would love to offer more functionality on top of it. However, we've decided to stick with storing the data in Markdown for now, and I'm not sure whether this is still compatible with that.