This is the official crawler implementation for the D3 Dataset in almost pure python. The crawler is also used for the cs-insights project.
Starting from version 1.0.2, this project is using semantic versioning, and supports SemanticScholar. For more info about the features supported, see the releases.
First install the package manager poetry:
pip install poetry
Then run:
poetry install
To start the crawling process, run:
poetry run cli main --s2_use_papers --s2_use_abstracts --s2_filter_dblp
For help run:
poetry run cli main --help
To maintain a consistent and well-tested repository, we use unit tests, linting, and typing checkers with GitHub actions. We use pytest for testing, pylint for linting, and pyright for typing. Every time code gets pushed to our repository these checks are executed and have to fullfill certain requirements before you can merge the code to our master branch.
Whenever you create a pull request against the default branch, GitHub actions will create a CI job executing unit tests and linting.
To run all tests that are tested during CI locally, run:
poetry run poe alltest
Fork the repo, make changes and send a PR. We'll review it together!
Commit messages should follow Angular's conventions.
This project is licensed under the terms of MIT license. For more information, please see the LICENSE file.
If you use this repository, or use our tool for analysis, please cite our work:
If you use this repository, or use our tool for analysis, please cite our work:
@inproceedings{Wahle2022c,
title = {D3: A Massive Dataset of Scholarly Metadata for Analyzing the State of Computer Science Research},
author = {Wahle, Jan Philip and Ruas, Terry and Mohammad, Saif M. and Gipp, Bela},
year = {2022},
month = {July},
booktitle = {Proceedings of The 13th Language Resources and Evaluation Conference},
publisher = {European Language Resources Association},
address = {Marseille, France},
doi = {},
}
Also make sure to cite the following papers if you use SemanticScholar data:
@inproceedings{ammar-etal-2018-construction,
title = "Construction of the Literature Graph in Semantic Scholar",
author = "Ammar, Waleed and
Groeneveld, Dirk and
Bhagavatula, Chandra and
Beltagy, Iz",
booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)",
month = jun,
year = "2018",
address = "New Orleans - Louisiana",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/N18-3011",
doi = "10.18653/v1/N18-3011",
pages = "84--91",
}
@inproceedings{lo-wang-2020-s2orc,
title = "{S}2{ORC}: The Semantic Scholar Open Research Corpus",
author = "Lo, Kyle and Wang, Lucy Lu and Neumann, Mark and Kinney, Rodney and Weld, Daniel",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.447",
doi = "10.18653/v1/2020.acl-main.447",
pages = "4969--4983"
}