uclibs / AI-Project

Planning for App Dev AI project

Annif: identify a subject vocabulary #15

Open hortongn opened 1 year ago

hortongn commented 1 year ago

Annif's supported subject vocabulary formats: https://github.com/NatLibFi/Annif/wiki/Subject-vocabulary-formats

Sample vocabularies and training corpora: https://github.com/NatLibFi/Annif-corpora

Loading the YSO vocabulary from its SKOS Turtle file:

annif load-vocab yso /path/to/Annif-corpora/vocab/yso-skos.ttl
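Once a vocabulary is loaded, a project references it by its vocab id in projects.cfg. A minimal sketch, assuming the yso id above and the simple tfidf backend from the Annif tutorial (the project id and settings here are illustrative):

[yso-tfidf-en]
name=YSO TF-IDF English
language=en
backend=tfidf
analyzer=snowball(english)
limit=100
vocab=yso

With that in place, annif train yso-tfidf-en /path/to/training-corpus trains the model, and annif suggest yso-tfidf-en reads a document from stdin and returns suggested subjects.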

scherztc commented 1 year ago

https://github.com/samvera/questioning_authority/wiki
https://github.com/samvera/questioning_authority

Questioning Authority is a Ruby gem developed by the Samvera community that might help with subject vocabularies.

scherztc commented 1 year ago

Structure of a subject vocabulary, per Annif's supported formats:

https://github.com/NatLibFi/Annif/wiki/Subject-vocabulary-formats
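For reference, the simplest format documented there is a TSV file with one subject per line: the URI in angle brackets, a tab, then the preferred label. A hypothetical two-line example (example.org URIs and made-up labels):

<http://example.org/subjects/0001>	machine learning
<http://example.org/subjects/0002>	metadata

Annif also accepts full SKOS vocabularies (Turtle/RDF), which is what the load-vocab command above uses for YSO.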

hortongn commented 1 year ago

Vocabulary links from yesterday's meeting:

YSO, the General Finnish Ontology: https://finto.fi/yso/en/

Library of Congress Subject Headings (LCSH): https://id.loc.gov/authorities/subjects.html

LoC bulk downloads: https://id.loc.gov/download/

UNESCO Thesaurus (SKOS): https://skos.um.es/unescothes/

Annif subject vocabulary formats: https://github.com/NatLibFi/Annif/wiki/Subject-vocabulary-formats

haitzlm commented 1 year ago

This might be interesting:

The arXiv academic paper dataset is available from arXiv itself (https://arxiv.org/), an open-access repository of scientific papers in many fields.

Following arXiv's bulk data access documentation, you can download the metadata in a format suitable for machine learning tasks. The dataset contains over 1.7 million papers across computer science, physics, mathematics, and more, and each paper is labeled with one or more arXiv subject categories, which can be used for text classification.

Note that the dataset is quite large, so significant computing resources may be needed to work with it effectively. The data may also need preprocessing and cleaning before training, since it spans a wide variety of paper formats and styles.
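If we experiment with it, here is a minimal sketch for pulling (text, labels) pairs out of the metadata, assuming the JSON-lines snapshot distributed via Kaggle (arxiv-metadata-oai-snapshot.json; the field names are assumptions about that snapshot):

import json

# Sketch: yield (text, labels) pairs from the arXiv metadata snapshot.
# Assumes a JSON-lines file where each record has "title", "abstract",
# and a space-separated "categories" string.
def iter_papers(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            paper = json.loads(line)
            text = paper["title"] + "\n" + paper["abstract"]
            labels = paper["categories"].split()  # e.g. "cs.CL cs.LG"
            yield text, labels

for text, labels in iter_papers("arxiv-metadata-oai-snapshot.json"):
    print(labels, text[:60])
    break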

crowesn commented 1 year ago

ETDs from the University of Cincinnati are available in full text from ProQuest as well as the OhioLINK ETD Center. Authors are encouraged to supply keywords for their papers, most of which are uncontrolled.

Initially, our project is small scale. I'd propose we pull the full dataset of UC ETDs from the OhioLINK ETD Center via OAI-PMH. From that we extract all of the keywords and develop our own subject vocabulary (see the sketch below). This lets us use a subset of the ETDs as training data, with the remaining corpus for validation/testing. Further, once we have a trained model, we could use documents from Scholar@UC to see how well it does on more general repository content.

https://github.com/NatLibFi/Annif/wiki/Subject-vocabulary-formats
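A minimal sketch of turning harvested keywords into a vocabulary Annif can load. The TSV layout follows the wiki page linked above; the example.org URI scheme is purely an assumption, since the raw keywords have no URIs of their own:

import csv

# Sketch: write an Annif-style TSV vocabulary (<URI><tab>label) from
# a list of raw ETD keywords. Minting example.org URIs is an
# assumption; any stable, unique URI per subject would work.
def write_vocab(keywords, path="etd-subjects.tsv"):
    seen = {}
    for kw in keywords:
        label = " ".join(kw.split()).lower()  # collapse whitespace, lowercase
        if label and label not in seen:
            seen[label] = f"http://example.org/etd-subjects/{len(seen) + 1}"
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for label, uri in seen.items():
            writer.writerow([f"<{uri}>", label])

write_vocab(["Machine Learning", "machine  learning", "Metadata"])

The resulting file could then be loaded with annif load-vocab, like the YSO example earlier in the thread.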

crowesn commented 1 year ago

Also looking at OCLC's FAST (Faceted Application of Subject Terminology), a faceted vocabulary derived from LCSH, as a possibility: https://www.oclc.org/research/areas/data-science/fast/download.html

crowesn commented 1 year ago

Annif tutorial exercise on using a custom corpus: https://github.com/NatLibFi/Annif-tutorial/blob/master/exercises/OPT_custom_corpus.md

scherztc commented 1 year ago

Data sets from a fork of the Annif tutorial: https://github.com/jimfhahn/Annif-tutorial/tree/master/data-sets

scherztc commented 1 year ago

Scopus has a subject vocabulary thesaurus, too: https://service.elsevier.com/app/answers/detail/a_id/14882/supporthub/scopus/~/what-are-the-most-frequent-subject-area-categories-and-classifications-used-in/

crowesn commented 1 year ago

OAI-PMH URL for UC ETDs at OhioLINK: https://etd.ohiolink.edu/apexprod/!etd_search_oai?verb=ListRecords&metadataPrefix=oai_etdms&setSpec=ucin
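Since the standard OAI-PMH request argument is set rather than setSpec, a harvesting sketch with the Sickle library would look roughly like this (the "subject" field name is an assumption about how the oai_etdms records expose keywords):

from itertools import islice

from sickle import Sickle  # pip install sickle

# Sketch: harvest UC ETD records from the OhioLINK ETD Center and
# collect author-supplied keywords from each record's metadata.
sickle = Sickle("https://etd.ohiolink.edu/apexprod/!etd_search_oai")
records = sickle.ListRecords(metadataPrefix="oai_etdms", set="ucin")

keywords = []
for record in islice(records, 100):  # sample the first 100 records
    keywords.extend(record.metadata.get("subject", []))

print(len(keywords), "keywords in the first 100 records")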