uclibs / AI-Project

Planning for App Dev AI project

Annif: identify a subject vocabulary #15

Open hortongn opened 1 year ago

hortongn commented 1 year ago

Annif's supported subject vocabulary formats: https://github.com/NatLibFi/Annif/wiki/Subject-vocabulary-formats

Sample vocabularies and training corpora: https://github.com/NatLibFi/Annif-corpora

Loading the YSO vocabulary from its SKOS Turtle file:

annif load-vocab yso /path/to/Annif-corpora/vocab/yso-skos.ttl
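Once a vocabulary is loaded, a project references it by its vocab id in projects.cfg. A minimal sketch, assuming the yso id above and the simple tfidf backend from the Annif tutorial (the project id and settings here are illustrative):

[yso-tfidf-en]
name=YSO TF-IDF English
language=en
backend=tfidf
analyzer=snowball(english)
limit=100
vocab=yso

With that in place, annif train yso-tfidf-en /path/to/training-corpus trains the model, and annif suggest yso-tfidf-en reads a document from stdin and returns suggested subjects.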

scherztc commented 1 year ago

https://github.com/samvera/questioning_authority/wiki
https://github.com/samvera/questioning_authority

Questioning Authority is a Ruby gem developed by the Samvera community that might help with subject vocabularies.

scherztc commented 1 year ago

Structure of a subject vocabulary, per Annif's supported formats:

https://github.com/NatLibFi/Annif/wiki/Subject-vocabulary-formats
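For reference, the simplest format documented there is a TSV file with one subject per line: the URI in angle brackets, a tab, then the preferred label. A hypothetical two-line example (example.org URIs and made-up labels):

<http://example.org/subjects/0001>	machine learning
<http://example.org/subjects/0002>	metadata

Annif also accepts full SKOS vocabularies (Turtle/RDF), which is what the load-vocab command above uses for YSO.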

hortongn commented 1 year ago

Vocabulary links from yesterday's meeting:

YSO, the General Finnish Ontology: https://finto.fi/yso/en/

Library of Congress Subject Headings (LCSH): https://id.loc.gov/authorities/subjects.html

LoC bulk downloads: https://id.loc.gov/download/

UNESCO Thesaurus (SKOS): https://skos.um.es/unescothes/

Annif subject vocabulary formats: https://github.com/NatLibFi/Annif/wiki/Subject-vocabulary-formats

haitzlm commented 1 year ago

This might be interesting:

The arXiv academic paper dataset is available from arXiv itself (https://arxiv.org/), an open-access repository of scientific papers in many fields.

Following arXiv's bulk data access documentation, you can download the metadata in a format suitable for machine learning tasks. The dataset contains over 1.7 million papers across computer science, physics, mathematics, and more, and each paper is labeled with one or more arXiv subject categories, which can be used for text classification.

Note that the dataset is quite large, so significant computing resources may be needed to work with it effectively. The data may also need preprocessing and cleaning before training, since it spans a wide variety of paper formats and styles.
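If we experiment with it, here is a minimal sketch for pulling (text, labels) pairs out of the metadata, assuming the JSON-lines snapshot distributed via Kaggle (arxiv-metadata-oai-snapshot.json; the field names are assumptions about that snapshot):

import json

# Sketch: yield (text, labels) pairs from the arXiv metadata snapshot.
# Assumes a JSON-lines file where each record has "title", "abstract",
# and a space-separated "categories" string.
def iter_papers(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            paper = json.loads(line)
            text = paper["title"] + "\n" + paper["abstract"]
            labels = paper["categories"].split()  # e.g. "cs.CL cs.LG"
            yield text, labels

for text, labels in iter_papers("arxiv-metadata-oai-snapshot.json"):
    print(labels, text[:60])
    break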

crowesn commented 1 year ago

ETDs from the University of Cincinnati are available in full text from ProQuest as well as the OhioLINK ETD Center. Authors are encouraged to supply keywords for their papers, most of which are uncontrolled.

Initially, our project is small scale. I'd propose we pull the full dataset of UC ETDs from the OhioLINK ETD Center via OAI-PMH. From that we extract all of the keywords and develop our own subject vocabulary (see the sketch below). This lets us use a subset of the ETDs as training data, with the remaining corpus for validation/testing. Further, once we have a trained model, we could use documents from Scholar@UC to see how well it does on more general repository content.

https://github.com/NatLibFi/Annif/wiki/Subject-vocabulary-formats
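A minimal sketch of turning harvested keywords into a vocabulary Annif can load. The TSV layout follows the wiki page linked above; the example.org URI scheme is purely an assumption, since the raw keywords have no URIs of their own:

import csv

# Sketch: write an Annif-style TSV vocabulary (<URI><tab>label) from
# a list of raw ETD keywords. Minting example.org URIs is an
# assumption; any stable, unique URI per subject would work.
def write_vocab(keywords, path="etd-subjects.tsv"):
    seen = {}
    for kw in keywords:
        label = " ".join(kw.split()).lower()  # collapse whitespace, lowercase
        if label and label not in seen:
            seen[label] = f"http://example.org/etd-subjects/{len(seen) + 1}"
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for label, uri in seen.items():
            writer.writerow([f"<{uri}>", label])

write_vocab(["Machine Learning", "machine  learning", "Metadata"])

The resulting file could then be loaded with annif load-vocab, like the YSO example earlier in the thread.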

crowesn commented 1 year ago

Also looking at OCLC's FAST (Faceted Application of Subject Terminology), a faceted vocabulary derived from LCSH, as a possibility: https://www.oclc.org/research/areas/data-science/fast/download.html

crowesn commented 1 year ago

Annif tutorial exercise on using a custom corpus: https://github.com/NatLibFi/Annif-tutorial/blob/master/exercises/OPT_custom_corpus.md

scherztc commented 1 year ago

Data sets from a fork of the Annif tutorial: https://github.com/jimfhahn/Annif-tutorial/tree/master/data-sets

scherztc commented 1 year ago

Scopus has a subject vocabulary thesaurus, too: https://service.elsevier.com/app/answers/detail/a_id/14882/supporthub/scopus/~/what-are-the-most-frequent-subject-area-categories-and-classifications-used-in/

crowesn commented 1 year ago

OAI-PMH URL for UC ETDs at OhioLINK: https://etd.ohiolink.edu/apexprod/!etd_search_oai?verb=ListRecords&metadataPrefix=oai_etdms&setSpec=ucin
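Since the standard OAI-PMH request argument is set rather than setSpec, a harvesting sketch with the Sickle library would look roughly like this (the "subject" field name is an assumption about how the oai_etdms records expose keywords):

from itertools import islice

from sickle import Sickle  # pip install sickle

# Sketch: harvest UC ETD records from the OhioLINK ETD Center and
# collect author-supplied keywords from each record's metadata.
sickle = Sickle("https://etd.ohiolink.edu/apexprod/!etd_search_oai")
records = sickle.ListRecords(metadataPrefix="oai_etdms", set="ucin")

keywords = []
for record in islice(records, 100):  # sample the first 100 records
    keywords.extend(record.metadata.get("subject", []))

print(len(keywords), "keywords in the first 100 records")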