uclibs / AI-Project

Planning for App Dev AI project

Annif: Find existing content in Scholar to be used as a training set #3

Open hortongn opened 1 year ago

hortongn commented 1 year ago

Create a list of works and/or collections in Scholar that have embedded text and existing metadata. We want a good mix of examples: different file types, and files with minimal metadata as well as well-described files.

Probably best to start with works that have only 1 file attached to avoid confusion.
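A rough sketch of how that selection could be done, assuming work metadata has already been exported from Scholar as a list of dicts. The field names (`id`, `files`, `title`, and so on) and the "well-described" threshold are assumptions for illustration, not Scholar's actual schema.

```python
from collections import defaultdict

# Hypothetical descriptive fields; adjust to whatever a Scholar export provides.
DESCRIPTIVE_FIELDS = ("title", "creator", "description", "subject", "date_created")


def metadata_level(work, fields=DESCRIPTIVE_FIELDS, threshold=4):
    """Label a work 'well-described' if at least `threshold` fields are non-empty."""
    filled = sum(1 for field in fields if work.get(field))
    return "well-described" if filled >= threshold else "minimal"


def candidate_works(works):
    """Keep single-file works and group their ids by (file extension, metadata level)."""
    groups = defaultdict(list)
    for work in works:
        files = work.get("files", [])
        if len(files) != 1:
            continue  # skip works with zero or multiple attached files
        name = files[0]
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
        groups[(ext, metadata_level(work))].append(work["id"])
    return dict(groups)
```

Grouping by file type and metadata level makes it easy to see whether the list has the mix we want before pulling any files.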

https://github.com/NatLibFi/Annif-corpora
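For reference, a minimal sketch of writing one work into the layout Annif documents for a full-text corpus: a directory containing a `.txt` file per document and a `.tsv` file with the same base name listing its subjects, one per line, as a URI in angle brackets, a tab, and a label. The function name and arguments are hypothetical. It also assumes each subject already has a URI from the vocabulary the Annif project is configured with, which Scholar's existing subject metadata may not provide.

```python
from pathlib import Path


def write_annif_document(corpus_dir, doc_id, full_text, subjects):
    """Write <doc_id>.txt (full text) and <doc_id>.tsv (subjects) into corpus_dir.

    `subjects` is a list of (uri, label) pairs. The URIs are assumed to come
    from the subject vocabulary loaded into Annif.
    """
    corpus = Path(corpus_dir)
    corpus.mkdir(parents=True, exist_ok=True)
    (corpus / f"{doc_id}.txt").write_text(full_text, encoding="utf-8")
    lines = [f"<{uri}>\t{label}" for uri, label in subjects]
    (corpus / f"{doc_id}.tsv").write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Usage would look something like `write_annif_document("corpus/train", "work1", text, [("http://example.org/subject/1", "robotics")])`, where the URI is a placeholder.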

crowesn commented 1 year ago

I think this is the method used to get full text from files for indexing. It could be useful when building a dataset.

https://github.com/samvera/hyrax/blob/eb8d42d4fb99f8c7e2116af51b5642bf07312ce7/app/services/hyrax/file_set_derivatives_service.rb#L123

hortongn commented 1 year ago

Question: Do we send the document itself to the AI, or have Scholar extract the full text and send that to the AI?

hortongn commented 1 year ago

Some Scholar collections that may have useful content for a training set:

CEAS Electrical Engineering and Computing Systems (EECS) Senior Design Projects https://scholar.uc.edu/collections/x633f229p (student works)

2017 CECH Information Technology Senior Design Projects https://scholar.uc.edu/collections/bc3888732 (student works)

The Lucille M. Schultz 19th Century Composition Archive https://scholar.uc.edu/collections/05741w32f (documents)

Modernnati: Archiving & Preserving Cincinnati's Modernist Architecture https://scholar.uc.edu/collections/9p290b783 (articles)

Nature of Black Holes https://scholar.uc.edu/collections/t722hb29d (articles)

Cincinnati Romance Review https://scholar.uc.edu/collections/hd76s1380 (datasets???)

2019 Information Technology Research Symposium https://scholar.uc.edu/collections/jq085m248 (articles)