Environmental scan of potential AI/ML models

hortongn commented 1 year ago

Create a list of models that we could potentially use to extract text from documents and suggest metadata. We will start with basic metadata like title, description, etc. and eventually move on to optional metadata fields found in Scholar.

We would ideally use "machine learning as a service" options, where the provider hosts the models for us, but we can also explore open source options.

haitzlm commented 1 year ago

- https://www.aiforlibrarians.com/ai-cases/
- https://annif.org/
- https://pubmed.ncbi.nlm.nih.gov/30153250/
- https://iris.ai/
- Annif: DIY automated subject indexing using multiple algorithms https://liberquarterly.eu/article/view/10732
- Automated Classification to Improve the Efficiency of Weeding Library Collections https://www.sciencedirect.com/science/article/pii/S0099133317304160?via%3Dihub
- OpenAI on GitHub https://github.com/openai
- https://github.com/openai/openai-cookbook
- Can Machine Learning be used to assign managed metadata attributes for items? https://learn.microsoft.com/en-us/microsoft-365/community/machine-learning-and-managed-metadata
- Apache Mahout https://mahout.apache.org//
- Apache Spark MLlib https://spark.apache.org/mllib/
- https://library.stanford.edu/blogs/stanford-libraries-blog/2022/07/working-students-library-collections-data

hortongn commented 1 year ago

Next steps:

- Categorize which models can be used for specific metadata fields.
- Expand the existing list with more examples and resources.

hortongn commented 1 year ago

Consider making use of the metadata tags that may already be embedded in a document (PDF, Word, etc.)
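
A minimal sketch of that idea for Word files, assuming the document is a .docx (a ZIP package whose `docProps/core.xml` part carries properties such as title, creator, and description). The helper name `read_docx_core_properties` is hypothetical, only the Python standard library is used, and PDFs would need a separate PDF library that is not shown here.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the core-properties part of an OOXML (.docx) package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Metadata field -> element name inside docProps/core.xml.
FIELDS = {
    "title": "dc:title",
    "creator": "dc:creator",
    "description": "dc:description",
    "subject": "dc:subject",
    "keywords": "cp:keywords",
}


def read_docx_core_properties(path_or_file):
    """Return embedded metadata from a .docx file as a dict (hypothetical helper).

    Fields that are missing or empty are omitted from the result.
    """
    with zipfile.ZipFile(path_or_file) as package:
        try:
            xml_bytes = package.read("docProps/core.xml")
        except KeyError:
            return {}  # the package has no core-properties part
    root = ET.fromstring(xml_bytes)
    found = {}
    for field, tag in FIELDS.items():
        element = root.find(tag, NS)
        if element is not None and element.text and element.text.strip():
            found[field] = element.text.strip()
    return found
```

Calling `read_docx_core_properties("thesis.docx")` (the file name is only an example) returns whatever the author or authoring tool filled in. Any field that comes back could pre-populate the matching Scholar field before a model is asked to suggest the rest.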

hortongn commented 1 year ago

An AI toolkit for libraries (paper) https://insights.uksg.org/articles/10.1629/uksg.592

Integrating Ruby with OpenAI: A Beginner’s Guide https://ai.plainenglish.io/integrating-ruby-with-openai-a-beginners-guide-88ffaa10f202

GPT-JT is an open source GPT-3 alternative with a decentralized approach https://the-decoder.com/gpt-jt-is-an-open-source-gpt-3-alternative-with-a-decentralized-approach/

hortongn commented 1 year ago

How to use Microsoft AI Builder to Extract Data from PDF https://www.youtube.com/watch?v=J3d6bx3i4l0&ab_channel=KevinStratvert

MS PowerAutomate (part of Office 365) https://powerautomate.microsoft.com

haitzlm commented 1 year ago

Interesting: Text Analytics APIs are machine learning-powered services that allow developers to analyze and extract insights from text-based data. These APIs use natural language processing (NLP) techniques to automatically identify and extract entities, sentiments, topics, and other relevant information from text.

Here's a high-level overview of how Text Analytics APIs work:

1. Data Input: The API accepts text-based data as input, such as documents, social media posts, or customer feedback.
2. Preprocessing: The API preprocesses the input data to clean and normalize it. This may include tasks such as tokenization, stop-word removal, and stemming.
3. Feature Extraction: The API uses NLP techniques to extract features from the text data. This may include identifying entities such as people, organizations, and locations; extracting sentiments such as positive or negative; and identifying topics or themes.
4. Analysis and Output: The API analyzes the extracted features and generates insights or summaries based on the input data. The output may include visualizations, reports, or structured data that can be easily consumed by applications.

Some common use cases for Text Analytics APIs include sentiment analysis of social media data, entity extraction from news articles, and topic modeling for customer feedback.
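
The preprocessing and feature-extraction steps above can be sketched in plain Python. This is a toy illustration under loud assumptions, not how any hosted service works internally: the stop-word list is a tiny hand-picked sample, `naive_stem` is a crude stand-in for a real stemmer, and term frequency stands in for topic extraction. All function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "as", "for", "from", "in", "is",
              "of", "on", "or", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Strip a few common English suffixes (crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def suggest_keywords(text, limit=5):
    """Return the most frequent stems as candidate keywords."""
    counts = Counter(preprocess(text))
    return [term for term, _ in counts.most_common(limit)]
```

Passing a document's extracted text to `suggest_keywords` gives a rough list of candidate subject terms. A hosted API or a tool such as Annif would replace the frequency count with a trained model, but the input and output keep the same shape.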

Some popular Text Analytics APIs include:

- Google Cloud Natural Language API
- Microsoft Azure Cognitive Services Text Analytics API
- Amazon Comprehend
- IBM Watson Natural Language Understanding

By using Text Analytics APIs, developers can leverage the power of machine learning to extract valuable insights from text-based data with minimal effort and expertise.
