Description

Use a clustering algorithm to predict a class for each document. Classes may be genre, topic, usefulness, etc. Assigning a document to its closest cluster relies on a distance metric.
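As a rough illustration of how a distance metric drives cluster assignment, here is a minimal standard-library sketch: TF-IDF vectors, cosine distance, and a small k-means loop. Nothing here comes from the DocuMiner codebase. The function names, the whitespace tokenizer, the smoothed IDF formula, and the "first k documents" initialization are all assumptions made for the example, and a real implementation would more likely use a library such as scikit-learn.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn raw strings into sparse TF-IDF vectors (dict: term -> weight)."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Average a non-empty list of sparse vectors into a centroid."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def kmeans(vectors, k, iterations=10):
    """Assign each vector to the closest of k centroids under cosine distance."""
    if k > len(vectors):
        raise ValueError("k must not exceed the number of documents")
    centroids = [dict(vec) for vec in vectors[:k]]  # naive init: first k documents
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: closest centroid per document.
        labels = [
            min(range(k), key=lambda c: cosine_distance(vec, centroids[c]))
            for vec in vectors
        ]
        # Update step: recompute each centroid from its members.
        for c in range(k):
            members = [vec for vec, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels


docs = [
    "the cat sat on the mat with another cat",
    "stock markets fell as investors sold shares",
    "a cat and a kitten played on the mat",
    "investors bought shares as stock markets rose",
]
labels = kmeans(tfidf_vectors(docs), k=2)
print(labels)
```

The cluster ids are arbitrary integers; mapping them to meaningful classes (genre, topic, and so on) is a separate step. Swapping `cosine_distance` for another metric only requires changing the assignment step.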

Objectives

Implement different clustering algorithms to group documents into an arbitrary set of classes. Text similarity would be a good starting point for the distance metric.
Use zero-shot learning (ZSL) to classify documents into a set of predetermined classes. HuggingFace has a pipeline for that. Check out the comments in here.
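For the zero-shot objective, a minimal sketch using the `zero-shot-classification` pipeline from the HuggingFace `transformers` library might look like the following. The helper names `build_zero_shot_classifier` and `classify` are hypothetical, and the choice of `facebook/bart-large-mnli` as the model is an assumption rather than something this issue specifies. The import is done lazily so that `classify` can be exercised with any callable that returns the same result shape.

```python
def build_zero_shot_classifier(model_name="facebook/bart-large-mnli"):
    """Load a zero-shot pipeline. The model name is an assumption for this sketch."""
    # Imported lazily: loading requires `transformers` and a model download.
    from transformers import pipeline
    return pipeline("zero-shot-classification", model=model_name)


def classify(document, candidate_labels, classifier):
    """Return (best_label, score) for one document.

    `classifier` is expected to behave like the HuggingFace zero-shot pipeline:
    called with a string and `candidate_labels`, it returns a dict containing
    parallel "labels" and "scores" lists.
    """
    result = classifier(document, candidate_labels=candidate_labels)
    # Pick the highest-scoring label explicitly instead of relying on ordering.
    score, label = max(zip(result["scores"], result["labels"]))
    return label, score
```

Usage would then be along the lines of `classify(text, ["sports", "politics", "science"], build_zero_shot_classifier())`, where the candidate labels are whatever predetermined classes the project settles on.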

ufosc / DocuMiner

Unsupervised Classification of Documents #7
