ufosc / DocuMiner

A production-ready pipeline for text mining and subject indexing
MIT License

Unsupervised Classification of Documents #7

Open Fennec2000GH opened 2 years ago

Fennec2000GH commented 2 years ago

Description

Use a clustering algorithm to predict a class for each document. Classes may be genre, topic, usefulness, etc. Assigning each document to its closest cluster relies on a distance metric.
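To make the idea concrete, here is a minimal sketch of distance-based clustering using only the Python standard library. It is an illustration, not DocuMiner's implementation: the function names (`vectorize`, `cosine_distance`, `centroid`, `cluster`), the bag-of-words representation, the choice of cosine distance, and seeding with the first `k` documents are all assumptions made for the example.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a document into a sparse bag-of-words count vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty document is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors):
    """Average the term counts of all vectors in one cluster."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return Counter({term: count / len(vectors) for term, count in total.items()})


def cluster(docs, k, iterations=10):
    """Naive k-means-style clustering; returns one cluster index per document.

    Seeds the centroids with the first k documents (a simplification),
    then alternates between assignment and centroid updates.
    """
    vectors = [vectorize(doc) for doc in docs]
    centroids = list(vectors[:k])
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its closest centroid.
        labels = [
            min(range(k), key=lambda c: cosine_distance(vector, centroids[c]))
            for vector in vectors
        ]
        # Update step: recompute each centroid from its current members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = centroid(members)
    return labels
```

For example, `cluster(["the cat sat on the mat", "stock markets fell sharply today", "a cat chased the other cat", "markets and stock prices rose"], 2)` groups the two cat documents together and the two market documents together. A real pipeline would likely swap the raw counts for TF-IDF or embeddings and use a library implementation of the clustering step.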

Objectives

  1. Implement different clustering algorithms to classify documents into an arbitrary set of classes. Text similarity would be a good starting point for the distance metric.
  2. Use zero-shot learning (ZSL) to classify documents into a set of predetermined classes. HuggingFace has a pipeline for that. Check out the comments in here.
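For the second objective, a rough sketch of how the HuggingFace `zero-shot-classification` pipeline might be wired in is shown below. The helper names `build_classifier` and `classify_documents` are hypothetical, and the default model `facebook/bart-large-mnli` is an assumption (it is a commonly used checkpoint for this task, not something this issue specifies). The `transformers` import is deferred so the classification helper can be exercised with any stand-in callable.

```python
def build_classifier(model_name="facebook/bart-large-mnli"):
    """Load the HuggingFace zero-shot pipeline.

    Requires `pip install transformers` plus a backend such as PyTorch,
    and downloads the model on first use. The model name is an assumption.
    """
    from transformers import pipeline  # imported lazily on purpose

    return pipeline("zero-shot-classification", model=model_name)


def classify_documents(docs, candidate_labels, classifier):
    """Return the highest-scoring candidate label for each document.

    `classifier` is any callable with the zero-shot pipeline's call shape:
    classifier(text, candidate_labels=[...]) returns a dict whose "labels"
    list is sorted from most to least likely.
    """
    predictions = []
    for doc in docs:
        result = classifier(doc, candidate_labels=candidate_labels)
        predictions.append(result["labels"][0])
    return predictions
```

With the dependencies installed, `classify_documents(docs, ["genre", "topic"], build_classifier())` would return one predicted label per document. Passing the classifier in as an argument keeps the model choice separate from the classification loop, so a different checkpoint or a test stub can be swapped in.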