Open stephbuon opened 2 years ago
Term Frequency-Inverse Document Frequency (TF-IDF)

Term Frequency-Inverse Document Frequency, often simply called TF-IDF, measures how important a word or phrase is to one document relative to the rest of a corpus. It does so by calculating how often a term appears in a document and then weighting that frequency by how rare the term is across the other documents in the collection. Terms that are frequent in one document but uncommon elsewhere receive high scores. TF-IDF is useful because it highlights the words that distinguish a document rather than words that are common to a multitude of texts, such as ‘the’, ‘but’, or ‘and’.

One way scholars use TF-IDF is to identify what a document in a collection is about. For instance, a document in a collection about major events in American history may have high TF-IDF scores for the words ‘Antietam’ and ‘Confederacy’. Because it uses these terms disproportionately more than other documents in the collection, scholars can identify this text’s subject as the Battle of Antietam during the Civil War. Scholars may also have multiple documents on the same topic. TF-IDF can help differentiate these documents by highlighting the key words and phrases that set them apart from one another.
(see syllabus for instructions).
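The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `tf_idf`, the toy `corpus`, and its document names are all hypothetical, and the weighting used here (term count divided by document length, multiplied by the natural log of the number of documents over the number of documents containing the term) is only one common variant. Library implementations often apply smoothing or normalization, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score every term in every document.

    docs maps a document name to a list of tokens.
    Returns a dict: document name -> {term: TF-IDF score}.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on an empty document
        scores[name] = {
            # term frequency in this document * inverse document frequency
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


# Hypothetical toy corpus, already lowercased and split into tokens.
corpus = {
    "antietam": "the battle of antietam and the confederacy".split(),
    "apollo": "the landing of apollo and the moon".split(),
    "railroad": "the building of the railroad and the golden spike".split(),
}

scores = tf_idf(corpus)

# Print the terms of one document from highest to lowest score.
for term, score in sorted(scores["antietam"].items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {score:.3f}")
```

In this toy corpus, ‘the’, ‘of’, and ‘and’ appear in every document, so their inverse document frequency is log(1) = 0 and they score zero, while ‘antietam’ and ‘confederacy’ appear in only one document and score above zero.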