uchicago-library / uchicago-ldr

This is a set of common classes for programmers working in the uchicago ldr to develop tools for managing resources

text processing module #2

Open verbalhanglider opened 9 years ago

verbalhanglider commented 9 years ago

We need a text processing module for handling text data in accessions in the ldr.

bnbalsamo commented 9 years ago

The goals of this module are primarily twofold:

1) Generate the "aboutness" half of relevance metrics, in order to differentiate the content of item-level materials within a collection. At the moment this is being approached by computing TFIDF metrics for each document in a batch.

2) Determine document similarity to any categorically restricted materials. Current thinking points towards a vector space model that uses TFIDF weights and training sets for the different restriction categories, effectively making this a sort of topic modeling problem with two major topics: restricted and not restricted.
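To make the two goals concrete, here is a minimal sketch of TFIDF weighting plus cosine similarity in a vector space, using only the Python standard library. It is not the module's actual code: the function names (`tokenize`, `tfidf_vectors`, `cosine_similarity`), the naive tokenizer, the plain `tf * log(N / df)` weighting, and the sample documents are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric runs (deliberately naive)."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document in the batch."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical batch: the first document stands in for a "restricted" training example.
    batch = [
        "payroll records social security number",
        "social security number and salary records",
        "photographs of campus buildings",
    ]
    restricted, *others = tfidf_vectors(batch)
    for text, vector in zip(batch[1:], others):
        print(text, "->", round(cosine_similarity(restricted, vector), 3))
```

A document scoring above some threshold against the restricted training examples would be flagged for review. Picking that threshold, and assembling representative training sets, is the hard part that this sketch leaves out.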

bnbalsamo commented 9 years ago

Eileen has expressed an interest in the ability to pull date-formatted data from appropriate arrangements of digital documents, and potentially to sort/arrange it. As I think this is primarily a text parsing problem (when this data isn't coming from file-level metadata), I'm going to tack this onto the stated purpose of this module.

bnbalsamo commented 9 years ago

A brief update to this issue:

Generation of TFIDF numbers and vector space similarity metrics now exists for text documents.

Identifying date strings in wild text is proving to be a rather more difficult problem. Basic regexes will catch dates that are regularly formatted, but this approach produces many false negatives and an appreciable number of false positives. Projects such as parsedatetime (https://pypi.python.org/pypi/parsedatetime/) will attempt to parse standardized date data out of n-grams, but this leaves the problem of generating n-grams from converted text content. Sanitization and n-gram generation are non-trivial problems to solve at acceptable accuracy levels for such widely varied content. For this reason, date identification might be best left until I can devote more time to understanding modern natural language processing approaches to recognition.
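For anyone picking this up later, the n-gram approach can be sketched roughly as follows, with `datetime.strptime` from the standard library standing in for a real parser such as parsedatetime. Everything here is an assumption for illustration: the `word_ngrams` and `find_dates` helpers, the short `DATE_FORMATS` list, and the whitespace-and-punctuation sanitization. The month-name formats also assume an English locale.

```python
from datetime import datetime

# Hypothetical, deliberately short list of formats; real content needs far more.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d %Y", "%d %B %Y")


def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def find_dates(text, max_n=3):
    """Try to parse each 1..max_n word n-gram as a date; return (string, date) pairs."""
    # Naive sanitization: split on whitespace and strip surrounding punctuation.
    tokens = [tok.strip(".,;:()[]\"'") for tok in text.split()]
    tokens = [tok for tok in tokens if tok]

    found = []
    for n in range(max_n, 0, -1):  # longest candidates first
        for gram in word_ngrams(tokens, n):
            candidate = " ".join(gram)
            for fmt in DATE_FORMATS:
                try:
                    parsed = datetime.strptime(candidate, fmt).date()
                except ValueError:
                    continue  # this format does not fit; try the next one
                found.append((candidate, parsed))
                break
    return found


if __name__ == "__main__":
    print(find_dates("Meeting held on March 3, 1998 in Regenstein."))
    print(find_dates("Received 2015-06-30."))
    print(find_dates("Dated the 3rd of March."))  # a false negative: nothing is found
```

The last call shows the false-negative problem in miniature: anything outside the format list, or mangled during text conversion, is silently missed, and loosening the formats starts pulling in false positives instead.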