fsteeg opened 6 years ago
Added more text to the initial comment after accidentally posting it. Thoughts, @acka47 @dr0i?
A quick shot: Interesting fields could be `title`, `shortTitle`, `otherTitleInformation`, or `bibliographicCitation`. The source would have to be a different one, or maybe we add it to the title like the DNB supposedly does (I can't open it, the converter is broken).
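To make the idea concrete, here is a minimal sketch of combining the title-like fields of a resource into one feature string. The record structure is a simplified assumption about the lobid JSON; only the four field names mentioned above come from the discussion, the helper name and toy record are made up.

```python
def title_features(record):
    """Concatenate the title-like fields of a record into one string.

    Handles both single string values and lists of strings, since
    repeatable fields come back as lists in the JSON.
    """
    parts = []
    for field in ("title", "shortTitle", "otherTitleInformation",
                  "bibliographicCitation"):
        value = record.get(field)
        if isinstance(value, list):
            parts.extend(value)
        elif value:
            parts.append(value)
    return " ".join(parts)

# Toy record, invented for illustration:
record = {"title": "Linked Open Data",
          "otherTitleInformation": ["the essentials"]}
print(title_features(record))  # -> Linked Open Data the essentials
```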
There has already been a lot of discussion about automatic subject indexing, of which I have only noticed a small part. From what I heard, it makes the most sense to assist subject indexers with automatic tools in a semi-automatic process. And I guess it doesn't make much sense to guess subjects based only on the bibliographic data. My feeling is that you'd at least need an abstract to get a satisfying result (but I haven't reviewed the literature and could not quickly find a project where this has been tested).
We have around 446k resources with a fulltext or summary link and without subject information, see http://lobid.org/resources/search?q=NOT+_exists_%3Asubject+AND+%28_exists_%3AfulltextOnline+OR+_exists_%3Adescription%29 compared to 532k that have both (http://lobid.org/resources/search?q=_exists_%3Asubject+AND+%28_exists_%3AfulltextOnline+OR+_exists_%3Adescription%29). Maybe that is a good subset to start with. (Probably, we won't have access to 100% of the linked fulltexts via the hbz network, but I guess we could retrieve the majority.)
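The two query URLs above can be generated from one subset definition, which keeps the with/without-subject pair in sync. This is just a sketch of the URL construction; the query syntax itself is taken verbatim from the links above.

```python
from urllib.parse import urlencode

BASE = "http://lobid.org/resources/search"

def subset_url(with_subject):
    """Build the lobid query for resources with a fulltext or summary
    link, either with or without subject information."""
    prefix = "" if with_subject else "NOT "
    q = (prefix + "_exists_:subject AND "
         "(_exists_:fulltextOnline OR _exists_:description)")
    return BASE + "?" + urlencode({"q": q})

print(subset_url(False))
print(subset_url(True))
```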
I have wanted to play with automatic subject enrichment for a long time. Without fulltexts we would have to be braver, since we will produce more inadequate subjects. Also, concerning the non-fulltexts: for resources without subjects but with authors linked to other resources that already have subjects, I think we wouldn't be that bad at guessing the proper subjects, because then we would already have some domain-specific knowledge. (The same would be true, although broader and less exact, with the publisher: e.g. O'Reilly can be expected to publish in the IT/tech domain.) So, there are ways to get proper subjects. Also, I think RVK is not fully incorporated in hbz01 because there is no good matching with resources in e.g. the BVB, right @acka47? I would go the easy and safe paths first, in this order:
Add to, remove from, or rearrange that list.
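The author-based guess described above could be sketched like this: collect the subjects of other resources by the same contributors and propose the most common ones. The field names (`authors`, `subjects`) and the toy catalog are invented for illustration and do not match the real lobid schema.

```python
from collections import Counter

def guess_subjects(target, catalog, top_n=3):
    """Propose subjects for `target` based on other resources that
    share at least one author with it, most frequent first."""
    counts = Counter()
    target_authors = set(target.get("authors", []))
    for resource in catalog:
        if resource is target:
            continue  # don't count the resource we are classifying
        if target_authors & set(resource.get("authors", [])):
            counts.update(resource.get("subjects", []))
    return [subject for subject, _ in counts.most_common(top_n)]

# Invented toy catalog:
catalog = [
    {"authors": ["Doe, J."], "subjects": ["Informatik", "Programmierung"]},
    {"authors": ["Doe, J."], "subjects": ["Informatik"]},
    {"authors": ["Roe, R."], "subjects": ["Geschichte"]},
]
new_resource = {"authors": ["Doe, J."], "subjects": []}
print(guess_subjects(new_resource, catalog))  # -> ['Informatik', 'Programmierung']
```

A publisher-based variant would work the same way, just matching on the publisher field instead of the author set.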
Here is a highly relevant article from 2017 titled "Using Titles vs. Full-text as Source for Automated Semantic Document Annotation": https://arxiv.org/abs/1705.05311
The abstract says:
> The results show that across three of our four datasets, the performance of the classifications using only titles reaches over 90% of the quality compared to the classification performance when using the full-text. Thus, conducting document classification by just using the titles is a reasonable approach for automated semantic annotation and opens up new possibilities for enriching Knowledge Graphs.
There is now a newer article which implies the same: https://arxiv.org/abs/1801.06717. So we should go with title-based subject extraction!
Notably, the projects in the two papers operated on quite homogeneous data ("two datasets are obtained from scientific digital libraries in the domains of economics and political sciences along with two news datasets from Reuters and New York Times"). Compared to this, the hbz catalog has descriptions of very heterogeneous bibliographic resources. It would probably make sense to create more homogeneous subsets first before conducting the automatic indexing, or at least to take important information about the field of the resource into account. As I recently said in a face-to-face talk: a good indicator of the topic is also given by the holding institutions, as many of them have a collection focus.
E.g. to get a list of libraries from the hbz network whose collections include resources on economics you can ask lobid-organisations like this: http://lobid.org/organisations/search?q=linkedTo.id%3A%22http%3A%2F%2Flobid.org%2Forganisations%2FDE-605%23%21%22+AND+Wirtschaftswissenschaften&location=
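The organisations query above can be parameterized by keyword, so the same lookup works for other collection foci. A minimal sketch; the query string is taken from the link above, only the function name is made up.

```python
from urllib.parse import urlencode

def focus_libraries_url(keyword):
    """Build a lobid-organisations query for hbz-network libraries
    (linked to DE-605) whose description matches the given keyword."""
    q = ('linkedTo.id:"http://lobid.org/organisations/DE-605#!" AND '
         + keyword)
    return "http://lobid.org/organisations/search?" + urlencode({"q": q})

print(focus_libraries_url("Wirtschaftswissenschaften"))
```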
As a smaller set that is also somewhat more homogeneous than the full catalog we could use NWBib:
https://test.nwbib.de/search?location=&q=NOT+_exists_%3Asubject https://test.nwbib.de/search?location=&q=_exists_%3Asubject
Completing the NWBib subjects would also be a self-contained project before taking on the full catalog.
+1 I think the NWBib editors would even review the results and take over correct subjects into Aleph.
I think we have a useful setup for further experiments in https://github.com/fsteeg/python-data-analysis:

- `nwbib_subjects_load.py` and `nwbib_subjects_process.py` provide a way to run multiple experiments against a small data set (~500 entries); the best result with the checked-in config is 0.38 (38% correctly classified entries), output at https://github.com/fsteeg/python-data-analysis/blob/master/nwbib/nwbib-subjects-predict.csv
- `nwbib_subjects_bulk.py` uses a bulk request to get more data from the lobid API. The checked-in config uses a mid-size data set (~30k entries, 99% for training, 1% for testing), resulting in an accuracy of 0.47, output at https://github.com/fsteeg/python-data-analysis/blob/master/nwbib/nwbib-subjects-bulk-predict.csv

The bulk classification uses a different vectorizer than the small-set experiments to make it work with larger data sets, so the results are not directly comparable. The mid-size and full-size runs however use the same setup and yield very comparable results, so the mid-size setup should be a useful basis for further experiments with a runtime low enough to try multiple setups (which features to use, how to configure the vectorizer, which classifier to use and how to configure it).
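For readers who haven't opened the repo: the core idea of title-based classification can be shown in a few lines. This is a deliberately simplified, stdlib-only toy (the actual scripts use proper vectorizers and classifiers); it just scores token overlap between a title and the tokens seen per subject, with invented German example titles.

```python
from collections import Counter, defaultdict

def tokenize(title):
    return title.lower().split()

def train(samples):
    """samples: list of (title, subject) pairs.
    Returns per-subject token counts as a bag-of-words model."""
    model = defaultdict(Counter)
    for title, subject in samples:
        model[subject].update(tokenize(title))
    return model

def classify(model, title):
    """Pick the subject whose training tokens overlap the title most."""
    tokens = tokenize(title)
    return max(model, key=lambda s: sum(model[s][t] for t in tokens))

model = train([("Einführung in die Informatik", "Informatik"),
               ("Geschichte des Rheinlands", "Geschichte"),
               ("Informatik für Anfänger", "Informatik")])
print(classify(model, "Informatik und Gesellschaft"))  # -> Informatik
```

The real experiments replace the token counting with a vectorizer (a hashing-based one for the bulk runs, hence the non-comparable results) and the `max` with a trained classifier.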
Some areas to investigate next:
Should we reconsider this, since it was tested for NWBib? See https://github.com/hbz/nwbib/issues/560
As announced in our internal planning document (AEP), we want to expand our expertise in text mining.
As a reasonable, well-defined, and useful project in that area, I suggest we should attempt to set up automatic classification for titles in our union catalog. More than half of our catalog has no subjects:
http://lobid.org/resources/search?q=NOT+_exists_%3Asubject http://lobid.org/resources/search?q=_exists_%3Asubject
The basic approach could be this: we use a part of the classified documents as our training set, and the rest of the classified documents as the gold standard to evaluate our classification method. The training and gold sets will have to contain a selection of documents across all subjects. With this basic setup, we can experiment with different features to represent a document and different classification algorithms.
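The evaluation setup described above can be sketched as follows: split the classified documents into a training set and a gold set, then measure how often the method's guess matches the gold subject. This simple sketch uses a random split rather than the per-subject stratified selection the text calls for; `classify` stands in for whatever method is being evaluated.

```python
import random

def split(documents, train_fraction=0.9, seed=42):
    """Shuffle (reproducibly) and split into training and gold sets."""
    docs = documents[:]
    random.Random(seed).shuffle(docs)
    cut = int(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]

def accuracy(gold, classify):
    """Fraction of gold documents where the guess matches the subject."""
    hits = sum(1 for title, subject in gold if classify(title) == subject)
    return hits / len(gold)

# Invented toy data: 100 (title, subject) pairs.
docs = [("t%d" % i, "A" if i % 2 else "B") for i in range(100)]
train_set, gold_set = split(docs)
print(len(train_set), len(gold_set))  # -> 90 10
print(accuracy(gold_set, lambda title: "A"))
```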
When we get good results with our gold set, we can apply our classification method to the unclassified documents without subjects. These can get a new value for `subject.source`, allowing them to be treated differently in queries and display.
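A small sketch of that tagging idea: machine-generated subjects get a distinct source, so queries and display can filter on it. The source id used here is a placeholder, not an agreed value, and the record shape is a simplification of the real data.

```python
AUTO_SOURCE = "https://example.org/auto-classification"  # placeholder id

def add_auto_subject(resource, label):
    """Append a machine-generated subject, marked with its own source."""
    subject = {"label": label, "source": {"id": AUTO_SOURCE}}
    resource.setdefault("subject", []).append(subject)
    return resource

resource = add_auto_subject({"title": "Ein Buch"}, "Informatik")
# Display code could then separate manual from automatic subjects:
manual = [s for s in resource["subject"]
          if s["source"]["id"] != AUTO_SOURCE]
print(len(resource["subject"]), len(manual))  # -> 1 0
```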