fsteeg opened 6 years ago
Added more text to the initial comment after accidentally posting it. Thoughts, @acka47 @dr0i?
A quick shot: Interesting fields could be `title`, `shortTitle`, `otherTitleInformation`, or `bibliographicCitation`. The source would have to be a different one, or maybe we add it to the title like the DNB supposedly does (I can't open it, the converter is broken).
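To make the idea concrete, here is a minimal sketch of combining the title-like fields of a resource into one feature string. The record structure is a simplified assumption about the lobid JSON; only the four field names mentioned above come from the discussion, the helper name and toy record are made up.

```python
def title_features(record):
    """Concatenate the title-like fields of a record into one string.

    Handles both single string values and lists of strings, since
    repeatable fields come back as lists in the JSON.
    """
    parts = []
    for field in ("title", "shortTitle", "otherTitleInformation",
                  "bibliographicCitation"):
        value = record.get(field)
        if isinstance(value, list):
            parts.extend(value)
        elif value:
            parts.append(value)
    return " ".join(parts)

# Toy record, invented for illustration:
record = {"title": "Linked Open Data",
          "otherTitleInformation": ["the essentials"]}
print(title_features(record))  # -> Linked Open Data the essentials
```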
There has already been a lot of discussion about automatic subject indexing, of which I have only noticed a small part. From what I heard, it makes the most sense to assist subject indexers with automatic tools in a semi-automatic process. And I guess it doesn't make much sense to guess subjects based only on the bibliographic data. My feeling is that you'd at least need an abstract to get a satisfying result (but I haven't reviewed the literature and could not quickly find a project where this has been tested).
We have around 446k resources with a fulltext or summary link and without subject information, see http://lobid.org/resources/search?q=NOT+_exists_%3Asubject+AND+%28_exists_%3AfulltextOnline+OR+_exists_%3Adescription%29 compared to 532k that have both (http://lobid.org/resources/search?q=_exists_%3Asubject+AND+%28_exists_%3AfulltextOnline+OR+_exists_%3Adescription%29). Maybe that is a good subset to start with. (Probably, we won't have access to 100% of the linked fulltexts via the hbz network, but I guess we could retrieve the majority.)
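The two query URLs above can be generated from one subset definition, which keeps the with/without-subject pair in sync. This is just a sketch of the URL construction; the query syntax itself is taken verbatim from the links above.

```python
from urllib.parse import urlencode

BASE = "http://lobid.org/resources/search"

def subset_url(with_subject):
    """Build the lobid query for resources with a fulltext or summary
    link, either with or without subject information."""
    prefix = "" if with_subject else "NOT "
    q = (prefix + "_exists_:subject AND "
         "(_exists_:fulltextOnline OR _exists_:description)")
    return BASE + "?" + urlencode({"q": q})

print(subset_url(False))
print(subset_url(True))
```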
I have wanted to play with automatic subject enrichment for a long time. Without fulltexts we would have to be braver, since we will produce more inadequate subjects. Also, concerning the non-fulltexts: for resources without subjects but with authors linked to other resources that already have subjects, I think we wouldn't be that bad at guessing the proper subjects, because then we would already have some domain-specific knowledge. (The same would be true, although broader and less exact, with the publisher: e.g. O'Reilly can be expected to publish in the IT/tech domain.) So, there are ways to get proper subjects. Also, I think RVK is not fully incorporated in hbz01 because there is no good matching with resources in e.g. the BVB, right @acka47? I would go the easy and safe paths first, in this order:
Add to, remove from, or rearrange that list.
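The author-based guess described above could be sketched like this: collect the subjects of other resources by the same contributors and propose the most common ones. The field names (`authors`, `subjects`) and the toy catalog are invented for illustration and do not match the real lobid schema.

```python
from collections import Counter

def guess_subjects(target, catalog, top_n=3):
    """Propose subjects for `target` based on other resources that
    share at least one author with it, most frequent first."""
    counts = Counter()
    target_authors = set(target.get("authors", []))
    for resource in catalog:
        if resource is target:
            continue  # don't count the resource we are classifying
        if target_authors & set(resource.get("authors", [])):
            counts.update(resource.get("subjects", []))
    return [subject for subject, _ in counts.most_common(top_n)]

# Invented toy catalog:
catalog = [
    {"authors": ["Doe, J."], "subjects": ["Informatik", "Programmierung"]},
    {"authors": ["Doe, J."], "subjects": ["Informatik"]},
    {"authors": ["Roe, R."], "subjects": ["Geschichte"]},
]
new_resource = {"authors": ["Doe, J."], "subjects": []}
print(guess_subjects(new_resource, catalog))  # -> ['Informatik', 'Programmierung']
```

A publisher-based variant would work the same way, just matching on the publisher field instead of the author set.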
Here is a highly relevant article from 2017 titled "Using Titles vs. Full-text as Source for Automated Semantic Document Annotation": https://arxiv.org/abs/1705.05311
The abstract says:
> The results show that across three of our four datasets, the performance of the classifications using only titles reaches over 90% of the quality compared to the classification performance when using the full-text. Thus, conducting document classification by just using the titles is a reasonable approach for automated semantic annotation and opens up new possibilities for enriching Knowledge Graphs.
There is now a newer article which implies the same: https://arxiv.org/abs/1801.06717. So we should go with title-based subject extraction!
Notably, the projects in the two papers operated on quite homogeneous data ("two datasets are obtained from scientific digital libraries in the domains of economics and political sciences along with two news datasets from Reuters and New York Times"). Compared to this, the hbz catalog has descriptions of very heterogeneous bibliographic resources. It would probably make sense to create more homogeneous subsets first before conducting the automatic indexing, or at least to take important information about the field of the resource into account. As I recently said in a face-to-face talk: a good indicator of the topic is also given by the holding institutions, as many of them have a collection focus.
E.g. to get a list of libraries from the hbz network whose collections include resources on economics you can ask lobid-organisations like this: http://lobid.org/organisations/search?q=linkedTo.id%3A%22http%3A%2F%2Flobid.org%2Forganisations%2FDE-605%23%21%22+AND+Wirtschaftswissenschaften&location=
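The organisations query above can be parameterized by keyword, so the same lookup works for other collection foci. A minimal sketch; the query string is taken from the link above, only the function name is made up.

```python
from urllib.parse import urlencode

def focus_libraries_url(keyword):
    """Build a lobid-organisations query for hbz-network libraries
    (linked to DE-605) whose description matches the given keyword."""
    q = ('linkedTo.id:"http://lobid.org/organisations/DE-605#!" AND '
         + keyword)
    return "http://lobid.org/organisations/search?" + urlencode({"q": q})

print(focus_libraries_url("Wirtschaftswissenschaften"))
```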
As a smaller set that is also somewhat more homogeneous than the full catalog we could use NWBib:
https://test.nwbib.de/search?location=&q=NOT+_exists_%3Asubject https://test.nwbib.de/search?location=&q=_exists_%3Asubject
Completing the NWBib subjects would also be a self-contained project before taking on the full catalog.
+1 I think the NWBib editors would even review the results and take over correct subjects into Aleph.
I think we have a useful setup for further experiments in https://github.com/fsteeg/python-data-analysis:

- `nwbib_subjects_load.py` and `nwbib_subjects_process.py` provide a way to run multiple experiments against a small data set (~500 entries); the best result with the checked-in config is 0.38 (38% correctly classified entries), output at https://github.com/fsteeg/python-data-analysis/blob/master/nwbib/nwbib-subjects-predict.csv
- `nwbib_subjects_bulk.py` uses a bulk request to get more data from the lobid API. The checked-in config uses a mid-size data set (~30k entries, 99% for training, 1% for testing), resulting in an accuracy of 0.47, output at https://github.com/fsteeg/python-data-analysis/blob/master/nwbib/nwbib-subjects-bulk-predict.csv

The bulk classification uses a different vectorizer than the small-set experiments to make it work with larger data sets, so the results are not directly comparable. The mid-size and full-size runs however use the same setup and yield very comparable results, so the mid-size setup should be a useful basis for further experiments with a runtime low enough to try multiple setups (which features to use, how to configure the vectorizer, which classifier to use and how to configure it).
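For readers who haven't opened the repo: the core idea of title-based classification can be shown in a few lines. This is a deliberately simplified, stdlib-only toy (the actual scripts use proper vectorizers and classifiers); it just scores token overlap between a title and the tokens seen per subject, with invented German example titles.

```python
from collections import Counter, defaultdict

def tokenize(title):
    return title.lower().split()

def train(samples):
    """samples: list of (title, subject) pairs.
    Returns per-subject token counts as a bag-of-words model."""
    model = defaultdict(Counter)
    for title, subject in samples:
        model[subject].update(tokenize(title))
    return model

def classify(model, title):
    """Pick the subject whose training tokens overlap the title most."""
    tokens = tokenize(title)
    return max(model, key=lambda s: sum(model[s][t] for t in tokens))

model = train([("Einführung in die Informatik", "Informatik"),
               ("Geschichte des Rheinlands", "Geschichte"),
               ("Informatik für Anfänger", "Informatik")])
print(classify(model, "Informatik und Gesellschaft"))  # -> Informatik
```

The real experiments replace the token counting with a vectorizer (a hashing-based one for the bulk runs, hence the non-comparable results) and the `max` with a trained classifier.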
Some areas to investigate next:
Should we reconsider this, since it was tested for NWBib? See https://github.com/hbz/nwbib/issues/560
As announced in our internal planning document (AEP), we want to expand our expertise in text mining.
As a reasonable, well-defined, and useful project in that area, I suggest we should attempt to set up automatic classification for titles in our union catalog. More than half of our catalog has no subjects:
http://lobid.org/resources/search?q=NOT+_exists_%3Asubject http://lobid.org/resources/search?q=_exists_%3Asubject
The basic approach could be this: we use a part of the classified documents as our training set, and the rest of the classified documents as the gold standard to evaluate our classification method. The training and gold sets will have to contain a selection of documents across all subjects. With this basic setup, we can experiment with different features to represent a document and different classification algorithms.
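The evaluation setup described above can be sketched as follows: split the classified documents into a training set and a gold set, then measure how often the method's guess matches the gold subject. This simple sketch uses a random split rather than the per-subject stratified selection the text calls for; `classify` stands in for whatever method is being evaluated.

```python
import random

def split(documents, train_fraction=0.9, seed=42):
    """Shuffle (reproducibly) and split into training and gold sets."""
    docs = documents[:]
    random.Random(seed).shuffle(docs)
    cut = int(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]

def accuracy(gold, classify):
    """Fraction of gold documents where the guess matches the subject."""
    hits = sum(1 for title, subject in gold if classify(title) == subject)
    return hits / len(gold)

# Invented toy data: 100 (title, subject) pairs.
docs = [("t%d" % i, "A" if i % 2 else "B") for i in range(100)]
train_set, gold_set = split(docs)
print(len(train_set), len(gold_set))  # -> 90 10
print(accuracy(gold_set, lambda title: "A"))
```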
When we get good results with our gold set, we can apply our classification method to the unclassified documents without subjects. These can get a new value for `subject.source`, allowing them to be treated differently in queries and display.
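A small sketch of that tagging idea: machine-generated subjects get a distinct source, so queries and display can filter on it. The source id used here is a placeholder, not an agreed value, and the record shape is a simplification of the real data.

```python
AUTO_SOURCE = "https://example.org/auto-classification"  # placeholder id

def add_auto_subject(resource, label):
    """Append a machine-generated subject, marked with its own source."""
    subject = {"label": label, "source": {"id": AUTO_SOURCE}}
    resource.setdefault("subject", []).append(subject)
    return resource

resource = add_auto_subject({"title": "Ein Buch"}, "Informatik")
# Display code could then separate manual from automatic subjects:
manual = [s for s in resource["subject"]
          if s["source"]["id"] != AUTO_SOURCE]
print(len(resource["subject"]), len(manual))  # -> 1 0
```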