inspirehep / magpie

Deep neural network framework for multi-label text classification
MIT License

Question about prediction accuracy improvement #161

dyf180615 opened this issue 6 years ago

dyf180615 commented 6 years ago

A very good tool! I have been testing it on the following application scenario: Chinese documents describing environmental penalties, with 26 labels indicating the types of problem found at an enterprise; each document carries 1-5 labels. The length distribution is: 5,000 documents of 0-49 bytes, 3,300 of 50-100 bytes, 1,100 of 100-149 bytes, and 500 of 150-700 bytes. The labels are skewed: 3-4 labels account for up to 80% of occurrences cumulatively, they often appear alone or in combination, their associations are complex, and the text describing them is similar.

My setup: 10,000 documents in total, 8,000 for training and 2,000 for testing; CNN model, word-vector dimension 128, batch size 64, epochs = 10. I then used the trained model to predict on the same 10,000 documents. As a filter, I keep labels with confidence greater than 0.2, capped at the top 5. The results: about 5,000 documents (roughly 66%) are completely correct, and about 90% of those are single-label; only about 5% are completely wrong, meaning no correct label was predicted. Another 15% are over-predictions (extra labels, but the correct ones are also hit), and 5% are under-predictions (some correct labels missing, but others hit). I am very satisfied with the single-label results, but the problems with multi-label documents are prominent. Over-prediction is a real headache, and most of the over- and under-predictions involve the 3-4 dominant labels I mentioned, which are hard to distinguish. How can I improve accuracy to what I want: 95% fully correct on single-label documents and fewer over-predicted labels, so that overall label correctness reaches 80-90%? Is that realistic?

Some additional questions:

1. During word segmentation I add stop words that I believe can be removed, and force some common terms in via a custom dictionary. Does this help? What I observe is better discrimination between the similar parts of labels, but it also lowers the confidence of some labels that previously scored highly.
2. For word-vector training I use my 10,000 documents. Is that too little to establish a proper distance space for each word vector? Would it be better to use word vectors pretrained on someone else's 120 GB Chinese corpus? Do I need to worry about the dictionaries differing? (I use an extra custom dictionary.)
3. Regarding the attention mechanism: is it applicable to my scenario, i.e. increasing the model's focus on certain specific words in the text, and how would I apply it to magpie?
4. Any other methods you feel are likely to increase prediction accuracy.

Thank you very much, I hope to get your help.
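For reference, the setup described above maps roughly onto Magpie's documented API as follows. This is only a sketch: the corpus path, the label names, the sample text, and the post-filtering step are illustrative placeholders, and the batch size and CNN architecture are left to Magpie's defaults/configuration.

```python
from magpie import Magpie

# The 26 penalty-type labels; the names here are placeholders
labels = ['label_%d' % i for i in range(26)]

magpie = Magpie()
# Train a 128-dimensional word2vec embedding on the same corpus
magpie.init_word_vectors('/path/to/corpus', vec_dim=128)
# An 8,000/2,000 split corresponds to test_ratio=0.2
magpie.train('/path/to/corpus', labels, test_ratio=0.2, epochs=10)

# predict_from_text() returns (label, confidence) pairs sorted by confidence
ranked = magpie.predict_from_text('<a Chinese penalty document>')

# Post-filtering described above: confidence > 0.2, capped at the top 5
kept = [(lab, conf) for lab, conf in ranked if conf > 0.2][:5]
```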

jstypka commented 6 years ago

Hey @dyf180615, good questions! The problem is complex and I can't solve it for you in a GitHub comment, but here are a few tips:

1) Do not filter out stopwords. The model is designed to use them, so in principle it should give better results without filtering.
2) 10,000 documents should be enough for training the word2vec embedding, but you might get better results by pretraining the embeddings on a different corpus. The only way to know is to check, though I wouldn't bet my money on it. I also wouldn't worry about dictionary differences unless you expect the corpora to have vastly different vocabularies, e.g. training on legal documents and predicting on physics papers.
3) What is probably happening is that your model is very eager to predict the most popular labels and very reluctant to predict the rare ones. That usually falsely inflates your accuracy but makes the model biased. You might want to explore training and testing on a more balanced dataset where the label occurrences are more or less equal (the same number of documents with label 1 as with label 2, etc.).
4) Picking 0.2 as a global cutoff might also favour the most popular labels. You could pick a different threshold for each label to capture that. In the ideal scenario the thresholds would themselves be trained and fitted on the data (though this is hard); a sketch of the idea follows below.
5) Try decreasing the word2vec vector size from 128 to 60-70. I would expect the accuracy to stay the same and the training to be much faster.
6) I have only used Magpie on European languages, so if your dataset is in Chinese, then some of these rules might not hold :)
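On point 4), here is a minimal sketch of fitting one threshold per label on held-out data by maximizing per-label F1. `y_true` (binary, shape `(n_docs, n_labels)`) and `y_conf` (the model's confidence scores, same shape) are illustrative inputs, not Magpie objects.

```python
import numpy as np
from sklearn.metrics import f1_score

def fit_thresholds(y_true, y_conf, grid=np.linspace(0.05, 0.95, 19)):
    """For each label, scan a grid of cutoffs on validation data and keep
    the one that maximizes that label's F1 score."""
    n_labels = y_true.shape[1]
    thresholds = np.full(n_labels, 0.2)  # fall back to the global 0.2
    for j in range(n_labels):
        scores = [f1_score(y_true[:, j], y_conf[:, j] >= t) for t in grid]
        thresholds[j] = grid[int(np.argmax(scores))]
    return thresholds

# At prediction time, compare each label's confidence to its own threshold:
#   predicted = y_conf >= thresholds   (thresholds broadcast over rows)
```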

Hope that helps a bit!

dyf180615 commented 6 years ago

@jstypka Thanks so much for your tips! I saw your reply before I went to sleep yesterday and even dreamt of some ideas similar to yours, but I forgot them all after waking up in the morning :) First of all, thank you very much for the suggestions. I will run the tests one by one and follow up with feedback. Here are some remaining points of confusion:

1. I think there is a big difference between Chinese and European languages. European words are separated by spaces, so tokenization is clear-cut, whereas Chinese has to be segmented for the specific application scenario; how characters combine into words is complicated, and there is a lot of unrelated interference. I thought a good stop-word design would make the content of a document more focused, though perhaps the model's own mechanics can absorb that interference; I don't quite understand whether the model works on word distributions or on semantic recognition. I even considered using the dictionary to boil a document down to a few human-chosen keywords, so that a sentence or paragraph becomes just the few words that decide between labels! But how can the model learn which words in a paragraph to focus on? Analysing the test output, there are quite a few strange predictions: the sentence is very simple and the keywords are very clear, yet the model fails to predict labels I thought would be easy. So I also wonder whether keeping the stop words would be better; I will test it.
2. The corpus consists of Baidu Encyclopedia entries (20 GB), a large number of novels (90 GB), and Sohu News (12 GB). I think this covers almost all vocabulary in one unified vector space. My dataset has a distinctive feature: the descriptions are quite factual, so in theory it should be easy to classify, but the accuracy requirements are high. It is like a model that can tell basketball from football but cannot necessarily tell a left-back from a centre-back or a right-back within football. Many labels overlap (the 3-4 dominant ones are the most important and the least discriminable); when labelling manually, a human has to understand the semantics rather than just spot a keyword. In such a case, can I expect the model to give reasonable, accurate predictions?
3. How should I prepare the dataset to balance the label counts? Since label combinations outnumber single labels, I could easily reduce the over-represented labels, but I worry the results would degrade because the model wants as much data as possible; is that what is called under-sampling? If instead I add documents for each label or label combination, what is the right amount per class out of the 10,000 total? For example, with labels A, B and C where A dominates, how do I increase B, C and BC? With 26 labels in total, my head is spinning. I imagine the labels in your scenario far outnumber ours, so I am curious how you train your models, whether on a fairly large dataset, and whether a model can be fed back its prediction errors so that it improves from that experience. (A simple over-sampling sketch follows below.)
4. I really want to set a different threshold for each tag; arguably even the same tag at different ranks should have different thresholds, since the confidence gap between the first and the fifth predicted tag is very large. Are the thresholds trained and fitted to the data? How is that done? Could you provide some concrete ideas? Sorry for writing such a long passage; it may be down to my limited powers of expression.
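On question 3, one simple approach is naive over-sampling: duplicate documents carrying rare labels until each label reaches a minimum count. A sketch, where `docs` is an illustrative list of `(text, labels)` pairs and all names are placeholders:

```python
import random
from collections import Counter

def oversample(docs, min_count=500, seed=0):
    """Duplicate documents containing under-represented labels until each
    label appears at least min_count times. Approximate: duplicating a
    multi-label document also inflates the counts of its other labels."""
    rng = random.Random(seed)
    counts = Counter(lab for _, labs in docs for lab in labs)
    out = list(docs)
    for label, count in counts.items():
        pool = [d for d in docs if label in d[1]]
        while count < min_count:
            out.append(rng.choice(pool))
            count += 1
    return out
```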
I hope to keep in touch with you in this field and exchange ideas and techniques. I am very interested in neural networks; I call it the realm of God. Those are my views, and I look forward to yours :)

dyf180615 commented 6 years ago

@jstypka After a while, there is new progress on the project, as follows:

1. Removing the stop-word filter clearly improved the results. Good idea!
2. Training the word vectors on 170,000 texts of the same type instead of 10,000 raised the result by 2%. Starting from an already high accuracy, I think that is a good improvement.
3. A problem shows up in testing. Say label A is a high-frequency label, and the words A1 or A2 each cause A to be predicted; B is a low-frequency label, and B1 is its cue word. If a text contains A1 + B1, the model predicts A + B with good confidence. But with A2 + B1, A is predicted with very high confidence while B gets very low confidence or is ignored entirely. That is what puzzles me. Even when A1 → A, B1 → B and C1 → C individually, a text with A1 + B1 + C1 yields only A with high confidence, and B and C are ignored.
4. Can you give me some advice on how to add an attention model to magpie? I think it could better identify the key words to some extent. (A generic sketch follows below.)
5. If I train on 10,000 samples and predict on those same 10,000, the results are very good, but predicting on 5,000 samples held out of training, the accuracy drops by 10%. Why?

Looking forward to your reply, thanks!
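For context on point 4: Magpie does not ship an attention layer, but since it is built on Keras, one common form of attention is a learned pooling over time steps that replaces global max pooling after the convolutional layers. Everything below (layer name, shapes, hyperparameters) is illustrative, not part of Magpie's API.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

class AttentionPooling(layers.Layer):
    """Collapse (batch, timesteps, features) to (batch, features) with a
    learned softmax weighting over timesteps."""
    def build(self, input_shape):
        self.w = self.add_weight(name='att_w',
                                 shape=(int(input_shape[-1]), 1),
                                 initializer='glorot_uniform')
        super().build(input_shape)

    def call(self, x):
        # (batch, timesteps, 1): one score per timestep, softmaxed over time
        scores = tf.nn.softmax(tf.tensordot(x, self.w, axes=1), axis=1)
        # Weighted sum over the time axis instead of max pooling
        return tf.reduce_sum(scores * x, axis=1)

# Illustrative Magpie-style CNN for 26 labels (multi-label => sigmoid):
inputs = layers.Input(shape=(200, 128))                # (timesteps, embedding dim)
x = layers.Conv1D(256, 5, activation='relu')(inputs)
x = AttentionPooling()(x)                              # instead of GlobalMaxPooling1D
outputs = layers.Dense(26, activation='sigmoid')(x)
model = models.Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy')
```

The attention weights (`scores` above) can also be inspected per document, which may help diagnose the A1/A2 + B1 confusion in point 3 by showing which words the model actually attends to.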