Open dyf180615 opened 6 years ago
hey @dyf180615 , good questions! The problem is complex and I can't fix it for you with a GitHub comment, but here are a few tips:
1) Do not filter out stopwords. The model is designed to use them, so in principle it should give better results without filtering.
2) 10,000 documents should be enough for training the word2vec embeddings, but you might get better results by pretraining the embeddings on a different corpus. The only way to know is to check, but I wouldn't bet my money on it. I also wouldn't worry about the dictionary differences unless you expect the corpora to have vastly different vocabularies, e.g. training on legal documents and predicting on physics papers.
3) What is probably happening is that your model is very eager to predict the most popular labels and very reluctant to predict the rare ones. That usually falsely inflates your accuracy, but makes the model biased. You might want to explore training and testing on a more balanced dataset where the label occurrences are roughly equal (the same number of documents with label 1 as with label 2, etc.).
4) Picking 0.2 as the cutoff threshold might also favour the most popular labels. You might want to pick a different threshold for each label to account for that. Ideally the thresholds would also be trained and fitted on the data (but this is hard).
5) Try decreasing the word2vec vector size from 128 to 60-70. I would expect the accuracy to stay the same and the training to be much faster.
6) I have only used Magpie on European languages, so if your dataset is in Chinese, some of these rules might not hold :)
Hope that helps a bit!
@jstypka thanks so much for your tips! I saw your reply before I slept yesterday; I even dreamt about some tips similar to yours, but forgot everything after waking up in the morning :) First of all, thank you very much for your suggestions. I will run the tests one by one and follow up with feedback. Here are some points of confusion:
@jstypka After a while, there is new progress on the project:
1. Not filtering out stopwords clearly improved the results. Good idea!
2. I expanded the word-vector training corpus from 10,000 to 170,000 texts of the same type, and the score increased by 2%. Given the already high accuracy, I think that is a worthwhile upgrade.
3. There is a problem in testing. For example, label A has a high proportion, and texts containing the words A1 or A2 get predicted as A. Label B has a low proportion, and B1 is a word indicative of B. If a text contains A1+B1, the model predicts A+B with good confidence. But with A2+B1, label A gets very high confidence while label B gets very low confidence, and is sometimes ignored entirely. This is what puzzles me: even though A1 alone predicts A, B1 alone predicts B, and C1 alone predicts C, a text with A1+B1+C1 predicts only A with high confidence, and B and C are ignored.
4. Can you give me some advice on how to add an attention model to Magpie? I think it could better identify the key words to some extent.
5. If I train on 10,000 samples and use the model to predict those same 10,000, the results are very good, but on 5,000 held-out samples the performance drops by 10%. Why?
Looking forward to your reply, thanks!
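The attention idea from point 4 can be illustrated independently of Magpie. This is a minimal numpy sketch of attention pooling over word vectors, under the assumption that word vectors are fixed; in a real model the `query` vector would be a trained parameter of the classifier. The point is that a rare-but-discriminative word (like B1 above) can receive a high weight even when frequent words dominate the document.

```python
import numpy as np

def attention_pool(word_vectors, query):
    """Weight each word vector by its softmax-normalised similarity
    to a query vector, then sum the weighted vectors. Words similar
    to the query dominate the pooled representation."""
    scores = word_vectors @ query                    # (n_words,)
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax
    return weights @ word_vectors, weights

# Toy example: three word vectors; the query points along the third
# word's direction, so that word should dominate the pooled vector.
vecs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 3.0, 0.0]])
query = np.array([0.0, 0.0, 1.0, 0.0])
pooled, w = attention_pool(vecs, query)
```

In Magpie's CNN this would replace the final pooling layer, with one learned query (or one per label); the softmax weights also give you an interpretable view of which words drove each prediction.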
A very good tool! I have tested it on the following application scenario: Chinese documents describing environmental penalties, with 26 labels indicating the types of problems found at an enterprise; each document carries 1-5 labels. Document lengths: about 5,000 documents in the 0-49 byte range, 3,300 in 50-100 bytes, 1,100 in 100-149 bytes, and 500 in 150-700 bytes. The problem with the labels is that 3-4 of them dominate, accounting for up to 80% cumulatively; these labels often appear alone or in combination, their associations are complex, and the texts describing them are similar. The setup is this: 10,000 documents, 8,000 for training and 2,000 for testing, the CNN model, word vector size 128, batch size 64, epochs=10. I then used the trained model to predict on the full 10,000 documents. The result is this: I keep predictions with confidence greater than 0.2, capped at the top 5 labels. About 5,000 documents are completely correct, accounting for about 66%, and 90% of those are single-label; only about 5% are completely wrong, where a complete error means the correct label is not predicted at all. 15% are over-predictions (extra labels predicted, but the correct ones are also hit), and 5% are under-predictions (some labels missed, but the predicted ones include correct hits). I am very satisfied with the single-label results. However, the problems with multi-label documents are very prominent. Over-prediction is a real headache, and most of the over- and under-predictions involve the 3-4 dominant labels mentioned above, which are harder to distinguish. So, how do you feel I can improve the accuracy to what I want: 95% fully correct on single-label documents and fewer over-predicted labels, so that the overall fraction of correct documents reaches 80-90%? Might that be difficult?
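The decoding rule described above (keep labels with confidence above 0.2, capped at the top 5) can be written as a small helper. This is a hypothetical sketch, not a Magpie function; `decode` and its parameters are illustration names only.

```python
import numpy as np

def decode(probs, label_names, threshold=0.2, top_k=5):
    """Return the names of labels whose predicted confidence exceeds
    `threshold`, keeping at most the `top_k` most confident ones."""
    order = np.argsort(probs)[::-1]  # indices, most confident first
    chosen = [i for i in order if probs[i] > threshold][:top_k]
    return [label_names[i] for i in chosen]

names = ["A", "B", "C", "D", "E", "F", "G"]
probs = np.array([0.9, 0.6, 0.5, 0.4, 0.3, 0.25, 0.1])
result = decode(probs, names)  # six labels clear 0.2, the cap keeps 5
```

Lowering `top_k` or raising `threshold` trades over-prediction for under-prediction, which is exactly the tension described above; per-label thresholds generalise this rule.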
And there are some additional questions:
1. During word segmentation I added stopwords that I thought could safely be removed, and forced some common terms into the dictionary as custom entries. Does this help the results? What I observed is a better ability to distinguish the differing parts of the labels, but it also reduced the confidence of some labels that were previously predicted with high confidence.
2. For the word-vector training, I used 10,000 documents. Is that too little to establish a proper distance space for each word vector? Would it be better to use word vectors pretrained by others on a 120 GB Chinese corpus? Do I need to worry about the dictionaries being different? (I used an extra custom dictionary.)
3. Regarding the attention mechanism: is it applicable to my scenario, i.e. increasing the model's focus on specific words in the text, and how could it be applied to Magpie?
4. Any other methods you feel are likely to increase the prediction accuracy.
Thank you very much, I hope to get your help.
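On question 2 (pretrained vectors and differing dictionaries): the usual approach is to map the corpus vocabulary onto the pretrained vectors and fall back to random initialisation for words the pretrained dictionary does not cover. A minimal numpy sketch, with hypothetical names and a toy pretrained dictionary:

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Map a corpus vocabulary onto pretrained vectors. Words missing
    from the pretrained dictionary (e.g. custom-dictionary terms) get
    small random vectors, so differing dictionaries are not fatal:
    only the out-of-vocabulary rows lose the pretrained information."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(vocab), dim))
    n_oov = 0
    for i, word in enumerate(vocab):
        if word in pretrained:
            matrix[i] = pretrained[word]
        else:
            matrix[i] = rng.normal(scale=0.1, size=dim)
            n_oov += 1
    return matrix, n_oov

# Toy pretrained dictionary with 4-dimensional vectors.
pretrained = {"环境": np.ones(4), "处罚": np.full(4, 2.0)}
vocab = ["环境", "处罚", "自定义词"]  # the last word is a custom term
emb, n_oov = build_embedding_matrix(vocab, pretrained, dim=4)
```

The coverage ratio `1 - n_oov / len(vocab)` is a quick way to check whether the dictionary mismatch matters in practice: if coverage is high, the custom-dictionary terms are the only rows that start untrained.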