seraphlabs-ca / SentenceMIM-demo

This repo contains code to reproduce some of the results presented in the paper "SentenceMIM: A Latent Variable Language Model"
MIT License

What happened to your SOTA result? #5

Closed · LifeIsStrange closed this issue 4 years ago

LifeIsStrange commented 4 years ago

https://paperswithcode.com/sota/language-modelling-on-penn-treebank-word

GPT-3 has been released, improving the previous SOTA from 35.76 to 20.5, which is a huge gain. BUT before the release of GPT-3, I had seen YOUR results, and they were so exceptional that I wrote them down in a note!

sMIM reported a test perplexity of 4.6 on the Penn Treebank benchmark, with an order of magnitude fewer parameters than GPT-2. WHAT HAPPENED?? Humanity needs those accuracy gains!

LifeIsStrange commented 4 years ago

@michalivne

LifeIsStrange commented 4 years ago

The results are still in the paper, btw, with sMIM-1024 (179M parameters). Were the results erroneous or unreproducible? Otherwise, you are losing your momentum against GPT-3.

LifeIsStrange commented 4 years ago

https://paperswithcode.com/paper/200302645

LifeIsStrange commented 4 years ago

Someone on Hacker News claims:

the MELBO bound in that paper is invalid. Their perplexity numbers using the MELBO bound are also invalid.

Is that true? With a corrected MELBO, how much would the test perplexity on Penn Treebank be affected?

LifeIsStrange commented 4 years ago

The bound is completely invalid, as are the NLL/PPL numbers they report with the MELBO. Look at the equation: if they optimized it directly, it would be trivially driven to 0 by the identity function, given a latent space equivalent to the input space. The MELBO just adds noiseless autoencoder reconstruction error to a constant offset equal to the log of the test set size, and that offset can be driven to zero by evaluating an average bound over test sets of size 1.

The mathematical/conceptual error is that each test point is assumed to be added to the "post-hoc aggregated" prior when the bound is evaluated. This is analogous to including a test point in the training set. Another version of this error would be adding a kernel centered on each test point to a kernel density estimator prior to evaluating test set NLL; in that case, obviously, the best kernel has variance 0 and assigns arbitrarily high likelihood to the test data.
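To make the claimed failure mode concrete, here is a hedged sketch of the arithmetic behind that argument (my reconstruction from the critique above, not the paper's exact derivation or notation): take the prior to be the posterior aggregated over the N test points, and the bound decomposes into a reconstruction term minus a log N offset.

```latex
\[
  \log p(x_i) \;\ge\;
  \underbrace{\mathbb{E}_{q(z \mid x_i)}\!\bigl[\log p(x_i \mid z)\bigr]}_{\text{reconstruction}}
  \;-\;
  \mathrm{KL}\!\Bigl(q(z \mid x_i) \,\Big\|\, \tfrac{1}{N}\textstyle\sum_{j=1}^{N} q(z \mid x_j)\Bigr)
\]
% If the encoder is near-deterministic and the N posteriors barely
% overlap, the mixture near z_i is dominated by its own component,
% so KL ~ log N and the bound is (reconstruction) - log N.
% A noiseless autoencoder drives reconstruction to 0, leaving -log N,
% and even that offset vanishes when the bound is averaged over test
% sets of size N = 1, matching the collapse described above.
```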

michalivne commented 4 years ago

@LifeIsStrange thanks for your interest. We are very excited about sentenceMIM and believe it is a valuable model that should be further investigated. Regarding your comments/questions: the MELBO bound is valid; however, there is a question regarding what exactly PPL measures when it is computed with MELBO. We chose to remove the PPL results for the time being, since comparing them directly to other probability density estimators is questionable, as explained below.

The issue with the reported numbers is not their validity, but rather the validity of the comparison. PPL (i.e., a per-token probability-related measure) is commonly used to test the generalization of a learned model on a held-out sample set. While the weights of sentenceMIM are independent of the test sample set, sentenceMIM can be viewed as a transductive model (i.e., the prior depends on the test sample points, which are then marginalized out). This view, however, raises the question of whether the PPL numbers reported for such a transductive model are comparable to values reported by a model like GPT-2. As such, we chose to remove the results for the time being, until we complete further investigation.

Alternatively, MELBO can be viewed as measuring the ability of sentenceMIM to disambiguate between a set of given observations (i.e., sentences). In that view, sentenceMIM PPL values cannot be compared to PPL values of other PDF estimators (i.e., we cannot compare apples to oranges). This view is reflected in your last comment, in which the model can be evaluated independently on each test point. Under this view, the reported PPL values are indeed not a valid measure of the generalization of the model on a target sample set (despite the independence of the learned weights from the test sample set).
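As a back-of-the-envelope illustration of how the test set size enters such a number (my own arithmetic, using the commonly cited PTB test statistics of roughly 3,761 sentences and 82,430 tokens, not figures from the paper):

```latex
\[
  \bar{L} \approx \frac{82{,}430}{3{,}761} \approx 21.9 \ \text{tokens/sentence},
  \qquad
  \frac{\ln N}{\bar{L}} \approx \frac{\ln 3761}{21.9} \approx \frac{8.23}{21.9} \approx 0.38 \ \text{nats/token},
  \qquad
  e^{0.38} \approx 1.46
\]
```

So the log N offset alone contributes only a factor of about 1.46 to per-token perplexity; the rest of a MELBO-based PPL is reconstruction cost. This is one way to see why such a number reflects disambiguation plus reconstruction rather than density estimation, and why comparing it to GPT-2's PPL is apples to oranges.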

As a side note, matching the dimensionality of the latent space to the observation space is not directly possible, since the observation space is discrete and the latent space is continuous.

Despite the issues in the reported PPL values, MELBO is never used in training, and was introduced solely for the purpose of comparison to existing models. Nevertheless, we still demonstrate a superior BLEU score (i.e., compared to VAE and AE) and SOTA Q&A results (i.e., for single-task models). Those results are valid and independent of the issues introduced by MELBO. We believe that sentenceMIM is still a promising model that might be of interest to the research community.

I hope my reply answers your questions.

LifeIsStrange commented 4 years ago

@michalivne thank you for the explanation. Sadly, I don't have enough expertise to fully understand it (I'm not a researcher), but it does not matter! Personally, PPL on Penn Treebank is not what really interests me.

Nevertheless, we still demonstrate a superior BLEU score (i.e., compared to VAE and AE) and SOTA Q&A results (i.e., for single-task models).

This is a big achievement! I fail to understand what kinds of NLP tasks are not suited for sMIM. Is it as general as BERT/XLNet? Because GPT-2 is not as general as those (it is specialized only for "generative" tasks?).

I want to accomplish AGI by building the first true semantic parser for natural language. In order to do that, I depend on lower-level tasks such as POS tagging, dependency parsing, and coreference resolution.

Their errors all add up, and honestly the current state of the art is bad. To that end, training sMIM on those cornerstone, fundamental NLP tasks seems to me like a priority, as it might have the potential to beat the state of the art. What do you think about this goal? Will you pursue it, or do you have other priorities?

For example, it seems simple to beat the current coreference resolution SOTA (https://github.com/sebastianruder/NLP-progress/blob/master/english/coreference_resolution.md) by just taking the latest SOTA implementation and replacing its pretrained encoder (SpanBERT) with either XLNet or sMIM.
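For concreteness, a minimal sketch of what that encoder swap might look like, using HuggingFace's transformers library (the `encode_span` helper, the pooling choice, and the checkpoint name are my illustrative assumptions, not part of any coref codebase):

```python
# Hypothetical sketch: obtain span representations from a swappable
# pretrained encoder via HuggingFace transformers. A coref model would
# consume these vectors instead of SpanBERT's span embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "xlnet-base-cased"  # swap in any encoder checkpoint here

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def encode_span(text: str, start: int, end: int) -> torch.Tensor:
    """Mean-pool the encoder's hidden states over the tokens that
    overlap the character span text[start:end]."""
    enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0]  # (num_tokens, 2) char offsets
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]  # (num_tokens, dim)
    # Keep tokens whose character range overlaps [start, end); special
    # tokens have offsets (0, 0) and are excluded automatically.
    mask = (offsets[:, 0] < end) & (offsets[:, 1] > start)
    return hidden[mask].mean(dim=0)

vec = encode_span("The engineer said she would fix it.", 0, 12)  # "The engineer"
print(vec.shape)  # e.g. torch.Size([768])
```

Whether a generic encoder actually beats SpanBERT here is an open question; SpanBERT was pretrained with span-level objectives, which is part of why it works well for coreference.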

michalivne commented 4 years ago

Hi,

I am currently working on a more complete demonstration of sMIM use cases for NLP. I will update the repository once the pre-print is updated.

Generally speaking, MIM is a general learning framework (closely related to VAE), and as such sMIM has no inherent limitations regarding downstream tasks. Compared to BERT, sMIM is also a generative model, not only a representation learning model. As a result, it is very easy to perform tasks such as Q&A. I wish you all the best with your goals (I share similar goals), and I encourage you to share any success stories with me. I will be happy to add your contributions to the sentenceMIM repo (and credit you, of course).

Best regards,

Micha


LifeIsStrange commented 4 years ago

Thanks for the kind answer :)

I encourage you to share any success stories with me.

The only success story I have was implementing the first natural-language syllogism checker. It takes sentences as input, checks whether they contain a syllogism, and checks whether the syllogism has a formal error (a sophism/fallacy). I want to extend this analysis to more kinds of logical fallacies, and also to propositional logic. But before doing more advanced analyses, I need to serialize a memory of past semantics into a database and to be more robust to rich sentences, e.g. by losslessly splitting "This guy is talented and kind." into "This guy is talented." and "This guy is kind." in order to make them more easily processable.
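A minimal sketch of that kind of lossless split, assuming spaCy's dependency parse (the model choice and the rewrite rule are my illustrative assumptions; a robust version would need to handle many more constructions):

```python
# Illustrative sketch: split a coordinated predicate adjective into
# separate sentences using spaCy's dependency parse.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def split_conjoined_adjectives(sentence: str) -> list[str]:
    """'This guy is talented and kind.' ->
    ['This guy is talented.', 'This guy is kind.']"""
    doc = nlp(sentence)
    for tok in doc:
        # An adjectival complement of the verb, with conjoined adjectives.
        if tok.dep_ == "acomp" and tok.conjuncts:
            stem = doc[: tok.i].text  # e.g. "This guy is"
            adjectives = [tok, *[t for t in tok.conjuncts if t.pos_ == "ADJ"]]
            return [f"{stem} {adj.text}." for adj in adjectives]
    return [sentence]  # fall back to the original sentence

print(split_conjoined_adjectives("This guy is talented and kind."))
# ['This guy is talented.', 'This guy is kind.']
```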

But because I'm still a computer engineering student, I only do this in my spare time, so progress is slow :) I hope to have more to show this summer!

Keep up the good work!