tanthongtan / dv-cosine


How can I test it for my dataset? #2

Closed luckysunda closed 2 years ago

luckysunda commented 4 years ago

I have a text file of sentences and paragraphs. How can I predict their polarity using this model? Since the model requires embeddings, how can I get embeddings for my text file?

Gouber commented 3 years ago

Have you managed to find an answer to this?

perathambkk commented 2 years ago

I doubt the existence of the first author. This is probably a social experiment (again, like SCIgen: https://pdos.csail.mit.edu/archive/scigen/).

And I believe the paper has already been invalidated [1]. If you know about out-of-sample extension / out-of-vocabulary handling, this is not going to fool you. Please don't use the technique, and stop citing this work, which I consider research misconduct.

In fact, I wonder why they didn't revise or retract the paper. They seem to reproduce the error in the RepL workshop doc2vec paper, where the authors also seemed not to know what an out-of-sample extension / out-of-vocabulary handling is.

Look at the rising citation count and the papers citing this work... https://scholar.google.co.th/scholar?oi=bibs&hl=th&cites=12336011503386939809

[1] Bingyu, Zhang, and Nikolay Arefyev. "The Document Vectors Using Cosine Similarity Revisited." Proceedings of the Third Workshop on Insights from Negative Results in NLP. 2022.

tanthongtan commented 2 years ago

> I suspect the existence of the first author. Probably a social experiment (again. like SCIgen https://pdos.csail.mit.edu/archive/scigen/).
>
> And I feel the paper had been invalidated already [1]. If you know out-of-sample/vocab, this is not going to fool you. Please don't use the technique and stop citing this misconducted work.
>
> In fact, I wonder why they didn't edit or retract the paper. They seem to reproduce the error in the RepL workshop doc2vec paper where the authors seemed not to know what an out-of-sample-extension/vocab is.
>
> Look at the rising citation count and the papers citing this work... https://scholar.google.co.th/scholar?oi=bibs&hl=th&cites=12336011503386939809
>
> [1] Bingyu, Zhang, and Nikolay Arefyev. "The Document Vectors Using Cosine Similarity Revisited." Proceedings of the Third Workshop on Insights from Negative Results in NLP. 2022.

Hello there, there is indeed a bug in the implementation of the ensemble methods presented in the paper. The paper you mentioned, "The Document Vectors Using Cosine Similarity Revisited", details the exact issue with the implementation very well. A revision of the original paper is underway, and this GitHub repository will be updated with the correct implementation very shortly.

In short, the problem comes from incorrectly concatenating the representations of two different documents of the same class: the datasets used to build the two representations (DV-ngrams-cosine and NB-weighted-BON) were not in the same order. Thankfully, this means the rest of the methods (all the non-ensemble methods) are completely unaffected.
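To illustrate the kind of bug being described, here is a minimal toy sketch (my own hypothetical data and variable names, not the actual repository code) of how a row-wise concatenation goes wrong when the two representations list the documents in different orders, and how aligning by a shared document ID fixes it:

```python
import numpy as np

# Two representations of the same 4 documents, produced by pipelines that
# happened to iterate over the dataset in different orders.
doc_ids_a = [0, 1, 2, 3]  # document order used for representation A
doc_ids_b = [2, 0, 3, 1]  # document order used for representation B

rep_a = np.array([[1.0], [2.0], [3.0], [4.0]])      # stand-in for DV-ngrams-cosine
rep_b = np.array([[10.0], [20.0], [30.0], [40.0]])  # stand-in for NB-weighted-BON
# rep_b[i] is the vector for document doc_ids_b[i]

# Buggy ensemble: concatenate row-by-row, ignoring the ordering mismatch,
# so row i glues together features from two *different* documents.
buggy = np.hstack([rep_a, rep_b])

# Correct ensemble: realign representation B to A's document order first.
pos_in_b = {d: i for i, d in enumerate(doc_ids_b)}
aligned_b = rep_b[[pos_in_b[d] for d in doc_ids_a]]
correct = np.hstack([rep_a, aligned_b])

print(correct)  # each row now holds both vectors of the same document
```

Here row 0 of `correct` pairs document 0's vectors from both representations, whereas row 0 of `buggy` pairs document 0's A-vector with document 2's B-vector.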

perathambkk commented 2 years ago

[1] provides their implementation too (at https://github.com/bgzh/dv_cosine_revisited). Also, some researchers from the ARK lab at the University of Washington raised doubts about how you used the data, in their paper: Gururangan, Suchin, et al. "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.

This seems to be addressed in [1] as well. You may take a look at a notebook in their repo: https://github.com/Bgzh/dv_cosine_revisited/blob/main/compare_2_ways_of_concatenation.ipynb

I just happened to come across your paper and have already had some conversation with Prof. Tanasanee via email. I wonder why your proposed method has held SOTA for so long (3 years and counting); that's just not typical given the fast pace of this field. There's also a paper paying tribute to a deceased professor in the field that mentions your result: Poria, Soujanya, et al. "Beneath the tip of the iceberg: Current challenges and new directions in sentiment analysis research." IEEE Transactions on Affective Computing (2020).

This has become quite something now. At first, I thought Prof. Tanasanee had participated in the SCIgen project. Please revise and correct.

Best, Peratham

tanthongtan commented 2 years ago

Thanks for your concern, a revision to the paper and Github will be done ASAP.

perathambkk commented 2 years ago

It's quite a serious research-ethics issue (possibly result forgery or research misconduct), but nobody really seems to care, as far as I can see. So I'll just say what I can without reporting directly to the ACL. I can't even confirm your existence. Also, I'm not sure what you can do about the many papers citing your results; it has been years.

https://aclrollingreview.org/responsibleNLPresearch/

tanthongtan commented 2 years ago

As I’ve explained, a mistake was made in the ordering of the two sets of documents used to build the ensembles. That was definitely not intentional. At the time, I checked the code and everything to the best of my ability to make sure the ensemble results were correct. Obviously, that wasn’t enough, as I failed to check that each concatenated document representation actually came from the same document. You could call it carelessness on my part for failing to check that the two datasets used to build the ensembles were in the same order, but please refrain from unfounded accusations of result forgery.

I’ve been completely transparent about all methods since the paper’s publication. The code and data used for all the experiments have been open source for years. That allowed Zhang and Arefyev to pinpoint the exact issue with the ensembles in the paper “The Document Vectors Using Cosine Similarity Revisited”, for which I am greatly appreciative. I am genuinely overjoyed that they’ve solved a three-year-long mystery as to why the accuracy of the ensemble was so high.

They’ve discovered a bug in the implementation of the ensembles, most notably that the dv-ngrams-cosine + NB-weighted-BON ensemble accuracy of 97.42% should actually be reduced to 93.68%. Yes, the actual accuracy is lower, but that doesn’t invalidate the method. In that very same paper they also improve the original dv-cosine method and show that the dv-cosine ensemble can outperform RoBERTa when the training set is very small (10 or 20 documents). Finally, they show that the DV-ngrams-cosine with NB sub-sampling + RoBERTa.base ensemble still achieves #4 on IMDb. This is a great presentation by the authors on the entire topic: https://underline.io/events/284/sessions/10984/lecture/52733-the-document-vectors-using-cosine-similarity-revisited. Overall, the dv-ngrams-cosine ensemble still gives competitive accuracy considering that it’s a relatively small model, isn’t a transformer, and doesn’t use external data.

Another note is that the ensembles are not an integral part of the original paper. They are just extra experiments and have little to no effect on the main findings of the paper; they are only mentioned in the abstract and in Section 4.4. The paper showed that using cosine similarity instead of the dot product when training document vectors improves performance on IMDb. The concatenation bug only affects the ensembles and has zero effect on the main findings or experiments of the paper.
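As a rough sketch of that main idea (my own minimal example with made-up toy vectors, not the paper's actual training code): scoring with cosine similarity instead of the dot product bounds the score to [-1, 1] and makes it invariant to the magnitude of the vectors, so vector norms can no longer dominate the training signal:

```python
import numpy as np

def dot_score(d, w):
    # standard dot-product score between a document vector and an n-gram vector
    return float(d @ w)

def cosine_score(d, w, eps=1e-8):
    # cosine similarity: the dot product of the normalized vectors,
    # bounded to [-1, 1] regardless of the vectors' magnitudes
    return float(d @ w) / (np.linalg.norm(d) * np.linalg.norm(w) + eps)

d = np.array([3.0, 4.0])  # toy "document" vector
w = np.array([4.0, 3.0])  # toy "n-gram" vector

print(dot_score(d, w))         # 24.0
print(cosine_score(d, w))      # ~0.96 (= 24 / 25)
print(cosine_score(5 * d, w))  # still ~0.96: invariant to vector scale
print(dot_score(5 * d, w))     # 120.0: grows with the vector norm
```

Scaling the document vector by 5 multiplies the dot-product score fivefold but leaves the cosine score unchanged, which is the property the cosine-based objective exploits.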

I’m not sure what you mean by confirming my existence; you are talking to me right now. If you want a video interview, I’m up for that too. You can always contact me via email: tan.thongtan1@gmail.com. I’ll even give you my Line ID if you want.

Finally, ever since I discovered the Zhang and Arefyev paper a couple of weeks ago, I’ve been doing my best to recheck everything and submit an accurate revision. I’ve been rerunning all the ensembles with the correct ordering to confirm the experiments. I intend to correct the erroneously high accuracies, and to confirm and explain the reason for the previously published ensemble numbers. As for all the citations of “state of the art”, it deeply saddens me that my mistake has misled so many. It is a mistake I won’t forget. All I want is for the truth to be out there; I can only do my best now to correct everything I can and re-emphasize what we’ve learned from this mistake.

perathambkk commented 2 years ago

Also, there is a data leakage issue in your experiments, as pointed out by many groups, for example in https://github.com/bgzh/dv_cosine_revisited.

Please consult your advisor about the research ethics in your case, if you exist.

It's not my business to tell you what to do. I could even just file a report with the ACL.

Also, the results and numbers are not everything. The science that lets others build on it is what really counts.

tanthongtan commented 2 years ago

> Also, there is a data leakage issue in your experiments. As pointed out by many groups, for example in https://github.com/bgzh/dv_cosine_revisited.

Please refer to my previous replies. I've talked about the Zhang and Arefyev paper multiple times.

> Please consult your advisor about the research ethics in your case, if you exist.

Sure.

> It's not my business to tell you what to do. I can even just file the report to the ACL.

...Okay? As I said, I've been working on corrections ever since I saw the Zhang and Arefyev paper weeks ago.

> Also, the results and numbers are not everything. The science which lets others build upon that what really counts.

Exactly: in the link you sent me (https://github.com/bgzh/dv_cosine_revisited), they did indeed build upon my original dv-cosine method; see the "Naive Bayesian Sub-sampling" section.

I've only tried to answer any concerns as best I could. I am well aware of the Zhang and Arefyev paper, as I've been reading it repeatedly for weeks now. If you have any points, constructive feedback, or errors you've found outside of that paper, feel free to comment. However, between the Zhang and Arefyev paper and my previous comments, I believe most of the problems with the paper (which, once again, are limited to the ensembles) have been addressed. I will be focusing on the revision now.

perathambkk commented 2 years ago

You don’t need to be defensive about these kinds of issues. Be honest. Act in good faith.

https://www.enago.com/academy/handling-errors-published-paper-tips-authors/
