Open alexanderpanchenko opened 6 years ago
Main non-ensemble supervised models are ready for word2vec-type baseline embeddings.
Trained on the Farahmand dataset, tested on Reddy, Reddy++ and Farahmand (via 5-fold CV). Results can be seen in the doc.
Also added more supervised approaches from the Farahmand et al. article and unsupervised ones from Lioma et al. to the results section.
Evaluations were done with three 750-d vectors as features; the Farahmand dataset was rescaled to the Reddy scale, which eradicated the negative-correlation problem.
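The rescaling step above can be sketched as a simple linear min–max mapping. This is a hypothetical sketch: it assumes the Farahmand compositionality scores lie in [0, 1] and the Reddy judgements lie on a 0–5 scale; the exact rescaling used for the reported numbers may differ.

```python
import numpy as np

def rescale(scores, target_min=0.0, target_max=5.0):
    """Linearly map scores onto [target_min, target_max].

    Assumption: Farahmand scores span [0, 1] and are mapped onto the
    Reddy 0-5 scale; adjust the targets if the real scales differ.
    """
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return target_min + (scores - lo) * (target_max - target_min) / (hi - lo)

farahmand_like = [0.0, 0.25, 0.5, 1.0]
print(rescale(farahmand_like))  # maps 0..1 onto 0..5
```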
Thanks! Can you please also generate

- LR concat 750x2
- SVR concat 750x2
- KR concat 750x2
- SGD concat 750x2
- KNN concat 750x2
- PLS concat 750x2
- Tree concat 750x2

where one of the 750-dimensional embeddings is the sum of the individual word embeddings and the other is the compound embedding?
Example:
hot+dog = 750 dims
hot_dog = 750 dims
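The requested concat 750x2 feature can be sketched as follows. The embedding values here are random stand-ins; in practice the vectors would come from the word2vec-type model, with `hot_dog` looked up as a single joined token.

```python
import numpy as np

DIM = 750  # dimensionality of the word2vec-type embeddings

def concat_features(word_vecs, compound_vec):
    """Build the 'concat 750x2' feature: [sum of word vectors ; compound vector].

    word_vecs: list of per-word embeddings, e.g. for 'hot' and 'dog'
    compound_vec: embedding of the joined token, e.g. 'hot_dog'
    """
    summed = np.sum(word_vecs, axis=0)              # hot + dog -> 750 dims
    return np.concatenate([summed, compound_vec])   # -> 1500 dims total

# random stand-ins for the real embeddings
rng = np.random.default_rng(0)
hot, dog, hot_dog = (rng.normal(size=DIM) for _ in range(3))
x = concat_features([hot, dog], hot_dog)
print(x.shape)  # (1500,)
```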
> On 20 Aug 2018, at 11:15, Dmitri (notifications@github.com) wrote:
> Models: Linear Regression, Support Vector Regression, Kernel Regression, SGD Regression, K Nearest Neighbors Regression, PLS Regression, Decision Tree. For SVR and LR, different feature approaches were used (cosine distance, Euclidean distance, and the raw vector difference); the latter didn't really work out. Trained on the Farahmand dataset, tested on Reddy, Reddy++ and Farahmand (via 5-fold CV). Results can be seen in the doc.
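The three feature approaches for SVR/LR mentioned in the quoted comment (cosine distance, Euclidean distance, raw vector difference) can be sketched as below. This is an assumed reconstruction: the features compare the summed word vector with the compound vector, and the exact variants used may differ.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(u - v))

def features(word_sum, compound, mode="cosine"):
    """Select the feature approach fed to SVR/LR.

    'diff' is the raw vector difference, which reportedly didn't work well.
    """
    if mode == "cosine":
        return np.array([cosine(word_sum, compound)])
    if mode == "euclidean":
        return np.array([euclidean(word_sum, compound)])
    if mode == "diff":
        return word_sum - compound
    raise ValueError(f"unknown mode: {mode}")
```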
Random Forest model evaluation is ready and can be seen in the table.
Predictions for cross-sense Sensegram cosines are ready (n_features = maximum number of cosines = 72); see the table. Worse than the baseline.
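Since compounds have varying numbers of sense pairs, the variable-length list of cross-sense cosines presumably has to be padded to the fixed n_features = 72. A minimal sketch, assuming zero-padding (the actual fill value and ordering used are not stated):

```python
import numpy as np

N_FEATURES = 72  # maximum number of cross-sense cosines observed

def pad_cosines(cosines, n=N_FEATURES, fill=0.0):
    """Pad (or truncate) a variable-length list of sense-pair cosines
    to a fixed-length feature vector of size n."""
    out = np.full(n, fill)
    vals = np.asarray(cosines[:n], dtype=float)
    out[:len(vals)] = vals
    return out

print(pad_cosines([0.9, 0.4, 0.1]).shape)  # (72,)
```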
Related-work search on supervised prediction: look at all papers that cite Reddy/Reddy++/Farahmand.

- http://www.aclweb.org/anthology/I11-1024
- http://www.aclweb.org/anthology/P16-2026
- http://www.aclweb.org/anthology/W15-0904

(Write a script that searches for the word 'supervised' in the downloaded PDF files.)
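The PDF-search script could look like the sketch below. It assumes the `pdftotext` CLI (from poppler-utils) is on PATH for text extraction; any PDF-to-text library could be swapped in. `scan_pdfs` and its directory layout are hypothetical names, not an existing script.

```python
import re
import subprocess
from pathlib import Path

def mentions_term(text, term="supervised"):
    """Case-insensitive whole-word search in extracted text."""
    return re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE) is not None

def scan_pdfs(pdf_dir, term="supervised"):
    """Report which downloaded PDFs mention `term`.

    Assumes `pdftotext <file> -` prints the extracted text to stdout.
    """
    hits = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        text = subprocess.run(["pdftotext", str(pdf), "-"],
                              capture_output=True, text=True).stdout
        if mentions_term(text, term):
            hits.append(pdf.name)
    return hits
```

Note that the `\b` word boundary keeps "unsupervised" from matching, which matters here since many of the cited papers discuss both settings.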
Read http://www.aclweb.org/anthology/P16-1187.
Use the Reddy, Reddy++ and Farahmand datasets to train supervised models (all applicable sklearn models plus neural classifiers using Keras) to get a new baseline.
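The sklearn side of this baseline could follow the 5-fold CV protocol used above, scoring by Spearman correlation (the usual metric on these datasets). A sketch on synthetic stand-in data, since the real embedding features and compositionality scores are in the doc; the Keras classifiers would slot into the same loop.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

def cv_spearman(model, X, y, n_splits=5, seed=0):
    """Mean Spearman rho over 5-fold CV -- a sketch of the baseline protocol."""
    rhos = []
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model.fit(X[train], y[train])
        rho, _ = spearmanr(y[test], model.predict(X[test]))
        rhos.append(rho)
    return float(np.mean(rhos))

# synthetic stand-in for embedding features and compositionality scores
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=100)

for name, model in [("LR", LinearRegression()), ("SVR", SVR()),
                    ("KNN", KNeighborsRegressor())]:
    print(name, round(cv_spearman(model, X, y), 3))
```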