match dataset and emotions indexes

maciejskorski commented 9 months ago

When creating the dataset with

import pandas as pd
from pathlib import Path

def open_fn(f):
    # Files that cannot be parsed yield an empty frame, which contributes no rows.
    try:
        return pd.read_csv(f,engine='python')
    except Exception:
        return pd.DataFrame()

tweets2 = pd.concat([
    pd.concat(map(open_fn, Path('../data/futurists_kol/data').rglob('*csv'))),
    pd.concat(map(open_fn, Path('../data/futurists_rossdawson/data').rglob('*csv')))
])

tweets2.columns = ['index','user','timestamp','url','txt']
tweets2 = tweets2.drop_duplicates(subset=['txt'])
tweets2.reset_index(inplace=True,drop=True)

print(tweets2.loc[3,'txt'])

and comparing the emotions for the chosen text

The fight over what AIs say and do has just started, and will never end.

we find two different results.

Namely

doc2emotion = pd.read_pickle("emotions.pkl")
doc2emotion.loc[3]

gives

sadness         0.390508
disgust         0.955647
anger           0.981497
pessimism       0.076873
fear            0.029073
anticipation    0.028830
surprise        0.016065
joy             0.020227
optimism        0.011874
love            0.005367
trust           0.005419
Name: 3, dtype: float64

and from

from transformers import pipeline

pipe = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-emotion-multilabel-latest", top_k=25)
pipe(tweets2.loc[3,'txt'])

we obtain

[[{'label': 'anger', 'score': 0.8605059385299683},
  {'label': 'disgust', 'score': 0.5500710010528564},
  {'label': 'optimism', 'score': 0.3988407254219055},
  {'label': 'anticipation', 'score': 0.19056400656700134},
  {'label': 'pessimism', 'score': 0.09352655708789825},
  {'label': 'sadness', 'score': 0.07335002720355988},
  {'label': 'fear', 'score': 0.06205078586935997},
  {'label': 'trust', 'score': 0.05718396231532097},
  {'label': 'joy', 'score': 0.02284206822514534},
  {'label': 'surprise', 'score': 0.00987330824136734},
  {'label': 'love', 'score': 0.005672035273164511}]]
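To put a number on the mismatch, the pipeline output can be flattened into a Series and aligned with the stored row by label. This is only a sketch: the helper names `pipeline_output_to_series` and `max_abs_diff` are made up for illustration, and the demo uses just three of the label scores quoted above.

```python
import pandas as pd

def pipeline_output_to_series(output):
    """Flatten one text-classification result into a Series indexed by label."""
    # The pipeline may wrap the per-text list in an outer list, as in the output above.
    if output and isinstance(output[0], list):
        output = output[0]
    return pd.Series({d['label']: d['score'] for d in output}, dtype=float)

def max_abs_diff(stored_row, output):
    """Largest absolute score difference between a stored row and a fresh prediction."""
    fresh = pipeline_output_to_series(output)
    # Align on label names so that the ordering of labels does not matter.
    return (stored_row - fresh.reindex(stored_row.index)).abs().max()

# Three of the scores reported above: stored row vs. fresh pipeline output.
stored = pd.Series({'anger': 0.981497, 'disgust': 0.955647, 'sadness': 0.390508})
fresh = [[{'label': 'anger', 'score': 0.8605059385299683},
          {'label': 'disgust', 'score': 0.5500710010528564},
          {'label': 'sadness', 'score': 0.07335002720355988}]]

print(max_abs_diff(stored, fresh))
```

A difference this large (about 0.4 on `disgust`) suggests the two rows describe different texts rather than numerical noise.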

xaru commented 9 months ago

I'm afraid there are more interesting things going on here :) When I run your code, this is the text I get at that index:

import pandas as pd
from pathlib import Path

def open_fn(f):
    # Files that cannot be parsed yield an empty frame, which contributes no rows.
    try:
        return pd.read_csv(f,engine='python')
    except Exception:
        return pd.DataFrame()

tweets2 = pd.concat([
    pd.concat(map(open_fn, Path(repo_path/'data/futurists_kol/data').rglob('*csv'))),
    pd.concat(map(open_fn, Path(repo_path/'data/futurists_rossdawson/data').rglob('*csv')))
])

tweets2.columns = ['index','user','timestamp','url','txt']
tweets2 = tweets2.drop_duplicates(subset=['txt'])
tweets2.reset_index(inplace=True,drop=True)

print(tweets2.loc[3,'txt'])

@sendavidperdue But you supported Trump... and his lies and behavior. Biiiiiiiig mistake. But, nice try..loser

The potential differences may also come from preprocessing: if we drop the user mention at the front (which I do), the results are slightly different from when we keep it:

But you supported Trump... and his lies and behavior. Biiiiiiiig mistake. But, nice try..loser
[{'label': 'anger', 'score': 0.9814966917037964},
 {'label': 'disgust', 'score': 0.9556466341018677},
 {'label': 'sadness', 'score': 0.3905079960823059},
 {'label': 'pessimism', 'score': 0.07687253504991531},
 {'label': 'fear', 'score': 0.029073450714349747},
 {'label': 'anticipation', 'score': 0.0288297887891531},
 {'label': 'joy', 'score': 0.020227165892720222},
 {'label': 'surprise', 'score': 0.016065144911408424},
 {'label': 'optimism', 'score': 0.011874457821249962},
 {'label': 'trust', 'score': 0.005419053602963686},
 {'label': 'love', 'score': 0.005367077421396971}]

vs

@sendavidperdue But you supported Trump... and his lies and behavior. Biiiiiiiig mistake. But, nice try..loser
[{'label': 'anger', 'score': 0.9818358421325684},
 {'label': 'disgust', 'score': 0.9528217911720276},
 {'label': 'sadness', 'score': 0.2592710554599762},
 {'label': 'pessimism', 'score': 0.06370002031326294},
 {'label': 'anticipation', 'score': 0.03384344279766083},
 {'label': 'fear', 'score': 0.026571445167064667},
 {'label': 'joy', 'score': 0.026189832016825676},
 {'label': 'surprise', 'score': 0.016805190593004227},
 {'label': 'optimism', 'score': 0.01543356291949749},
 {'label': 'trust', 'score': 0.005702258553355932},
 {'label': 'love', 'score': 0.004971153102815151}]
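The exact preprocessing is not shown in this thread, so the following is only a guess at what "dropping the user at the front" could look like: a regex that removes one or more `@handle` tokens at the very start of the tweet and leaves later mentions alone. The name `strip_leading_mentions` and the pattern are assumptions.

```python
import re

# Assumed rule: remove one or more @handles, but only at the start of the text.
LEADING_MENTIONS = re.compile(r'^(?:@\w+\s*)+')

def strip_leading_mentions(text):
    """Return the tweet text without any leading @mentions."""
    return LEADING_MENTIONS.sub('', text)

raw = ("@sendavidperdue But you supported Trump... and his lies and behavior. "
       "Biiiiiiiig mistake. But, nice try..loser")
print(strip_leading_mentions(raw))
```

Whichever rule is used, it has to be applied identically when building `emotions.pkl` and when re-scoring a tweet, otherwise the scores will drift as in the two outputs above.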

maciejskorski commented 9 months ago

So it is the order in which the files are merged that differs, and it depends on the OS. Let's exchange our file reading orders:

import itertools
from pathlib import Path

files1 = Path('../data/futurists_kol/data').rglob('*csv')
files2 = Path('../data/futurists_rossdawson/data').rglob('*csv')
files = itertools.chain(files1,files2)

# Write the file names in the order they are read, one per line.
with open('account_list.txt','wt') as f:
    for fpath in files:
        f.write(fpath.name+'\n')

Here is my account_list.txt
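One possible way to make the row order reproducible across machines is to sort the paths before concatenating, since `rglob` makes no promise about the order in which it yields files. The sketch below is not the code used to build `emotions.pkl`: `read_csvs_sorted` is a hypothetical helper, and the demo runs on throwaway files in a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

def read_csvs_sorted(root):
    """Concatenate all CSV files under root in a fixed, OS-independent order."""
    # Sort by the POSIX-style relative path so every system reads files in the same order.
    files = sorted(Path(root).rglob('*csv'),
                   key=lambda p: p.relative_to(root).as_posix())
    return pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

# Demo: files created in reverse alphabetical order are still read as a.csv, b.csv.
with tempfile.TemporaryDirectory() as tmp:
    for name in ['b.csv', 'a.csv']:
        pd.DataFrame({'txt': [name]}).to_csv(Path(tmp) / name, index=False)
    demo = read_csvs_sorted(tmp)
    print(demo['txt'].tolist())
```

Because `drop_duplicates` keeps the first occurrence and `reset_index` renumbers the rows afterwards, a fixed reading order is what keeps index 3 pointing at the same tweet on every machine.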

maciejskorski / anticipatio

match dataset and emotions indexes #6