ilopezgazpio / PhrasIS-baselines

Repository for "PhrasIS: phrase inference and similarity benchmark" paper
MIT License

Data preprocessing #5

Open ilopezgazpio opened 3 years ago

amaiazarranz commented 2 years ago

Generic part

1.- Think about what preprocessing the left and right columns will need:

-> left_strip / right_strip : .strip()
-> left_strip_tokenized : nltk tokenization, but without stop words https://stackabuse.com/removing-stop-words-from-strings-in-python/ bektore_emaitza = [word for word in text_tokens]
-> left_strip_tokenized_noPunct : https://www.kite.com/python/answers/how-to-remove-all-punctuation-marks-with-nltk-in-python
-> left_strip_tokenized_noPunct_noStopWords : https://stackabuse.com/removing-stop-words-from-strings-in-python/

-> left_strip_tokenized_noPunct_stopwords : https://stackabuse.com/removing-stop-words-from-strings-in-python/ but with the inverse if condition:

array_stopwords = [word for word in text_tokens if word in stopwords.words()]

left_strip_tokenized_noPunct_noStopWords = content_words

Features

1- Jaccard overlap of left_strip_tokenized (with punctuation)
2- Jaccard overlap of content words
3- Jaccard overlap of stopwords

4- Difference in length between chunks 1 and 2
5- Difference in length between chunks 2 and 1
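Features 1-5 are simple set and length operations; a plain-Python sketch (function names are illustrative, not from the repository):

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| over two token lists."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0  # convention for two empty chunks
    return len(a & b) / len(a | b)

def length_differences(chunk1_tokens, chunk2_tokens):
    """Features 4 and 5: signed length differences in both directions."""
    return (len(chunk1_tokens) - len(chunk2_tokens),
            len(chunk2_tokens) - len(chunk1_tokens))
```

Features 1-3 are the same `jaccard` call applied to the different preprocessing variants (with punctuation, content words only, stopwords only).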

RAW

ONTOLOGY -> left_strip_tokenized_noPunct_noStopWords https://www.nltk.org/howto/wordnet.html#similarity

Note: I am not sure what happens with plural forms, e.g. dogs; past-tense verbs need to be lemmatized -> find the root of a word -> ate -> eat

https://aclanthology.org/N04-3012.pdf

LCS -> Least Common Subsumer -> the closest common ancestor

simulate_root=False

6- Max WordNet PATH SIMILARITY of all sense pairs (Pedersen et al., 2004): dog.path_similarity(cat)
7- Max WordNet LCH similarity of all sense pairs (Leacock and Chodorow, 1998): dog.lch_similarity(cat)
8- Max WordNet JCN similarity of all sense pairs (Jiang and Conrath, 1997): jcn_similarity

?? - WUP similarity, if you want
?? - res_similarity
?? - lin_similarity

simulate_root=True (the ontology root acts as the LCS) -> "normally turns out to give higher similarity scores"

9- Same as 6 but simulating the root with the maximum common subsumer
10- Same as 7 but simulating the root with the maximum common subsumer
11- Same as 8 but simulating the root with the maximum common subsumer


WORDNET RAW

12- Whether chunk 1 senses are more specific than chunk 2 senses in the WordNet hierarchy (Fellbaum, 1998)
13- Whether chunk 2 senses are more specific than chunk 1 senses in the WordNet hierarchy

-> check the WordNet level: the deepest synset is the most specific; return true if depth(w1) > depth(w2) # greater depth, more specific

14- Difference in WordNet depth of segment head -> the difference between features 12 and 13
15- Minimum value of pairwise difference of WordNet depth
16- Maximum value of pairwise difference of WordNet depth

ilopezgazpio commented 2 years ago

TODO inigo: how to get the depth of a synset

When obtaining synsets, e.g.:

from nltk.corpus import wordnet as wn
synset = wn.synsets('dog')
s1 = synset[0]

The resulting object is of type nltk.corpus.reader.wordnet.Synset; the class documentation is here: https://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html. The depth can be obtained with the s1.max_depth() and s1.min_depth() methods, which return the maximum or minimum path length (number of hops) over all possible paths between the synset and a root.