ilopezgazpio opened this issue 3 years ago
TODO inigo: get the depth of a synset.

When obtaining synsets, e.g.:

```python
from nltk.corpus import wordnet as wn

synsets = wn.synsets('dog')
s1 = synsets[0]
```

The object you get is of type `nltk.corpus.reader.wordnet.Synset`; the class documentation is here: https://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html. To get the depth, use the `s1.max_depth()` and `s1.min_depth()` methods: among all possible paths between the synset and the root, they return the longest or the shortest path (number of hops).
Generic part
1.- Think about what preprocessing the left and right columns will need:
-> left_strip / right_strip : .strip()
-> left_strip_tokenized : as in NLTK, but without removing stop words https://stackabuse.com/removing-stop-words-from-strings-in-python/
   result_vector = [word for word in text_tokens]
-> left_strip_tokenized_noPunct : https://www.kite.com/python/answers/how-to-remove-all-punctuation-marks-with-nltk-in-python
-> left_strip_tokenized_noPunct_noStopWords : https://stackabuse.com/removing-stop-words-from-strings-in-python/
-> left_strip_tokenized_noPunct_stopwords : https://stackabuse.com/removing-stop-words-from-strings-in-python/ but with the `if` inverted:
   array_stopwords = [word for word in text_tokens if word in stopwords.words()]
left_strip_tokenized_noPunct_noStopWords = the content words
Features
1- Jaccard overlap of left_strip_tokenized (with punctuation)
2- Jaccard overlap of content words
3- Jaccard overlap of stopwords
4- Difference in length between chunks 1 and 2
5- Difference in length between chunks 2 and 1
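The overlap and length features above can be sketched as two small helpers (the edge-case value for two empty chunks is my choice):

```python
def jaccard_overlap(tokens1, tokens2):
    """Jaccard similarity: |intersection| / |union| of the two token sets."""
    s1, s2 = set(tokens1), set(tokens2)
    if not s1 and not s2:
        return 1.0  # assumption: two empty chunks count as identical
    return len(s1 & s2) / len(s1 | s2)

def length_difference(tokens1, tokens2):
    """Signed length difference (feature 4; feature 5 swaps the arguments)."""
    return len(tokens1) - len(tokens2)

# 2 shared tokens ('the', 'barks') out of 4 distinct ones
print(jaccard_overlap(['the', 'dog', 'barks'], ['the', 'cat', 'barks']))  # 0.5
```

The same `jaccard_overlap` is applied to the three token columns (with punctuation, content words only, stopwords only) to get features 1-3.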
RAW
ONTOLOGY -> left_strip_tokenized_noPunct_noStopWords https://www.nltk.org/howto/wordnet.html#similarity
Note: I don't know what will be passed in, e.g. 'dogs' or past-tense verbs; lemmatize them to find the root of a word: ate -> eat
https://aclanthology.org/N04-3012.pdf
LCS -> Least Common Subsumer -> closest common ancestor
simulate_root=False
6- Max WordNet PATH SIMILARITY of all sense pairs (Pedersen et al., 2004): dog.path_similarity(cat)
7- Max WordNet LCH similarity of all sense pairs (Leacock and Chodorow, 1998): dog.lch_similarity(cat)
8- Max WordNet JCN similarity of all sense pairs (Jiang and Conrath, 1997): jcn_similarity
?? - WUP similarity, if you want it
?? - res_similarity
?? - lin_sim
simulate_root=True (the ontology root acts as the LCS) -> "normally tends to give higher similarity scores"
9- Same as 6 but simulating root with the maximum common subsumer
10- Same as 7 but simulating root with the maximum common subsumer
11- Same as 8 but simulating root with the maximum common subsumer
WORDNET RAW
12- Whether chunk 1 senses are more specific than chunk 2 senses in the WordNet hierarchy (Fellbaum, 1998)
13- Whether chunk 2 senses are more specific than chunk 1 senses in the WordNet hierarchy
-> check the WordNet level: the deepest is the most specific
return true if depth(w1) > depth(w2)  # greater depth, more specific
14- Difference in WordNet depth of segment heads -> the difference between 12 and 13
15- Minimum value of pairwise difference of WordNet depth
16- Maximum value of pairwise difference of WordNet depth