Open AndreG-P opened 5 years ago
For (1): if we are looking for bi-grams/tri-grams, we can produce what you mentioned in (1). I think the main question is which n-grams to keep in the end. If this were a statistical approach, it would be more interesting to do some sort of BOW + tf-idf, but in our case that won't work well. The dictionary approach would be the best option, I think, because we can produce the n-grams per document and keep those we know should stay together.
Otherwise we will need a collocation/vocabulary step to keep only the n-grams that appear most often in the entire corpus. I'm not a big fan of this for two reasons: (i) word2vec came along to deal with exactly this kind of thing, and (ii) we would have to parse the entire corpus, produce the n-grams, and find the "best" cut so that only n-grams appearing in X% of the documents remain.
For (2), once everything is tagged, this should be easy. I already have the code for tagging with NLTK; we just need to decide which words (tags) to put together and over what distance.
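To make point (ii) concrete, here is a minimal sketch of such a document-frequency cut. The function names, the toy corpus, and the 60% threshold are assumptions for illustration only; this is not code from the repository.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_ngrams(documents, n=2, min_doc_ratio=0.5):
    """Keep the n-grams that occur in at least min_doc_ratio of the documents."""
    doc_freq = Counter()
    for tokens in documents:
        # Count each n-gram once per document (document frequency).
        doc_freq.update(set(ngrams(tokens, n)))
    threshold = min_doc_ratio * len(documents)
    return {gram for gram, count in doc_freq.items() if count >= threshold}


# Toy corpus (hypothetical sentences, already tokenized).
docs = [
    "the catalan number is an integer".split(),
    "every catalan number is positive".split(),
    "a positive integer is a number".split(),
]
kept = frequent_ngrams(docs, n=2, min_doc_ratio=0.6)
print(kept)
```

On this toy corpus the cut keeps `('catalan', 'number')` but also `('number', 'is')`, which shows why a purely frequency-based cut needs the whole corpus and still lets uninteresting n-grams through.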
In your example, Van der Waerdens number would look something like this:
('Van', 'NNP'), ('der', 'NN'), ('Waerdens', 'NNP'), ('number', 'NN')
or
('Van_der_Waerdens_number', 'NNP')
@AndreG-P , since you want to put together noun-noun and noun-adjective pairs, I will need to change the approach. We cannot remove stopwords before POS tagging, because the tagger relies on context to infer the tags. In this case, we should tag everything first and then do whatever else we want.
I will alter things so we start with the tagging and then do the rest. Once all tags are there, we can remove the stopwords, lowercase, stem, or anything else.
T.
@AndreG-P ,
I made some changes in the code in order to:
math-?
intact
Now comes the issue you reported here. I'm not sure that combining "similar" POS tags will improve what we have that much. However, I implemented a method that handles the example you put above. I consider the following POS tags: keep_tags = ['JJ','JJR','JJS','NN','NNS','NNP','NNPS']. They are all nouns and adjectives, according to the NLTK pos_tag() list.
For now, I'm only merging within a sliding window of two consecutive tagged elements. For example:
the mom and the kid like to eat ice cream and potatos, but they run in the sunny park every day really keeping the cat eating more stuff
mom_kid kid like eat ice_cream cream_potatos potatos run sunny_park park every day really keeping cat eating stuff
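The behaviour in the example above can be sketched roughly as follows. This is a reconstruction from the input/output pair, not the actual implementation: the tags are hand-assigned (not real pos_tag() output), the stopwords are assumed to be removed before merging (as the output suggests), and `merge_pairs` is a hypothetical name.

```python
KEEP_TAGS = {"JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS"}


def merge_pairs(tagged):
    """Slide a window of two over (word, tag) pairs.

    If both elements of the window carry a tag in KEEP_TAGS, emit
    "word_next"; otherwise emit the first word on its own.
    """
    out = []
    for i, (word, tag) in enumerate(tagged):
        nxt = tagged[i + 1] if i + 1 < len(tagged) else None
        if tag in KEEP_TAGS and nxt is not None and nxt[1] in KEEP_TAGS:
            out.append(word + "_" + nxt[0])
        else:
            out.append(word)
    return out


# Hand-assigned tags for the example sentence, stopwords already removed.
tagged = [
    ("mom", "NN"), ("kid", "NN"), ("like", "VBP"), ("eat", "VB"),
    ("ice", "NN"), ("cream", "NN"), ("potatos", "NNS"), ("run", "VBP"),
    ("sunny", "JJ"), ("park", "NN"), ("every", "DT"), ("day", "NN"),
    ("really", "RB"), ("keeping", "VBG"), ("cat", "NN"),
    ("eating", "VBG"), ("stuff", "NN"),
]
print(" ".join(merge_pairs(tagged)))
```

Because the stopwords are already gone when the window slides, mom and kid become neighbours, and because the window advances by one element, cream shows up in both ice_cream and cream_potatos.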
If we really want to concatenate more than two "similar" items we could, but I would need more time to work on a good algorithm. However, since our documents (sentences) are pretty small, I honestly think this would do more harm than good. I still have my doubts about whether concatenating will do any good. Let me know if this and the other items (#10, #2, #1) are good to close.
T.
@truas
Wait, I don't get the example. I think we should merge first and remove the stopwords after that. I don't get why we have mom_kid, cream_potatoes, etc. Based on the input I would expect this:
mom kid like eat ice_cream potatos run sunny_park every day really keeping cat eating stuff
@truas I think we should do the following:
1) PoS-Tagging paragraph-wise (line-wise)
2) Truncate the last s of nouns (maybe we should use a dictionary here and check whether the truncated version is in the dictionary?)
3) Merge noun-noun and adjective-noun chains (also longer chains such as noun-noun-noun and adjective-adjective-noun-noun)
4) Delete all stop-words
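The four steps above could be sketched like this. It is only an illustration under several assumptions: step 1 is represented by hand-assigned tags instead of a real tagger, the stopword list and dictionary are tiny placeholders, a chain is taken to be zero or more adjectives followed by one or more nouns, and all function names are hypothetical.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS = {"JJ", "JJR", "JJS"}
# Tiny illustrative stopword list and dictionary; both are assumptions.
STOPWORDS = {"the", "and", "to", "in"}
DICTIONARY = {"potato", "mom", "kid", "park"}


def truncate_plural(word, tag):
    """Step 2: drop a trailing 's' of plural nouns if the result is a known word."""
    if tag in {"NNS", "NNPS"} and word.endswith("s") and word[:-1] in DICTIONARY:
        return word[:-1]
    return word


def merge_chains(tagged):
    """Step 3: merge chains of the form adjective* noun+ with two or more words."""
    out, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        while j < n and tagged[j][1] in ADJ_TAGS:   # leading adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] in NOUN_TAGS:  # following nouns
            k += 1
        if k > j and k - i >= 2:
            out.append("_".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            out.append(tagged[i][0])
            i += 1
    return out


def preprocess(tagged):
    """Run steps 2-4 on an already tagged line (step 1)."""
    singular = [(truncate_plural(w, t), t) for w, t in tagged]
    merged = merge_chains(singular)
    # Step 4: stopwords are deleted last, so they still break chains.
    return [tok for tok in merged if tok.lower() not in STOPWORDS]


# Step 1 stand-in: hand-assigned tags for a shortened example sentence.
tagged = [
    ("the", "DT"), ("mom", "NN"), ("and", "CC"), ("the", "DT"), ("kid", "NN"),
    ("like", "VBP"), ("to", "TO"), ("eat", "VB"), ("ice", "NN"),
    ("cream", "NN"), ("and", "CC"), ("potatos", "NNS"), ("in", "IN"),
    ("the", "DT"), ("sunny", "JJ"), ("park", "NN"),
]
print(" ".join(preprocess(tagged)))
```

Since the stopwords are still present during merging, mom and kid stay separate and cream is not glued to potatos, which is the difference from the pairwise window discussed earlier in the thread.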
@AndreG-P
T.
We should merge some words because we have to identify them as one entity. A classical example is distinguishing between integer, positive integer, and negative integer.
After discussions with Aizawa-sensei and others, we can probably just apply 2 simple rules.
1) Catalan number -> catalan_number.
2) arbitrary positive integer -> arbitrary_positive_integer.
@truas you mentioned we should avoid chains that are too long. So I would say we only allow a maximum length of 3 words? @physikerwelt what do you think?
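A length cap like this could look roughly as follows. The sketch assumes the chain of words has already been identified, lowercases as in the two examples, and simply leaves chains longer than the cap unmerged. That last choice is an assumption, since the thread does not say what should happen to longer chains, and `join_chain` is a hypothetical name.

```python
def join_chain(words, max_len=3):
    """Lowercase one adjective/noun chain and join it if it is short enough."""
    lowered = [w.lower() for w in words]
    if 2 <= len(lowered) <= max_len:
        return ["_".join(lowered)]
    # Assumption: single words and chains longer than max_len stay unmerged.
    return lowered


print(join_chain(["Catalan", "number"]))
print(join_chain(["arbitrary", "positive", "integer"]))
print(join_chain(["Van", "der", "Waerdens", "number"]))
```

With a cap of 3, a four-word chain such as Van der Waerdens number would not be merged under this policy, so the cap and the gold-standard examples need to be checked against each other.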
@truas A question about the PoS tagger. Our first example in the gold standard contains W as the Van der Waerdens number. Do you know if this is possible to merge by our rules 1 and 2? I wonder how the PoS tagger tags der between Van and Waerden.