gipplab / math2vec

Set of tools we used to create, cultivate and process datasets for our math2vec project.

Kind of N-Grams (merging nouns and adj-nouns) #4

Open AndreG-P opened 5 years ago

AndreG-P commented 5 years ago

We should merge some words because we have to identify them as one entity. A classic example is distinguishing between integer, positive integer, and negative integer.

After discussions with Aizawa-sensei and others, we probably can just apply 2 simple rules.

  1. Merge consecutive nouns, e.g., Catalan number -> catalan_number.
  2. Merge consecutive adjective-nouns, e.g., arbitrary positive integer -> arbitrary_positive_integer.

@truas you mentioned we should avoid too long chains. So I would say we only allow a maximum length of 3 words? @physikerwelt what do you think?
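A rough sketch of how rules 1 and 2 plus the proposed length cap might work on already-tagged input (hypothetical illustration, not the project code; tag names are the Penn Treebank set that NLTK's `pos_tag()` emits):

```python
# Merge consecutive noun/adjective chains that end in a noun into a
# single underscore-joined token, capped at 3 words.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS = {"JJ", "JJR", "JJS"}
MAX_CHAIN = 3  # proposed maximum chain length from the discussion

def merge_chains(tagged):
    """tagged: list of (word, tag) pairs; returns a list of tokens."""
    out, i = [], 0
    while i < len(tagged):
        j = i
        # extend while we see adjectives or nouns, up to MAX_CHAIN words
        while (j < len(tagged) and j - i < MAX_CHAIN
               and tagged[j][1] in NOUN_TAGS | ADJ_TAGS):
            j += 1
        # a valid chain has >= 2 words and must end in a noun
        while j - i >= 2 and tagged[j - 1][1] not in NOUN_TAGS:
            j -= 1
        if j - i >= 2:
            out.append("_".join(w.lower() for w, _ in tagged[i:j]))
            i = j
        else:
            out.append(tagged[i][0].lower())
            i += 1
    return out

print(merge_chains([("Catalan", "NNP"), ("number", "NN")]))
# ['catalan_number']
print(merge_chains([("arbitrary", "JJ"), ("positive", "JJ"), ("integer", "NN")]))
# ['arbitrary_positive_integer']
```

Lone adjectives and words outside the kept tags pass through unchanged, so a cap of 3 simply splits longer runs rather than dropping them.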

@truas A question about the PoS tagger. Our first example in the gold standard contains W as the Van der Waerdens number. Do you know whether this can be merged by our rules 1 and 2? I wonder how the PoS tagger tags der between Van and Waerden.

truas commented 5 years ago

For (1): if we are looking for bi-grams/tri-grams, we can produce what you mentioned in (1). I think the main point is which n-grams to keep in the end. In a statistical approach it would be more interesting to do some sort of BOW + tf-idf, but in our case that won't work well. The dictionary approach would be best, I think, because we can produce the n-grams per document and keep only those we know should be together.

Otherwise we will need to do collocation/vocabulary analysis to keep only the n-grams that appear most often in the entire corpus. I'm not a big fan of this for two reasons: (i) word2vec was designed to deal with exactly this kind of problem, and (ii) we would have to parse the entire corpus, produce the n-grams, and find the "best" cutoff to leave only the n-grams that appear in X% of the documents.

truas commented 5 years ago

For (2), once everything is tagged, this should be easy. I already have the code for tagging using NLTK; we just need to decide which words (tags) to put together and how far apart they may be.

In your example, Van der Waerdens number would look something like this:

('Van', 'NNP'), ('der', 'NN'), ('Waerdens', 'NNP'), ('number', 'NN')

or

('Van_der_Waerdens_number', 'NNP')

truas commented 5 years ago

@AndreG-P, since you want to put together noun-noun and adjective-noun, I will need to change the approach. We cannot do stopword removal before the POS tagging, because the tagger relies on classification rules and context to infer tags. So we should tag everything first and then do whatever we want.

I will alter things so we start with the tagging first, then the rest. Once all tags are there, we can remove the stopwords, lowercase, stem, or anything else.

T.

truas commented 5 years ago

@AndreG-P ,

I made some changes to the code in order to:

  1. Perform a proper POS tag (considering the stopwords)
  2. Remove stopwords
  3. Lowercase, keeping math-? intact
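Step 3 could be sketched like this (assuming, hypothetically, that the math placeholders are tokens prefixed with "math-", which is what the "math-?" above suggests):

```python
# Lowercase every token except the math placeholder tokens, which are
# assumed here to be marked with a "math-" prefix.
def lowercase_keep_math(tokens):
    return [t if t.startswith("math-") else t.lower() for t in tokens]

print(lowercase_keep_math(["The", "math-W", "Number"]))
# ['the', 'math-W', 'number']
```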

Now comes the issue you reported here. I'm not sure that combining "similar" POS tags will improve what we have that much. However, I implemented a method that handles the example you gave above. I consider the following POS tags: keep_tags = ['JJ','JJR','JJS','NN','NNS','NNP','NNPS']. They are all nouns and adjectives, according to the NLTK pos_tag() tag list.

As of right now, I'm only keeping a continuous window of two consecutive elements (POS tags). For example:
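A minimal sketch of that two-element window (hypothetical, not the actual repository code): whenever two consecutive tokens both carry a tag in keep_tags, they are joined with an underscore and the scan skips ahead.

```python
# Bigram merge over (word, tag) pairs: join two consecutive tokens when
# both tags are in the kept noun/adjective set.
keep_tags = ["JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS"]

def merge_pairs(tagged):
    out, i = [], 0
    while i < len(tagged):
        if (i + 1 < len(tagged)
                and tagged[i][1] in keep_tags
                and tagged[i + 1][1] in keep_tags):
            out.append(tagged[i][0] + "_" + tagged[i + 1][0])
            i += 2  # consume both tokens of the merged pair
        else:
            out.append(tagged[i][0])
            i += 1
    return out

print(merge_pairs([("ice", "NN"), ("cream", "NN"), ("melts", "VBZ")]))
# ['ice_cream', 'melts']
```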

If we really want to concatenate more than two "similar" items we could, but I would need more time to work out a good algorithm. However, since our documents (sentences) are pretty small, this would honestly do more harm than good. I still have my doubts whether concatenating will do any good at all. Let me know if this and the other items (#10, #2, #1) are good to close.

T.

AndreG-P commented 5 years ago

@truas Wait, I don't get the example. I think we should merge first and remove the stopwords afterwards. I don't see why we get mom_kid, cream_potatoes, etc. Based on the input I would expect this:

mom kid like eat ice_cream potatos run sunny_park every day really keeping cat eating stuff

AndreG-P commented 5 years ago

@truas I think we should do the following:

  1. PoS tagging paragraph-wise (line-wise)
  2. Truncate the last s of plural nouns (maybe we should use a dictionary here and check whether the truncated version is in the dictionary?)
  3. Merge noun-noun and adjective-noun chains (also chains of noun-noun-noun and adjective-adjective-noun-noun)
  4. Delete all stopwords
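The four steps might be sketched as follows (a hedged illustration, not the project code: step 1 is assumed done by NLTK's pos_tag, and the dictionary is a stand-in for whatever word list would actually be used):

```python
# End-to-end sketch over (word, tag) pairs: depluralize (step 2),
# merge noun/adjective chains ending in a noun (step 3), then drop
# stopwords and lowercase (step 4).
NOUNS = {"NN", "NNS", "NNP", "NNPS"}
ADJS = {"JJ", "JJR", "JJS"}
STOPWORDS = {"the", "a", "an", "of", "and"}
DICTIONARY = {"number", "integer"}  # illustrative stand-in

def depluralize(word, tag):
    # step 2: drop a trailing 's' from plural nouns, but only when the
    # singular form is found in the dictionary
    if tag in NOUNS and word.endswith("s") and word[:-1].lower() in DICTIONARY:
        return word[:-1]
    return word

def pipeline(tagged):
    tokens = [(depluralize(w, t), t) for w, t in tagged]       # step 2
    merged, i = [], 0
    while i < len(tokens):                                     # step 3
        j = i
        while j < len(tokens) and tokens[j][1] in NOUNS | ADJS:
            j += 1
        while j - i >= 2 and tokens[j - 1][1] not in NOUNS:
            j -= 1  # a chain must end in a noun
        if j - i >= 2:
            merged.append(("_".join(w for w, _ in tokens[i:j]), "NN"))
            i = j
        else:
            merged.append(tokens[i])
            i += 1
    return [w.lower() for w, t in merged                       # step 4
            if w.lower() not in STOPWORDS]

print(pipeline([("the", "DT"), ("Catalan", "NNP"), ("numbers", "NNS")]))
# ['catalan_number']
```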

truas commented 5 years ago

@AndreG-P

  1. Done. I'm assuming you want to POS-tag each word in the sentence, right? I don't think tagging the entire sentence as one token would do us any good, ha-ha (it can be done, though).
  2. Done (See my last update on #1)
  3. Any combination of nouns and adjectives is taken care of. Let's not do 3-grams on this; our "corpus to be embedded" will end up too short, and doc2vec will surely produce lousy vectors.
  4. Done. What do you think of keeping verbs so we have some variety in the corpus? If we remove them, it wouldn't make sense to keep other POS, like adverbs. Unless you already know that verbs only produce noise.

T.