vemichelleve / finalcode


Manually Handling Exception Cases #2

Open nhanpotter opened 4 years ago

nhanpotter commented 4 years ago

I saw that the two functions create_embedding_matrix and word_embed_meta_data manually handle the exception cases from the dataset you're using. However, if another dataset is used, there will be more exception cases, and handling them manually seems inefficient. May I ask whether you have any ideas on how to resolve this?

vemichelleve commented 4 years ago

Yes, the current method is very inefficient, but I realised this quite late and did not have a chance to address it.

I think the best way is to use a spell checker. There are two options that come to mind:

- I believe NLTK has spell-correction utilities that can fix misspellings.
- You could also use pyspellchecker, a dedicated spell-checking library.

I haven't used either, so I'm not sure which library works best.
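To make the idea concrete, here is a minimal sketch of how out-of-vocabulary words could be corrected automatically before building the embedding matrix. The vocabulary and the hand-rolled edit-distance-1 corrector are illustrative assumptions; in practice a library such as pyspellchecker would replace them.

```python
import string

# Hypothetical vocabulary; in the real project this would be the set of
# words covered by the pre-trained embeddings.
VOCAB = {"algorithm", "complexity", "sorting", "search", "linear"}

def edits1(word):
    """All strings one edit (delete, replace, or insert) away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + replaces + inserts)

def correct(word, vocab=VOCAB):
    """Return `word` if it is known, otherwise a vocabulary word one edit
    away; fall back to the original word when nothing matches."""
    if word in vocab:
        return word
    candidates = edits1(word) & vocab
    return min(candidates) if candidates else word
```

With this in place, a token like "algoritm" would be mapped to "algorithm" instead of raising an exception, and truly unknown tokens pass through unchanged for whatever fallback handling the functions already do.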

vemichelleve commented 4 years ago

This reminds me of another issue: student answers sometimes contain O(n), big O, or other mathematical terms. If I remember correctly, "O(n)" becomes "o n" after pre-processing, which raises an exception since "o" and "n" are not in the dictionary. So I suggest adding a new case to those two functions that translates commonly used terms, such as O(n), into natural words that the model can embed.
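A sketch of that suggestion: run a small translation table over the raw answer text before the usual pre-processing strips the punctuation. The table entries here are hypothetical examples; the real mappings would depend on what notation actually appears in the dataset.

```python
import re

# Hypothetical translation table for commonly used mathematical notation.
# Patterns are applied in order, case-insensitively.
TERM_MAP = {
    r"\bO\(n\)": "linear time",
    r"\bO\(1\)": "constant time",
    r"\bbig o\b": "big oh",
}

def normalize_terms(text):
    """Replace known mathematical notation with natural-language words so
    that later tokenisation never produces fragments like 'o' and 'n'."""
    for pattern, replacement in TERM_MAP.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text
```

Calling this first means the embedding lookup only ever sees ordinary dictionary words, so no manual exception case is needed for these terms.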

nhanpotter commented 4 years ago

Do you think it would also help to use stemming and lemmatization?

vemichelleve commented 4 years ago

Yes, stemming and lemmatization can improve performance, but I don't think they will address this issue completely.
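For illustration, a crude suffix-stripping stemmer is sketched below; a real project would use NLTK's PorterStemmer or WordNetLemmatizer instead. It also shows why stemming alone does not fix the problem: a misspelled word like "serching" stays out of vocabulary whether or not its suffix is stripped.

```python
# Illustrative suffix rules only; real stemmers (e.g. NLTK's Porter
# stemmer) are far more complete than this sketch.
SUFFIXES = ["ing", "ed", "es", "s"]

def crude_stem(word):
    """Strip one common English suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

So stemming maps "sorting" and "sorted" to the same root, which helps the embedding lookup, but spelling correction is still needed as a separate step.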