vemichelleve / finalcode


Manually Handling Exception Cases #2

Open nhanpotter opened 4 years ago

nhanpotter commented 4 years ago

I saw that the two functions create_embedding_matrix and word_embed_meta_data manually handle the exception cases from the dataset you're using. However, if another dataset is used, there will be more exception cases, and handling them manually seems inefficient. May I ask whether you have any ideas on how to resolve this?

vemichelleve commented 4 years ago

Yes, the current method is very inefficient, but I realised this quite late and did not have a chance to address it.

I think the best way is to use a spell checker. There are two options that come to mind:

- I believe NLTK has spell-correction utilities that can fix misspellings.
- You could also use pyspellchecker, a dedicated spell-checking library.

I haven't used either, so I'm not sure which library works best.
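To make the idea concrete, here is a minimal sketch of how out-of-vocabulary words could be corrected automatically before building the embedding matrix. The vocabulary and the hand-rolled edit-distance-1 corrector are illustrative assumptions; in practice a library such as pyspellchecker would replace them.

```python
import string

# Hypothetical vocabulary; in the real project this would be the set of
# words covered by the pre-trained embeddings.
VOCAB = {"algorithm", "complexity", "sorting", "search", "linear"}

def edits1(word):
    """All strings one edit (delete, replace, or insert) away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + replaces + inserts)

def correct(word, vocab=VOCAB):
    """Return `word` if it is known, otherwise a vocabulary word one edit
    away; fall back to the original word when nothing matches."""
    if word in vocab:
        return word
    candidates = edits1(word) & vocab
    return min(candidates) if candidates else word
```

With this in place, a token like "algoritm" would be mapped to "algorithm" instead of raising an exception, and truly unknown tokens pass through unchanged for whatever fallback handling the functions already do.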

vemichelleve commented 4 years ago

This reminds me of another issue: student answers sometimes contain O(n), big O, or other mathematical terms. If I remember correctly, "O(n)" becomes "o n" after pre-processing, which raises an exception since "o" and "n" are not in the dictionary. So I suggest adding a new case to those two functions that translates commonly used terms, such as O(n), into natural words that the model can embed.
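A sketch of that suggestion: run a small translation table over the raw answer text before the usual pre-processing strips the punctuation. The table entries here are hypothetical examples; the real mappings would depend on what notation actually appears in the dataset.

```python
import re

# Hypothetical translation table for commonly used mathematical notation.
# Patterns are applied in order, case-insensitively.
TERM_MAP = {
    r"\bO\(n\)": "linear time",
    r"\bO\(1\)": "constant time",
    r"\bbig o\b": "big oh",
}

def normalize_terms(text):
    """Replace known mathematical notation with natural-language words so
    that later tokenisation never produces fragments like 'o' and 'n'."""
    for pattern, replacement in TERM_MAP.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text
```

Calling this first means the embedding lookup only ever sees ordinary dictionary words, so no manual exception case is needed for these terms.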

nhanpotter commented 4 years ago

Do you think it would also help to use stemming and lemmatization?

vemichelleve commented 4 years ago

Yes, stemming and lemmatization can improve performance, but I don't think they will address this issue completely.
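For illustration, a crude suffix-stripping stemmer is sketched below; a real project would use NLTK's PorterStemmer or WordNetLemmatizer instead. It also shows why stemming alone does not fix the problem: a misspelled word like "serching" stays out of vocabulary whether or not its suffix is stripped.

```python
# Illustrative suffix rules only; real stemmers (e.g. NLTK's Porter
# stemmer) are far more complete than this sketch.
SUFFIXES = ["ing", "ed", "es", "s"]

def crude_stem(word):
    """Strip one common English suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

So stemming maps "sorting" and "sorted" to the same root, which helps the embedding lookup, but spelling correction is still needed as a separate step.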