nhanpotter opened this issue 4 years ago
Yes, this current method is very inefficient, but I realised it quite late and did not have a chance to address it.
I think the best way is to use a spell checker. There are two ways that you can use it:
1. In the two functions create_embedding_matrix and word_embed_meta_data, to replace the manual exception handling.
2. Before the answers are added into the database, so that the answers stored there already have the correct spelling and there is no need to check spelling every time the model is trained.
I believe NLTK has a spell-checker library that can correct misspellings, or you could also use pyspellchecker, but I don't know how they work, so I'm not sure which library works best.
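For reference, a minimal pyspellchecker sketch (pip install pyspellchecker); the example words are only illustrations, and the NLTK route would need a similar lookup against its own word lists:

```python
# Minimal pyspellchecker sketch; not tied to the existing code.
from spellchecker import SpellChecker

spell = SpellChecker()  # loads the default English frequency dictionary

# correction() returns the most probable in-dictionary word
# (or the input/None if nothing better is found).
print(spell.correction("definately"))       # expected: "definitely"

# unknown() keeps only the tokens that are not in the dictionary,
# which is useful for spell-checking just the words that need it.
print(spell.unknown(["answer", "anwser"]))  # expected: {"anwser"}
```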
That reminds me of another issue: student answers that contain O(n), big O, or other mathematical terms. If I remember correctly, such an answer becomes "o n" after pre-processing, which raises an exception since "o" and "n" are not in the dictionary. So I suggest adding a case to those two functions that translates commonly used terms, such as O(n), into natural words that the model can embed.
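Something along these lines could work for the mathematical terms; the mapping and the function name are only illustrative, not part of the existing pre-processing:

```python
import re

# Illustrative mapping from common notation to words the embedding
# vocabulary is likely to contain; extend it for the course material.
MATH_TERM_MAP = {
    r"\bo\s*\(\s*n\s*\)": "linear time complexity",
    r"\bo\s*\(\s*1\s*\)": "constant time complexity",
    r"\bbig\s*-?\s*o\b": "asymptotic complexity",
}

def translate_math_terms(text):
    """Replace common mathematical notation with plain words before tokenization."""
    text = text.lower()
    for pattern, replacement in MATH_TERM_MAP.items():
        text = re.sub(pattern, replacement, text)
    return text

# translate_math_terms("The lookup is O(n)") -> "the lookup is linear time complexity"
```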
Do you think it is better to also use Stemming and Lemmatization?
Yes, stemming and lemmatization can improve the performance, but I don't think it will address this issue completely.
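To illustrate that point, a small NLTK lemmatization example: it folds inflected forms onto dictionary entries, but a genuine misspelling passes through unchanged, so it cannot replace the spell checker.

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time corpus download
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("answers"))  # expected: "answer"
print(lemmatizer.lemmatize("anwsers"))  # expected: "anwsers" (misspelling untouched)
```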
I saw that these two functions, create_embedding_matrix and word_embed_meta_data, manually handle the exception cases from the dataset you're using. However, if another dataset is used, there will be more exception cases, and I think it's inefficient to handle these manually. May I ask whether you have any ideas on how to resolve this?
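For illustration, a hypothetical sketch of how the spell-checker fallback discussed above could replace the hard-coded exception cases when the matrix is built; build_embedding_matrix, word_index, and embeddings are assumed names, and the real signature of create_embedding_matrix may differ:

```python
import numpy as np
from spellchecker import SpellChecker

def build_embedding_matrix(word_index, embeddings, embedding_dim):
    """Hypothetical replacement for the manual exception handling.

    word_index: dict mapping token -> row index (assumed to start at 1)
    embeddings: dict mapping token -> pre-trained vector
    """
    spell = SpellChecker()
    matrix = np.zeros((len(word_index) + 1, embedding_dim))
    for word, idx in word_index.items():
        vector = embeddings.get(word)
        if vector is None:
            # Try an automatic spelling correction instead of a hand-written case.
            corrected = spell.correction(word)
            if corrected:
                vector = embeddings.get(corrected)
        if vector is not None:
            matrix[idx] = vector
        # Words that are still unknown keep the zero row instead of raising.
    return matrix
```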