oudalab / fajita

Event Data Tagging Tool
MIT License
7 stars 3 forks

Test out the Arabic tokenizer and apply it to this Python document clustering. #149

Open YanLiang1102 opened 7 years ago

YanLiang1102 commented 7 years ago

Here are two reference links that might be useful: http://stackoverflow.com/questions/13035595/tokenization-of-arabic-words-using-nltk and http://brandonrose.org/clustering. But our data is big and can't fit in memory.
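The Stack Overflow thread above covers tokenizing Arabic with NLTK. As a dependency-free stand-in (not the NLTK tokenizer itself, just an illustration), a regex over the basic Arabic Unicode block can pull out word tokens:

```python
# Minimal sketch: extract runs of characters from the Arabic Unicode
# block (U+0600-U+06FF). This is an illustration, not NLTK's tokenizer.
import re

ARABIC_WORD = re.compile(r"[\u0600-\u06FF]+")

def tokenize_arabic(text):
    """Return Arabic word tokens in `text`; non-Arabic runs are dropped."""
    return ARABIC_WORD.findall(text)

tokens = tokenize_arabic("مرحبا بالعالم hello")  # keeps only the Arabic words
```

For the memory concern, tokenizing documents one at a time from a generator avoids loading the whole corpus at once.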

YanLiang1102 commented 7 years ago

This could be a free resource for our translation; it does document translation. We need to look into the code to see how the document translation uses the Microsoft Translator API. https://github.com/MicrosoftTranslator

YanLiang1102 commented 7 years ago

The Microsoft one we would need to pay for. This one looks more promising: http://www.sikher.com

YanLiang1102 commented 7 years ago

https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html

YanLiang1102 commented 7 years ago

Setup notes:

- Set up a Jupyter kernel with the Python virtual env: http://stackoverflow.com/questions/33496350/execute-python-script-within-jupyter-notebook-using-a-specific-virtualenv
- Dump MongoDB with credentials: `mongodump -h SERVER_NAME:PORT -d DATABASE_NAME -c collection_name -u DATABASE_USER -p PASSWORD`
- `scp` the dump from portland to hanover, then restore it in Mongo with `mongorestore --collection people --db accounts dump/`
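The dump step can also be driven from Python. A sketch that just assembles the argv from the command template in this comment (host, db, and credential values are the same placeholders, not real settings) and hands it to `subprocess`:

```python
# Sketch: build and run the mongodump command quoted above.
# All connection values are placeholders from the comment.
import subprocess

def mongodump_args(host, port, db, collection, user, password):
    """Assemble argv for `mongodump -h HOST:PORT -d DB -c COLL -u USER -p PASS`."""
    return ["mongodump", "-h", f"{host}:{port}", "-d", db,
            "-c", collection, "-u", user, "-p", password]

def run_dump(host, port, db, collection, user, password):
    # check=True raises CalledProcessError if mongodump exits non-zero.
    subprocess.run(mongodump_args(host, port, db, collection, user, password),
                   check=True)
```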

YanLiang1102 commented 7 years ago

After we get the document-topic matrix from LDA, we can apply k-means to cluster our documents, as this thread suggests: https://datascience.stackexchange.com/questions/2464/clustering-of-documents-using-the-topics-derived-from-latent-dirichlet-allocatio
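In practice one would get the document-topic vectors from an LDA library (e.g. gensim) and cluster with scikit-learn's `KMeans`; as a dependency-free sketch of just the k-means step, run over made-up topic distributions:

```python
# Plain-Python k-means over toy LDA document-topic vectors.
# Illustrative only: real code would use scikit-learn's KMeans.

def squared_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=20):
    # Naive deterministic init (first k points); real k-means uses k-means++.
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        # Assign each vector to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: squared_dist(v, centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return [min(range(k), key=lambda i: squared_dist(v, centroids[i]))
            for v in vectors]

# Toy document-topic distributions over 3 topics: two obvious groups.
doc_topics = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.05, 0.90],
    [0.10, 0.05, 0.85],
]
labels = kmeans(doc_topics, k=2)
```

Documents with similar topic mixtures end up in the same cluster, which is exactly the property the linked thread relies on.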