**YanLiang1102** opened 7 years ago
This could be a free option for our translation. It does document translation; we need to look into the code to see how it uses the Microsoft Translator API: https://github.com/MicrosoftTranslator

The Microsoft service itself is paid, so this one looks more promising: http://www.sikher.com
Set up a Jupyter kernel with the Python virtualenv: http://stackoverflow.com/questions/33496350/execute-python-script-within-jupyter-notebook-using-a-specific-virtualenv

MongoDB dump with credentials:

```shell
mongodump -h SERVER_NAME:PORT -d DATABASE_NAME -c collection_name -u DATABASE_USER -p PASSWORD
```

Then `scp` the dump from Portland to Hanover and restore it with `mongorestore`:

```shell
mongorestore --collection people --db accounts dump/
```
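The kernel-registration step from that StackOverflow link can be sketched roughly as follows; the environment name `venv` and display name are illustrative assumptions, not names from our setup:

```shell
# Assumes a virtualenv named "venv" already exists in the current directory.
source venv/bin/activate
pip install ipykernel
# Register the active environment as a selectable Jupyter kernel.
python -m ipykernel install --user --name=venv --display-name "Python (venv)"
```

After this, the environment shows up in the notebook's kernel picker under "Python (venv)".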
After we get the document-topic matrix from LDA, we can apply k-means to cluster our documents, as this thread suggests: https://datascience.stackexchange.com/questions/2464/clustering-of-documents-using-the-topics-derived-from-latent-dirichlet-allocatio
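A minimal sketch of that pipeline with scikit-learn (the toy corpus, topic count, and cluster count are illustrative assumptions, not our real data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

docs = [
    "the economy and trade policy",
    "trade tariffs hurt the economy",
    "the team won the football match",
    "football season starts with a big match",
]

# Bag-of-words counts, then LDA to get a (n_docs x n_topics) matrix.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# k-means on the topic proportions groups documents by theme.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(doc_topics)
print(doc_topics.shape, labels)
```

Each row of `doc_topics` is a topic-proportion vector for one document, so k-means is clustering in topic space rather than raw word space.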
Here are two reference links that might be useful: http://stackoverflow.com/questions/13035595/tokenization-of-arabic-words-using-nltk and http://brandonrose.org/clustering. But our data is big and can't fit in memory.