Open YanLiang1102 opened 7 years ago
One thing to be aware of: we store every actor name that the coders tagged, so we need an intelligent way to clean them up. For a word like القوات, for example, we have to choose which variant is the best one to keep.
This can be done using wiki: if the old service can find the name, use the old service; if it can't, fall back to wiki to find the name. That way we can fully utilize our existing English dictionary.
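A minimal sketch of that fallback order, assuming both lookups can be wrapped as functions that take an Arabic name and return an English name or `None`. The names `resolve_english_name`, `old_service`, and `wiki` are hypothetical stand-ins, not the project's real interfaces.

```python
from typing import Callable, Optional

# A lookup takes an Arabic name and returns the English name, or None if not found.
Lookup = Callable[[str], Optional[str]]


def resolve_english_name(ar_name: str, old_service: Lookup, wiki: Lookup) -> Optional[str]:
    """Try the old service first; only go to wiki when it finds nothing."""
    name = old_service(ar_name)
    if name is not None:
        return name
    return wiki(ar_name)


# Usage with in-memory stand-ins for the two lookups (placeholder data).
old_hits = {"name_a": "ACTOR_A"}
wiki_hits = {"name_a": "Actor_A_From_Wiki", "name_b": "ACTOR_B"}

print(resolve_english_name("name_a", old_hits.get, wiki_hits.get))  # old service wins
print(resolve_english_name("name_b", old_hits.get, wiki_hits.get))  # falls back to wiki
```

Passing the two lookups in as arguments keeps the order of preference in one place and makes it easy to test without hitting either service.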
We can't get the full response HTML using Python requests, so we can't see the ar_url for some names. This page should help: https://stackoverflow.com/questions/37969536/why-are-lis-not-showing-up-with-python-requests-response We need to use Selenium to mimic a browser so that it returns everything.
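Once the fully rendered HTML is available, pulling out the ar_url could look roughly like this. It is only a sketch: it assumes the Arabic link is an `<a>` tag with class `interlanguage-link-target` and `lang="ar"`, which should be checked against the actual page, and `find_ar_url` is a hypothetical helper name.

```python
from html.parser import HTMLParser
from typing import Optional


class ArabicLinkParser(HTMLParser):
    """Remember the href of the first interlanguage link marked lang="ar"."""

    def __init__(self) -> None:
        super().__init__()
        self.ar_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.ar_url is not None:
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        # Assumed markup: class and lang attributes identify the Arabic link.
        if "interlanguage-link-target" in classes and attr.get("lang") == "ar":
            self.ar_url = attr.get("href")


def find_ar_url(html: str) -> Optional[str]:
    """Return the Arabic page URL found in the rendered HTML, or None."""
    parser = ArabicLinkParser()
    parser.feed(html)
    return parser.ar_url
```

The function returns `None` when no such link is present, which is also what happens when the HTML came back incomplete.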
Two things are needed to make Selenium work. 1: point the Selenium driver at the location where Firefox is installed, which you can find with 'which firefox'.
Good source on using Selenium in Python :+1: http://thiagomarzagao.com/2013/11/12/webscraping-with-selenium-part-1/
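A rough sketch of pointing the driver at Firefox, assuming Selenium 4 and a geckodriver that Selenium can find (older Selenium versions configure the binary differently). `shutil.which` does the same job as 'which firefox'; the helper names are hypothetical.

```python
import shutil
from typing import Optional


def find_firefox_binary() -> Optional[str]:
    """Python equivalent of `which firefox`; None if Firefox is not on PATH."""
    return shutil.which("firefox")


def fetch_rendered_html(url: str) -> str:
    """Load the page in Firefox and return the HTML after scripts have run."""
    # Imported here so the helper above works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    binary = find_firefox_binary()
    if binary is None:
        raise RuntimeError("firefox not found on PATH")

    options = Options()
    options.binary_location = binary  # tell the driver where Firefox lives
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```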
This name does not return anything on wiki, but Google finds it directly: SIBGHATULLAH_MOJADEDI. The correct name should be Sibghatullah Mojaddedi, which means the English dictionary we already have might not be accurate.
Using the wiki URL directly is case sensitive: BABRAK_KARMAL does not return a record on wiki, but if you use the wiki URL to search for Babrak_Karmal it does return something.
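One way to work around the case sensitivity is to normalize the dictionary key before building the URL. This is a sketch only: `to_wiki_title` is a hypothetical helper, and capitalizing each part will be wrong for names whose real spelling has other capitalization.

```python
def to_wiki_title(dictionary_name: str) -> str:
    """Turn an all-caps dictionary key into a title-cased wiki page name."""
    parts = dictionary_name.split("_")
    return "_".join(part.capitalize() for part in parts)


print(to_wiki_title("BABRAK_KARMAL"))  # Babrak_Karmal
```

This only fixes the casing. A misspelled key such as SIBGHATULLAH_MOJADEDI still will not match after normalization.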
Think about making it run in parallel.
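Since the lookups mostly wait on the network, a thread pool is one possible way to parallelize them. A minimal sketch, where `lookup_all` and the `lookup` argument are hypothetical; note that a Selenium driver should not be shared between threads, so each worker would need its own.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def lookup_all(
    names: Iterable[str],
    lookup: Callable[[str], Optional[str]],
    max_workers: int = 8,
) -> Dict[str, Optional[str]]:
    """Run `lookup` for every name concurrently and map name -> result."""
    names = list(names)  # materialize so the names can be iterated twice
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input names.
        results = pool.map(lookup, names)
        return dict(zip(names, results))
```

If `lookup` raises for one name, the exception is re-raised when that result is read, so a real run would probably want a try/except inside `lookup`.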
Finished uploading the data to MongoDB; it took almost 6 hours. We have 18000+ records in the existing English dictionaries, but for most of them the corresponding Arabic name can't be found. I added some extra records in the form [['ar_actor'], [alternative_ar_names for the actor]], inserted into the db together with their multiple roles; altogether 5696 records were inserted.
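For illustration, turning one such entry into a document and inserting a batch could look like the sketch below. The field names, the database and collection names, and the connection URI are all assumptions, not the actual schema; `insert_many` is the standard pymongo call for batch inserts.

```python
from typing import Dict, List, Sequence


def build_record(entry: Sequence[Sequence[str]], roles: Sequence[str]) -> Dict[str, object]:
    """Convert [[ar_actor], [alternative names]] plus roles into one document."""
    (ar_actor,), alternatives = entry
    return {
        "ar_actor": ar_actor,                        # hypothetical field name
        "alternative_ar_names": list(alternatives),  # hypothetical field name
        "roles": list(roles),                        # hypothetical field name
    }


def insert_records(records: List[Dict[str, object]], uri: str = "mongodb://localhost:27017") -> int:
    """Insert all documents in one batch and return how many were written."""
    from pymongo import MongoClient  # imported here; requires pymongo

    client = MongoClient(uri)
    try:
        # "actors" and "ar_dictionary" are placeholder names.
        result = client["actors"]["ar_dictionary"].insert_many(records)
        return len(result.inserted_ids)
    finally:
        client.close()
```

Batching the documents into a single `insert_many` call, rather than inserting one at a time, may also help with the upload time.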