oudalab / fajita

Event Data Tagging Tool
MIT License

Translate English actor names to Arabic names, and utilize the actor dictionary that already exists in English #180

Open YanLiang1102 opened 7 years ago

YanLiang1102 commented 7 years ago

Finished uploading the data to MongoDB; it took almost 6 hours. We have 18000+ records in the existing English dictionaries, but for most of them we can't find the corresponding Arabic name. I added some extra records of the form [[ar_actor],[alternative_ar_names for the actor]], with their multiple roles inserted into the db. Altogether 5696 records were inserted.

YanLiang1102 commented 7 years ago

Something to be aware of: we store all the actors that coders tagged, so we need to come up with an intelligent way to clean them up. For example, for this word: القوات [image] we need to choose which variant is the best one to keep.

YanLiang1102 commented 7 years ago

This can be done using wiki: if the old service can find the name, let the old service handle it; if the old service can't find it, fall back to wiki to find the name. That way we can fully utilize our existing English dictionary.
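A minimal sketch of that fallback order, assuming each lookup is a function that takes an English name and returns an Arabic name or None. `lookup_old_service` and `lookup_wiki` are hypothetical stand-ins backed by tiny in-memory tables, not the project's actual service calls.

```python
from typing import Callable, Optional, Sequence

# A lookup takes an English actor name and returns an Arabic name, or None.
Lookup = Callable[[str], Optional[str]]


def translate_actor(name: str, lookups: Sequence[Lookup]) -> Optional[str]:
    """Try each lookup in order and return the first non-empty result."""
    for lookup in lookups:
        result = lookup(name)
        if result:
            return result
    return None


# Hypothetical stand-ins for the old service and the wiki lookup.
def lookup_old_service(name: str) -> Optional[str]:
    return {"Egypt": "مصر"}.get(name)


def lookup_wiki(name: str) -> Optional[str]:
    return {"Iraq": "العراق"}.get(name)


if __name__ == "__main__":
    for actor in ("Egypt", "Iraq", "Unknown"):
        print(actor, translate_actor(actor, [lookup_old_service, lookup_wiki]))
```

Keeping the lookups in an ordered list means another source could be appended later without touching the fallback logic.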

YanLiang1102 commented 7 years ago

We can't get the full response HTML using Python requests, so we can't see the ar_url for some names. This page should help: https://stackoverflow.com/questions/37969536/why-are-lis-not-showing-up-with-python-requests-response We need to use selenium to mimic a browser so that it returns everything.


Two things are needed to make selenium work:

  1. Point the selenium driver to where Firefox is installed; find it with 'which firefox'.
  2. Point the driver to geckodriver. Store it under '/usr/local/bin'; otherwise you need to export the path where you stored it.

Good source on using selenium in Python :+1: http://thiagomarzagao.com/2013/11/12/webscraping-with-selenium-part-1/
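A rough sketch of the two setup steps, assuming Selenium 4 and geckodriver stored at '/usr/local/bin/geckodriver'. The function names are made up for illustration, and running headless is an extra assumption that is not part of the notes above.

```python
import shutil
from typing import Optional

# Assumed location, following the note about '/usr/local/bin'.
GECKODRIVER_PATH = "/usr/local/bin/geckodriver"


def find_firefox() -> Optional[str]:
    """Python equivalent of `which firefox`: path to the binary, or None."""
    return shutil.which("firefox")


def fetch_rendered_html(url: str) -> str:
    """Load the page in Firefox and return the HTML after the browser renders it."""
    # Imported lazily so this module still loads where selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.firefox.service import Service

    options = Options()
    options.add_argument("-headless")  # assumption: no display on the server
    firefox = find_firefox()
    if firefox:
        options.binary_location = firefox  # step 1: point at the Firefox binary

    service = Service(executable_path=GECKODRIVER_PATH)  # step 2: geckodriver
    driver = webdriver.Firefox(service=service, options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

Unlike the body returned by requests, `driver.page_source` reflects the page after the browser has run its scripts, which is the reason for using selenium here.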

YanLiang1102 commented 7 years ago

This name does not return anything on wiki, but it does on Google directly: SIBGHATULLAH_MOJADEDI. The correct name should be Sibghatullah Mojaddedi, which means the English dictionary that we already have might not be accurate at all.

YanLiang1102 commented 7 years ago

Using the wiki URL directly is case sensitive. For example, BABRAK_KARMAL does not return a record on wiki, but if you use a wiki URL with Babrak_Karmal it will return something.
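One way to work around the case sensitivity is to normalize the dictionary key before building the URL. This is only a sketch: `to_wiki_title` and `wiki_url` are illustrative names, the English Wikipedia base URL is an assumption, and simple capitalization will be wrong for names with inner capitals (for example "McDonald").

```python
from urllib.parse import quote

# Assumed base URL; the thread only says "wiki url".
WIKI_BASE = "https://en.wikipedia.org/wiki/"


def to_wiki_title(dictionary_name: str) -> str:
    """Turn an upper-case dictionary key into a capitalized wiki title."""
    parts = [part for part in dictionary_name.split("_") if part]
    return "_".join(part.capitalize() for part in parts)


def wiki_url(dictionary_name: str) -> str:
    """Build a wiki page URL from a dictionary key."""
    return WIKI_BASE + quote(to_wiki_title(dictionary_name))


if __name__ == "__main__":
    print(wiki_url("BABRAK_KARMAL"))
```

This does not help with misspelled keys such as SIBGHATULLAH_MOJADEDI; those would still need a search-based lookup.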

YanLiang1102 commented 7 years ago

https://www.wikidata.org/wiki/Wikidata:Pywikibot_-_Python_3_Tutorial/Gathering_data_from_Arabic-Wikipedia
http://toolkit-python.readthedocs.io/references/api.html
https://github.com/anarchivist/worldcat
https://platform.worldcat.org/api-explorer/apis/worldcatidentities

YanLiang1102 commented 7 years ago

Think about making it run in parallel.
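A possible starting point, assuming the per-name lookup is mostly waiting on the network, so threads are enough. `translate_all` and the `lookup` argument are hypothetical; the lookup would be whatever function resolves one English name to an Arabic name.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def translate_all(
    names: Iterable[str],
    lookup: Callable[[str], Optional[str]],
    max_workers: int = 8,
) -> Dict[str, Optional[str]]:
    """Run `lookup` on every name using a pool of worker threads."""
    names = list(names)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input names.
        results = pool.map(lookup, names)
        return dict(zip(names, results))


if __name__ == "__main__":
    # Stand-in lookup for demonstration only.
    print(translate_all(["BABRAK_KARMAL", "EGYPT"], str.title, max_workers=2))
```

If a lookup raises, the exception surfaces when the results are collected, so a real run would probably want the lookup to catch errors and return None instead.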