softsky / people-match-ai

3 stars 0 forks source link

Persons profile looks like this:

+begin_src javascript

{ "_id": { "$oid": "58ea248524dce09f41209710" }, "searchResult": [ { "otherEmails": [], "country": "United States", "source": "PeopleData", "lastUpdated": "2016-09-26T16:54:28Z", "id": "AVdO_GcbPKMaPmPU7jEY", "state": "arizona", "probability": 1, "query": "MANUAL_REQUEST", "firstName": "John", "phone": "1-520-247-9050", "lastName": "Soukup", "stateAbbr": "AZ", "activity": null, "gender": "M", "city": "Tucson", "result": { "resultStatus": "OK", }, "term": "[1]", "mergedIdentities": [ { "source": "EX", "datetime": "2016-06-23T00:00:00Z", "lname": "Soukup", "class": "com.selerityfinancial.person.peopledata.dto.PeopleDataPerson", "email": "c.l.soukup@comcast.net", "fname": "John", "address": { "zip": "85749", "city": "Tucson", "streetAddress": "9352 E Vallarta Trl", "state": "AZ", "class": "com.selerityfinancial.person.peopledata.dto.PeopleDataAddress" }, "ip": [ "71.226.126.234" ], "phone": [ "1-520-247-9050" ], }, ]} ]}

+end_src

** Comments on Person Profile

System support over 240 million profiles of US citizens.

** Requirements

** Other requirements:

+begin_src plantuml :file ./Resources/SystemDeployment.png

package "Training" { node "Training Box" as TB { [Model To Train] [Special Purpose Algorithm] [GPU Rack] } }

cloud "Evaluation" { node "Evaluation Box #1" as EB1{ [Model] as EB1_M [Special Purpose Algorithm] as EB1_SPA [GPU Rack] as EB1_GPU } node "Evaluation Box #2" as EB2{ [Model] as EB2_M [Special Purpose Algorithm] as EB2_SPA [GPU Rack] as EB2_GPU } node "Evaluation Box #3" as EB3{ [Model] as EB3_M [Special Purpose Algorithm] as EB3_SPA [GPU Rack] as EB3_GPU } node "Evaluation Box #N" as EBN { [Model] as EB4_M [Special Purpose Algorithm] as EB4_SPA [GPU Rack] as EB4_GPU }

}

TB --> EB1 TB --> EB2 TB --> EB3 TB --> EBN

+end_src

+RESULTS:

[[file:./Resources/SystemDeployment.png]]

Basically system architecture will look like that:

** Training Box

Training Box performs following operations:

Training Box use Google Tensor Flow as AI

Old and new profiles form Train Corpus which is used by TensorFlow to create new Models

+begin_src plantuml :file ./Resources/Component.png

package "Master" { node "Spark" { [CSVImporter] - Importer:implements [JSONImporter] - Importer:implements [DBResultExporter] - Exporter:implements } node "Hadoop" {

 [in]"HDFS folder" ..> [InputData.csv]:contains

} Spark -> Hadoop:use }

+end_src

+RESULTS:

[[file:./Resources/Component.png]]

*** Deployment

+begin_src plantuml :file ./Resources/TrainingBoxDeployment.png

component [Train Corpus] as Corpus

node { component [TensorFlow] as TF component [Model To Train] as Model }

Corpus --> TF TF --> Model

+end_src

Operaional Sequence is shown here: *** Sequence

+begin_src plantuml :file ./Resources/TrainingSequence.png

Controller --> Crawler: Crawl Profiles Controller <-- Crawler: crawled Profiles Controller --> IdentityMatcher: Match identities and merge profiles Controller <-- Crawler: merged Profiles Controller --> Enhancer: Enhance Profile Controller <-- Enhancer: enhanced Profiles Controller --> DB: Update database with enhanced Profiles

Controller --> Train: Re-train DB with new profile corpus Controller --> Network: Distribute updated Model through Evalutaion nodess

+end_src

*** Identity Match Crawling is performed over multiple resources. We need the way to properly match identities and merge their profiles. We might use email or phone as unique intentifier, since name won't always work. Since some resources might not return unique identifier we use AI comparing fiels.

+begin_src plantuml :file ./Resources/IdentityMatchSequence.png

Controller --> IdentityMatcher: Sends unmatched Profiles for similarity check IdentityMatcher --> TensorFlow: performs field analyzis to determine similarity IdentityMatcher <-- TensorFlow: sends back result for each pair of Profiles IdentityMatcher --> ProfileMerger: Sends profile pairs to be merged ProfileMerger --> DB: updates DataBase with merged profiels

+end_src

Training box will be used most of the time to train all special purposes models using probably slightly modified Inception v3 alorigthm. Traning it from scratch is time consuming operation, however once all special purpose algos and models are trained it could be put down to save hosting cost and be running only once it's needed next time for next alorithm/model train. We will apparently have several purposes (so models and algos) depending on type of information consumers need to receive as the result of their searches.

** Evaluation Box Evaluation boxes will also be used all the time, they will serve large datasets searching for appropriate data according to consumer's search.

+begin_src plantuml :file ./Resources/EvaluationBoxDeployment.png

[Profiles] as Profiles [Model] as Model

node { component [TensorFlow] as TF [Controller] as Controller }

component [Evaluation Result] as Result

Profiles --> Controller Model --> Controller Controller --> TF TF --> Controller Controller --> Result

+end_src

** System Requirements

*** System hardware requirements Here are system software requirements

*** System Software Requirements Here are system software requirements

*** How to run From the project directory

+begin_src shell

docker-compose up docker exec -ti peoplematchai_master_1 bash spb run

+end_src