Open boazsender opened 6 years ago
I took a first pass at processing vgg_face_dataset to include one image per identity. I naively selected the first of each identity, and appended the output in this file, along with the identity of the subject: vgg_face_dataset_first_image_per_identity.txt
Does this work for you @joyab? If so, I'll use this to write a migration to load the image records in.
Second question, do we need to host the images, or do you want to leave them on their respective servers?
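For anyone following along, here is a minimal sketch of the "first image per identity" pass described above. It is not the script that produced the gist. It assumes each identity file is plain text with one image record per line and the URL as one whitespace-separated field; the names `first_url` and `first_image_per_identity` are hypothetical.

```python
from typing import Dict, Iterable, Optional


def first_url(lines: Iterable[str]) -> Optional[str]:
    """Return the first http(s) URL found in an identity file's lines."""
    for line in lines:
        for token in line.split():
            # Assumption: the URL is a standalone whitespace-separated field.
            if token.startswith(("http://", "https://")):
                return token
    return None


def first_image_per_identity(files: Dict[str, str]) -> Dict[str, str]:
    """Map each identity name to the first URL in its file contents.

    `files` maps an identity name to the raw text of its data file.
    Identities whose file contains no URL are left out of the result.
    """
    picked = {}
    for identity, contents in files.items():
        url = first_url(contents.splitlines())
        if url is not None:
            picked[identity] = url
    return picked
```

Taking the first line is naive on purpose: it does not check whether the URL still resolves.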
The first pass looks good. Please proceed with the migration. Let's rehost.
All my best, Joy
On Mon, Jan 8, 2018 at 2:01 AM, Boaz notifications@github.com wrote:
I took a first pass at processing vgg_face_dataset to include one image per identity. I naively selected the first of each identity, and appended the output in this file: vgg_face_dataset_first_image_per_identity.txt https://gist.github.com/boazsender/6628205d677685078b7ca8fdfe6e3040
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/bocoup/ajl.ai/issues/252#issuecomment-355902310, or mute the thread https://github.com/notifications/unsubscribe-auth/ANjGUcrjWo7Pxc8y-aatCwZS_wEVWbyNks5tIctegaJpZM4QsRtH .
-- -- Joy Buolamwini MIT Media Lab - Graduate Researcher Founder - Algorithmic Justice League www.ajlunited.org
Portfolio: www.poetofcode.com Blog : www.medium.com/@Joy.Buolamwini Twitter: @jovialjoy http://www.twitter.com/jovialjoy @AJLUnited http://www.twitter.com/ajlunited Profile: Linkedin http://www.linkedin.com/in/buolamwini Skype: jbuolamwini
Of the 2,622 first images per identity, 481 appear to be dead URLs.
I'm going to try to write something next that crawls through the vggface data files (one file per identity, with n image URLs per file) until it gets a good image for each identity, so that at least we can get a baseline of 2,622 actual faces. Will follow up here.
In the meantime, I'm documenting the processing and scraping scripts that I'm making in gists:
Finished a script that keeps trying subsequent images if it hits a dead link:
We'll use this to create our vgg set.
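A rough sketch of the "keep trying subsequent images" idea, for reference. This is not the script mentioned above, just one way it could look in Python using only the standard library. The names `fetch` and `first_live_image`, and the 10 second timeout, are assumptions for illustration.

```python
import http.client
import urllib.request
from typing import Callable, Iterable, Optional, Tuple


def fetch(url: str, timeout: float = 10.0) -> Optional[bytes]:
    """Download a URL, returning None on a network, HTTP, or URL error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses; ValueError covers
        # malformed URLs.
        return None


def first_live_image(
    urls: Iterable[str],
    download: Callable[[str], Optional[bytes]] = fetch,
) -> Optional[Tuple[str, bytes]]:
    """Walk an identity's URLs in order and return the first that yields data.

    Returns (url, data), or None if every URL is dead or returns no bytes.
    """
    for url in urls:
        data = download(url)
        if data:  # skips both failed downloads (None) and empty bodies
            return url, data
    return None
```

Passing `download` as a parameter makes it possible to dry-run the fallback logic against a fake fetcher before hitting the real servers.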
I ran this overnight and got down to under 1% loss. With 20 images missing, I can make incremental improvements to test the files, or we can just replace them manually.
20 we can handle manually.
On Wed, Jan 10, 2018 at 8:12 AM, Boaz notifications@github.com wrote:
I ran this overnight and got down to under 1% loss. With 20 images missing, I can make incremental improvements to test the files, or we can just replace them manually.
I ended up rewriting the script to ignore 0-byte files as well as files that had data but were not images.
I also added logic to correctly identify the file format and give each image its proper extension, which led me to find that vgg_faces has animated GIFs in it 0_o.
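One way to do this kind of check is to look at the leading "magic bytes" of each download rather than trusting the URL's extension. The sketch below is an assumption about the approach, not the rewritten script itself. It only covers JPEG, PNG, and GIF, and `sniff_image_extension` is a hypothetical name.

```python
from typing import Optional

# Standard file signatures for the three formats handled here.
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
GIF_MAGICS = (b"GIF87a", b"GIF89a")


def sniff_image_extension(data: bytes) -> Optional[str]:
    """Return the extension matching the data's signature, or None.

    None covers both 0-byte downloads and non-image payloads such as an
    HTML error page served with a 200 status.
    """
    if data.startswith(JPEG_MAGIC):
        return ".jpg"
    if data.startswith(PNG_MAGIC):
        return ".png"
    if data.startswith(GIF_MAGICS):
        return ".gif"
    return None
```

A signature match only says the file claims to be that format. It does not prove the image decodes, and it does not distinguish animated GIFs from static ones.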
Import one image for each of the 2,622 identities from the full dataset, which contains over 900,000 images: http://www.robots.ox.ac.uk/~vgg/data/vgg_face
Prioritize the VGG faces over the IMDB wiki uniques. Here is a script for downloading the assets: https://bleuren.me/63/vgg-face-dataset/