mitmedialab / ajl.ai

A web application for crowdsourcing image annotations.
GNU Affero General Public License v3.0

Import VGG Faces #252

Open boazsender opened 6 years ago

boazsender commented 6 years ago

Import one image for each of the 2,622 identities in the VGG Face dataset, which contains over 900,000 images in total: http://www.robots.ox.ac.uk/~vgg/data/vgg_face

Prioritize the VGG faces over the IMDB wiki uniques. Here is a script for downloading the assets: https://bleuren.me/63/vgg-face-dataset/

boazsender commented 6 years ago

I took a first pass at processing vgg_face_dataset to include one image per identity. I naively selected the first of each identity, and appended the output in this file, along with the identity of the subject: vgg_face_dataset_first_image_per_identity.txt

Does this work for you @joyab? If so, I'll use this to write a migration to load the image records in.

Second question, do we need to host the images, or do you want to leave them on their respective servers?
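For readers following along, the "first image per identity" pass described above can be sketched roughly as follows. This is a minimal sketch, not the script that produced the gist: it assumes the dataset is unpacked as one `<identity>.txt` file per subject with one image record per line, and the names `first_image_per_identity` and `write_manifest` are hypothetical.

```python
from pathlib import Path
from typing import Iterator, Tuple


def first_image_per_identity(files_dir: Path) -> Iterator[Tuple[str, str]]:
    """Yield (identity, first_record) for every identity file in files_dir.

    Assumption: each file is named <identity>.txt and holds one image
    record per line, with the image URL somewhere in that record.
    """
    for path in sorted(files_dir.glob("*.txt")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    # Naive choice: the first non-blank line wins.
                    yield path.stem, line.strip()
                    break


def write_manifest(files_dir: Path, out_path: Path) -> int:
    """Write one 'record identity' row per identity; return the row count.

    Keep out_path outside files_dir so the manifest is not picked up
    as an identity file.
    """
    count = 0
    with out_path.open("w", encoding="utf-8") as out:
        for identity, record in first_image_per_identity(files_dir):
            out.write(f"{record} {identity}\n")
            count += 1
    return count
```

Taking the first record is the cheapest possible selection rule; it does nothing to check that the URL still resolves.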

joyab commented 6 years ago

The first pass looks good. Please proceed with the migration. Let's rehost.

All my best, Joy

(Sent by email in reply to Boaz's comment of Mon, Jan 8, 2018, 2:01 AM.)

-- -- Joy Buolamwini MIT Media Lab - Graduate Researcher Founder - Algorithmic Justice League www.ajlunited.org

Portfolio: www.poetofcode.com Blog : www.medium.com/@Joy.Buolamwini Twitter: @jovialjoy http://www.twitter.com/jovialjoy @AJLUnited http://www.twitter.com/ajlunited Profile: Linkedin http://www.linkedin.com/in/buolamwini Skype: jbuolamwini

boazsender commented 6 years ago

Of the 2,622 first images per identity, 481 (about 18%) appear to be dead URLs.

I'm going to try to write something next that crawls through the vggface data files (one file per identity with n number of image urls per file) until it gets a good image for each identity, so that at least we can get a baseline of 2,622 actual faces. Will follow up here.

In the meantime, I'm documenting the processing and scraping scripts that I'm making in gists:

boazsender commented 6 years ago

Finished a script that keeps trying subsequent images if it hits a dead link:

https://gist.github.com/boazsender/6628205d677685078b7ca8fdfe6e3040#file-vgg_process_files_and_scrape-sh

We'll use this to create our vgg set.
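The retry idea can be sketched in Python as below. This is an illustrative sketch rather than a port of the linked shell script: `fetch_url` and `first_live_image` are hypothetical names, and treating an empty response body as a dead link is an assumption made here.

```python
import http.client
import urllib.request
from typing import Callable, Iterable, Optional, Tuple


def fetch_url(url: str, timeout: float = 10.0) -> Optional[bytes]:
    """Return the response body, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return None


def first_live_image(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[bytes]] = fetch_url,
) -> Optional[Tuple[str, bytes]]:
    """Try each URL in order and return the first (url, body) that works.

    Returns None when every URL for the identity is dead.
    """
    for url in urls:
        data = fetch(url)
        # Assumption: None (failed request) and b"" (empty body) both
        # count as a dead link.
        if data:
            return url, data
    return None
```

Passing `fetch` as a parameter keeps the fallback loop testable without network access, for example by supplying a function that looks URLs up in a dictionary.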

boazsender commented 6 years ago

I ran this overnight and got the loss under 1%. With 20 images missing, I can make incremental improvements to test the files, or we can just replace them manually.

ajlunited commented 6 years ago

20 we can handle manually.

(Sent by email in reply to Boaz's comment of Wed, Jan 10, 2018, 8:12 AM.)


boazsender commented 6 years ago

I ended up rewriting the script to ignore 0 byte files as well as files that had data but were not images.

I also added logic to correctly identify the file format and give each image its proper extension, which led me to find that vgg_faces has animated GIFs in it 0_o.
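One common way to do this kind of check is to look at the leading "magic" bytes of the download instead of trusting the URL's extension. The sketch below is an assumption about the approach, not the rewritten script itself: `sniff_image_extension` is a hypothetical name, only JPEG, PNG, and GIF are recognized, and it does not tell animated GIFs apart from still ones.

```python
from typing import Optional

# Leading byte signatures for the formats this sketch recognizes.
_SIGNATURES = (
    (b"\xff\xd8\xff", "jpg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
)


def sniff_image_extension(data: bytes) -> Optional[str]:
    """Return the extension matching the file's signature, or None.

    None covers both rejected cases: a 0-byte download and a non-empty
    body that is not a recognized image (for example an HTML error page).
    """
    if not data:
        return None
    for magic, extension in _SIGNATURES:
        if data.startswith(magic):
            return extension
    return None
```

A download would then be kept only when the function returns an extension, and saved under that extension regardless of what the original URL ended in.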