unitedstates / images

Public domain photos of Members of the United States Congress
https://theunitedstates.io/images/
Creative Commons Zero v1.0 Universal

Do the initial loading of images from GPO #1

Closed konklone closed 10 years ago

konklone commented 10 years ago

The congress directory is empty!

konklone commented 10 years ago

Specifically, I'm proposing 4 directories:

I moved the resize_photos.sh script from congress-legislators into this repo, which does this really well if you run it from a directory with a bunch of photos in it. It can be adapted pretty easily to work for this.

It'd be nice if the main Python script kicked this off (or did it itself) after checking for new photos, so that people don't have to remember to run two scripts to keep the photos up to date.

hugovk commented 10 years ago

Originals added in https://github.com/unitedstates/images/commit/d293cac8db12ba43540d626987680ec44f518cbb

For example: http://theunitedstates.io/images/congress/originals/A000055.jpg

Resized to follow.

hugovk commented 10 years ago

Oh, just spotted the directory is called originals and you proposed original. Shall I rename it?

konklone commented 10 years ago

Nice! :) Er, sure, if it's not much trouble, might as well rename it, since the other dirs are adjectives and not plural nouns.

konklone commented 10 years ago

All right, I've beefed up the docs, and updated the script to git clone the YAML rather than download it through the raw.github.com URL.

I've also updated our old contribution page to link here, and added a section to the README on non-GPO contributions (which I hope are rare).

Before we update our Congress API docs to link to this instead of our old zips, I'd like to make sure that we have original, 200x250, 100x125, and 40x50 versions for each MoC, and we retain the same breadth of photos.

We currently have photos in our Sunlight collection for members back to at least 2007 (the 110th Congress). So we should collect photos for the 112th, 111th, and 110th first. Presumably, the cache means it will only bother to download photos for members we don't already have.

konklone commented 10 years ago

Also - if the originals are consistently in the 500+x700+ range, I'd be up for revisiting our supported sizes, and putting some burden on clients to downsize where needed in-browser or in-app.

For example, maybe just:

GPHemsley commented 10 years ago

Why are the directories the sizes and not the bioguide IDs?

It would seem to me that this would be more useful:

http://theunitedstates.io/images/congress/A000055/original.jpg
http://theunitedstates.io/images/congress/A000055/400x500.jpg
http://theunitedstates.io/images/congress/A000055/100x125.jpg

Unless of course you're worried about having descriptive filenames...

GPHemsley commented 10 years ago

Also, since the original images are not always guaranteed to be the same size, maybe the size should only refer to a single axis (e.g. width) so that you're not distorting (or misrepresenting) the aspect ratio of the image?

hugovk commented 10 years ago

https://github.com/unitedstates/images/commit/aa5b395a1e5f8b1c039d2e0512868552150b9f16 kicks off resizing from the script and adds the resized photos, using the initial 200x250, 100x125, and 40x50 sizes for now.

hugovk commented 10 years ago

@GPHemsley:

About directories, I think original/A000055.jpg is better than A000055/original.jpg. For one, the downloaded image has a more meaningful and unique name -- I can check the size from the image itself, but not the ID. If I download lots, I don't need to rename hundreds of original.jpg files as I go along. The other image APIs I can think of off the top of my head use size/id.jpg: Flickr (who actually use size/id_size.jpg) and Last.fm. Also it's a bit easier to implement this way :)

About sizes, Flickr only refers to the longest size (unless they're square sizes). I think Last.fm do something similar, and both have labels for image sizes.

Anyway, the ImageMagick resize command is using the ^ fill area flag along with -extent to remove excess padding from the resized image:

convert "$BASEDIR/original/$f" -resize "${SIZE}^" -gravity center -extent "$SIZE" "$BASEDIR/$SIZE/$f"
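
To make the geometry of that command concrete, here is a minimal Python sketch of the arithmetic behind a fill-resize followed by a centered crop: scale until the image covers the target box, then trim the overflow evenly. The function name `fill_and_crop` is made up for illustration, and the rounding is an assumption, so ImageMagick's own result may differ by a pixel.

```python
def fill_and_crop(width, height, target_w, target_h):
    """Return (resized_w, resized_h, crop_x, crop_y) for a fill-then-center-crop.

    Mirrors the idea of `-resize WxH^ -gravity center -extent WxH`;
    the rounding used here is an assumption, not ImageMagick's exact rule.
    """
    # The ^ flag picks the larger scale factor so the image covers the whole box.
    scale = max(target_w / width, target_h / height)
    resized_w = max(target_w, round(width * scale))
    resized_h = max(target_h, round(height * scale))
    # A centered extent trims the overflow equally from both sides.
    crop_x = (resized_w - target_w) // 2
    crop_y = (resized_h - target_h) // 2
    return resized_w, resized_h, crop_x, crop_y


# A 675x825 original cut down to the 200x250 size.
print(fill_and_crop(675, 825, 200, 250))
```

With a 200x250 target, a 675x825 original only loses a couple of pixel columns on each side, which is why the crop is barely noticeable in practice.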

hugovk commented 10 years ago

@konklone:

About sizes, the originals are currently sized:

So most are indeed 500+x700+. It's not terribly hard to resize images on our side, so I'm for having a good selection. Let me know if any sizes should be added/removed.

About going back to 2007: memberguide goes back to the 110th and we have a switch to choose the number. Already downloaded images won't be overwritten. Cached member pages won't be replaced or reloaded from the web, but different congress sessions have different pages and even photos:

So for example, we'll download the 113 page as cache/113_RP_Aderholt. We'll later download the 110 page as cache/110_RP_Aderholt because we don't know if it's the same Aderholt. But we won't replace 113's A000055.jpg with one from 110. So we should download in reverse: 113, 112, 111, 110.
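
The "download in reverse and never overwrite" rule can be sketched roughly as below. This is only an illustration: the `photos_by_congress` mapping, the `pick_photos` name, and the file names are hypothetical stand-ins for whatever the scraper actually collects from the member pages.

```python
def pick_photos(photos_by_congress, congresses=(113, 112, 111, 110)):
    """Choose one photo per bioguide ID, preferring the most recent congress.

    `photos_by_congress` maps a congress number to {bioguide_id: photo_name};
    this shape is hypothetical and stands in for the scraped member pages.
    """
    chosen = {}
    for congress in congresses:  # newest first
        for bioguide_id, photo in photos_by_congress.get(congress, {}).items():
            # Never overwrite: a photo from a newer congress wins.
            chosen.setdefault(bioguide_id, (congress, photo))
    return chosen


# X000001 is a made-up ID for a member who only appears in the 110th.
sample = {
    110: {"A000055": "photo-110.jpg", "X000001": "photo-110-only.jpg"},
    113: {"A000055": "photo-113.jpg"},
}
print(pick_photos(sample))
```

Because the loop starts at 113, A000055 keeps its 113th-Congress photo, while the member who only shows up in the 110th still gets one.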

Currently we're matching against legislators-current.yaml which of course only has the current members. Some members are still there from 110, but some will have left. We could use legislators-historical.yaml but don't want to mis-resolve to 18th century people. Sorting first in reverse would match with better ones first, but could still mismatch down the long list. The congress session number isn't in the YAML, but perhaps we could resolve against some dates in the terms? Or perhaps resolve other data from member pages against the YAML. I'm not sure of the future benefit of doing this so I think I'll just manually, temporarily crop legislators-historical.yaml so it only goes back to 2007 (and reverse sort the list first).
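
As an alternative to cropping the file by hand, filtering on term dates could look something like this sketch. It assumes each record has an `id.bioguide` key and a `terms` list with ISO-formatted `end` dates, as in the legislators YAML; the sample records, the X-prefixed IDs, the dates, and the exact cutoff are made up for illustration.

```python
from datetime import date


def recent_legislators(legislators, cutoff=date(2007, 1, 1)):
    """Keep legislators whose latest term ends on or after `cutoff`, newest first."""

    def last_term_end(legislator):
        return max(date.fromisoformat(term["end"]) for term in legislator["terms"])

    recent = [leg for leg in legislators if last_term_end(leg) >= cutoff]
    return sorted(recent, key=last_term_end, reverse=True)


# Made-up records in the assumed shape; the dates are placeholders.
sample = [
    {"id": {"bioguide": "X000001"}, "terms": [{"start": "1789-03-04", "end": "1791-03-03"}]},
    {"id": {"bioguide": "A000055"}, "terms": [{"start": "2013-01-03", "end": "2015-01-03"}]},
    {"id": {"bioguide": "X000002"}, "terms": [{"start": "2007-01-04", "end": "2009-01-03"}]},
]
print([leg["id"]["bioguide"] for leg in recent_legislators(sample)])
```

The 18th-century record drops out, and the remaining ones are ordered so that the most recent members are matched first.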

konklone commented 10 years ago

@hugovk, looking at your commits, it might be worth reporting the mistakes you found (Curzon, at least) to GPO to see if they can fix it.

@GPHemsley, the dir structure is modeled on how Sunlight has managed our photo archive in the past, which included offering zips of individual directories (all the 200x250s, for example) and counting on each photo being easy to tie to an individual MoC. So the URLs are optimized more for files than for an ideal web service, but I think that's fine for what this is.

For resizing, we can use imagemagick to ensure a consistent width and height without distorting the aspect ratio (by cropping a bit where needed). That's been what we've done so far and it's never looked bad.

Also: this is coming together SO WELL! :smile_cat:

konklone commented 10 years ago

Looking at the image stats you compiled, @hugovk, it looks like 400x500 might be the best new size to promise? We can promise it for all but 40 of that batch. Then 200x250 covers 39 of those.

It's much easier to ask people to have their browser/app/etc auto-scale an image down than to auto-scale it up. So, proposed sizes:

And if anyone wants smaller, they can auto-scale (or batch process them on their own). Any objections? /cc @dcloud @drinks @jcarbaugh

GPHemsley commented 10 years ago

Given that the majority of the images are 675x825, or an aspect ratio of 9:11, I would think you'd promise a size that was actually in that aspect ratio.

This would mean something like 400x489 or 409x500 or, most likely, 450x550.

GPHemsley commented 10 years ago

And, actually, a lot of the other images are also in a 9:11 ratio with smaller sizes.

So I'd propose these sizes:

konklone commented 10 years ago

Good point, we don't have to lose data just to keep with the sizes of yore. I'd propose 100x120 for that lower tier (to keep round-ish numbers), and then not bother distributing 45x55s.

GPHemsley commented 10 years ago

And those images that aren't in a 9:11 ratio are in a 33:37 ratio (or roughly 3:4), so they would prefer the following sizes:

The full data table is here:

Multipliers: 9:11 H = Width x 1.2222222222; 9:11 W = Height x 0.8181818182; 33:37 H = Width x 1.1212121212; 33:37 W = Height x 0.8918918919

Count  Width  Height   W/H   H/W  |  Width  9:11 H  Height  9:11 W     |  Width  33:37 H  Height  33:37 W
    1    504     617  0.82  1.22  |    504     616     617     505     |    504      565     617      550
   94    452     553  0.82  1.22  |    452     552     553     452     |    452      507     553      493
  364    675     825  0.82  1.22  |    675     825     825     675  *  |    675      757     825      736
    1   1009    1233  0.82  1.22  |   1009    1233    1233    1009     |   1009     1131    1233     1100
   39    230     281  0.82  1.22  |    230     281     281     230     |    230      258     281      251
    2    589     719  0.82  1.22  |    589     720     719     588     |    589      660     719      641
   13    675     757  0.89  1.12  |    675     825     757     619     |    675      757     757      675
    4    495     555  0.89  1.12  |    495     605     555     454     |    495      555     555      495  *
    1    198     222  0.89  1.12  |    198     242     222     182     |    198      222     222      198  *
   18    736     825  0.89  1.12  |    736     900     825     675     |    736      825     825      736

You'll note that the stars indicate images that are already in an idealized aspect ratio (i.e. one without rounding): in particular, 675x825 (9:11), 495x555 (33:37), and 198x222 (33:37).

GPHemsley commented 10 years ago

I would recommend against attempting to use "roundish" numbers, as that can significantly change the aspect ratio, particularly at smaller sizes. For example, 100x120 is in the 5:6 aspect ratio; the desired 9:11 height for a 100px width is roughly 122px and the desired 9:11 width for a 120px height is roughly 98px.
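
The arithmetic behind those numbers is easy to check with a few lines of Python. This is just a worked example; the helper names are invented for illustration.

```python
from fractions import Fraction

RATIO = Fraction(9, 11)  # width : height of the most common originals (675x825)


def height_for_width(width, ratio=RATIO):
    """Height that keeps `ratio` for a given width."""
    return float(width / ratio)


def width_for_height(height, ratio=RATIO):
    """Width that keeps `ratio` for a given height."""
    return float(height * ratio)


print(round(height_for_width(100), 1))  # 122.2
print(round(width_for_height(120), 1))  # 98.2
print(Fraction(100, 120))               # 5/6
```

So a 100x120 image sits at 5:6, a little over 2px off in height from what 9:11 would call for at that width.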

konklone commented 10 years ago

Damn, thanks for bringing some rigor to it. And you're right that the aspect ratio should stay consistent between all the sizes, and that it's nice to stay close to the original aspect ratio.

But I do think there's a value in round numbers here, and these aren't photographs of detailed scenery -- cropping out some flag or solid-color background fluff around the rim loses no meaningful information, and imagemagick makes it trivial to do. I'm more concerned about ensuring that people making apps see this as trivially easy to integrate.

So altogether I'm still inclined to go back to the original proposal:

It's quick to understand, it's easy to fit it into one's existing layout -- in other words, it uses a simpler aspect ratio with more round common denominators people can auto-scale them down to if necessary. That's more important to me than preserving original aspect ratio.

GPHemsley commented 10 years ago

Any competent image resizing application can maintain aspect ratio—there's no need to worry about having more round common denominators, IMO.

Given that the majority of the images are in a single aspect ratio, and the rest are in a fairly similar one, I'd recommend the following sizes:

That way, they're all at least divisible by 5, which makes them round enough. And then you can crop the rest based on their width (removing whatever is left over on the top and bottom).

GPHemsley commented 10 years ago

By which I mean: resize to the highest standard width that is less than the original width, and then crop off the top and bottom to match the standard height ((original height - standard height) / 2).

GPHemsley commented 10 years ago

OK, so, I did more thinking on this.

Since we want relatively "round" numbers, I determined all the possible sizes that we could have, using multiples of 5:

Multiplier  Width  Height
         5     45      55
        10     90     110
        15    135     165
        20    180     220
        25    225     275
        30    270     330
        35    315     385
        40    360     440
        45    405     495
        50    450     550
        55    495     605
        60    540     660
        65    585     715
        70    630     770
        75    675     825
        80    720     880
        85    765     935
        90    810     990
        95    855    1045
       100    900    1100
       105    945    1155
       110    990    1210
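
For reference, the same list can be regenerated with a short Python sketch, assuming the base unit is 9x11 and the multiplier steps by 5 from 5 to 110 as in the table.

```python
BASE_W, BASE_H = 9, 11  # the 9:11 aspect ratio

# (multiplier, width, height) for every multiple of 5 from 5 through 110.
sizes = [(m, BASE_W * m, BASE_H * m) for m in range(5, 115, 5)]

for multiplier, width, height in sizes:
    print(multiplier, width, height)
```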

Then I determined which of these sizes we can get out of each original image size:

   W     H  original      55     165     220     275     330     385     440     495     550     605     660     715     770     825     880     935     990    1045    1100    1155    1210
1009  1233         1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1
 736   825        18      18      18      18      18      18      18      18      18      18      18      18      18      18      18       0       0       0       0       0       0       0
 675   825       364     364     364     364     364     364     364     364     364     364     364     364     364     364     364       0       0       0       0       0       0       0
 675   757        13      13      13      13      13      13      13      13      13      13      13      13      13       0       0       0       0       0       0       0       0       0
 589   719         2       2       2       2       2       2       2       2       2       2       2       2       2       0       0       0       0       0       0       0       0       0
 504   617         1       1       1       1       1       1       1       1       1       1       1       0       0       0       0       0       0       0       0       0       0       0
 495   555         4       4       4       4       4       4       4       4       4       4       0       0       0       0       0       0       0       0       0       0       0       0
 452   553        94      94      94      94      94      94      94      94      94      94       0       0       0       0       0       0       0       0       0       0       0       0
 230   281        39      39      39      39      39       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
 198   222         1       1       1       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0

     Total       537     537     537     537     536     497     497     497     497     497     399     398     398     383     383       1       1       1       1       1       1       1
   Percent    100.0%  100.0%  100.0%  100.0%   99.8%   92.6%   92.6%   92.6%   92.6%   92.6%   74.3%   74.1%   74.1%   71.3%   71.3%    0.2%    0.2%    0.2%    0.2%    0.2%    0.2%    0.2%
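
The totals can be reproduced from the original size counts with a small Python sketch. It assumes, as the table appears to, that an original "covers" a target size whenever it is at least as tall as the target height; the `coverage` helper is just an illustrative name.

```python
ORIGINALS = {  # (width, height): count, copied from the table of originals
    (1009, 1233): 1, (736, 825): 18, (675, 825): 364, (675, 757): 13,
    (589, 719): 2, (504, 617): 1, (495, 555): 4, (452, 553): 94,
    (230, 281): 39, (198, 222): 1,
}


def coverage(target_h, originals=ORIGINALS):
    """Count originals at least `target_h` pixels tall, and their percentage share."""
    total = sum(originals.values())
    covered = sum(n for (_w, h), n in originals.items() if h >= target_h)
    return covered, round(100 * covered / total, 1)


for target_h in (275, 550, 825):
    print(target_h, coverage(target_h))
```

Swapping in a different dictionary of originals makes it easy to re-run the numbers once photos from earlier congresses are added.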

According to these calculations, these are the largest sizes we can use that will cover a certain percentage of images:

Given that 71.3% of images can be covered by the original size of two thirds of the images, I recommend we offer that size by default. Then, we can offer a bunch of sizes that apply to most of the images.

As such, I propose we offer the following sizes:

In general, they line up with what has already been proposed. But now they're backed by data!

hugovk commented 10 years ago

@konklone, I've emailed GPO to ask them to correct Byrne Bradley and David Curzon.

konklone commented 10 years ago

@hugovk, thank you!

Does anyone else think 450x550 and 225x275 are better choices? I'm open to it if I'm the only one who thinks round numbers are important.

The reason I think round numbers are important is partly for automatic downscaling, so that someone can still fit these images into their 150px-wide space through hotlinking and have the dimensions stay integers, without having to run the images through a batch process and host them everywhere. For example, in our Congress app for Android, we take the 200x250 version and stretch it up or down to fit the density of the screen. And then partly it's cognitive aesthetics - people can more quickly grasp what we offer, how wide things are, and don't have to think hard about it.

I hate taking so much time to talk about this! But since we're generating permalinks, it's hard to take them back. Anyone have an opinion to help settle this?

konklone commented 10 years ago

Either way, I think we shouldn't promise a size that only ~70% of the photos have. ~90% is more reasonable, so I'd put 450x550 as the upper limit. And in your list, the 180-width one is close enough to the 225-width one that I don't think it's worthwhile.

OK, so someone, anyone, ring in with a preference for:

JoshData commented 10 years ago

So I'm already rescaling myself from the original, and I've totally lost track of the data here.... Which of the pairs (400x500 vs 450x550; 200x250 vs 225x275) is the closest to the most common aspect ratio? (i.e. minimize data loss during resizing)

konklone commented 10 years ago

The bigger ones (450x550 and 225x275) minimize data loss and are closest to the most common aspect ratio.

JoshData commented 10 years ago

So I'd say go with that, but take that as only a +0.1 since I probably won't be using the scaled versions in the near future anyway. (Also, resolutions keep getting higher so 100px is probably not going to be very useful for much longer.)

konklone commented 10 years ago

All right, then barring objection let's go with the bigger sizes.

GPHemsley commented 10 years ago

@konklone You don't seem to think that ~70% is a high number, and I'm not sure why. It seems to be the standard size going forward, and it's already the same as the "original" size for most of the ones that are currently available. If we restrict our sizes to the least common denominator, we're going to lose a lot of information for no reason, I think.

But perhaps I didn't mention that clearly before: I chose 675x825 specifically because it's already the original size for a good majority of the images, and it also captures an intermediate size for the handful of images whose original size is larger than that. It seems to me that newer images will be posted, at a minimum, in this size (though I'm just guessing based on the number of images that already have this size).

GPHemsley commented 10 years ago

@konklone @JoshData Also, the larger sizes (450x550 and 225x275) are not just close to being in the most common aspect ratio, they are in the most common aspect ratio. And I don't think it's too much to ask for developers to deal with numbers divisible by 5; it's not just any old arbitrary number. In fact, aside from the smaller one which you've rejected because of its closeness to another (though I included it because it's the largest size that fits all images), all of the sizes in question are also divisible by 25, which is still super round.

JoshData commented 10 years ago

Okay.... I think everyone's happy with the 450 and 225 sizes, and since no one is actually coming up with a use case/need for 675x825, I don't think there's any reason to keep talking about it, @GPHemsley.

hugovk commented 10 years ago

There may be a use case for a small thumbnail and possibly a square avatar size, but we can add smaller sizes as and when a need arises.

konklone commented 10 years ago

Definitely. And thank you once again for the follow-through, @hugovk, I see the images and docs have all been updated. :)

http://theunitedstates.io/images/congress/450x550/L000551.jpg

I'm ready to start pointing people here and telling them to use the URLs.
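
For anyone wiring these URLs into an app, a minimal URL builder might look like the sketch below. The `photo_url` name is made up, and the size list is an assumption based on the sizes settled on in this thread (original, 450x550, 225x275); it is not an official client.

```python
BASE_URL = "http://theunitedstates.io/images/congress"
SIZES = ("original", "450x550", "225x275")  # assumed from the discussion above


def photo_url(bioguide_id, size="450x550"):
    """Build a size/id.jpg photo URL for a member's bioguide ID."""
    if size not in SIZES:
        raise ValueError(f"unsupported size: {size}")
    return f"{BASE_URL}/{size}/{bioguide_id}.jpg"


print(photo_url("L000551"))
```

Keeping the size as the directory and the bioguide ID as the file name means a downloaded file is already named after the member it belongs to.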