vipulnaik / donations

Donations list website (DLW): a repository for keeping track of public donations by some people I (arbitrarily) decide to track
https://donations.vipulnaik.com
Creative Commons Zero v1.0 Universal

Add similarity table for donees #12

Open riceissa opened 6 years ago

riceissa commented 6 years ago

Currently, only donors have a similarity table.

Example of where this is needed: https://donations.vipulnaik.com/donee.php?donee=Center+for+Applied+Rationality (just an arbitrary donee)

riceissa commented 6 years ago

This is in progress at https://github.com/vipulnaik/donations/blob/master/similarity/similarity-donee.sql

That SQL file ran for about four hours on my laptop, at which point I killed the process: it showed no sign of finishing and was using up disk space. So I don't know how long it would have taken to run in total, or how much disk space it would have required.

I think it was getting stuck at the create table donee_similarity_sim_pre step.

What can we do? Some ideas:

riceissa commented 6 years ago

Progress so far: https://github.com/vipulnaik/donations/blob/49f770c8bd7f70bae595834cb1ba29468299bdaf/similarity/similarity-incremental-donee.sql

This turns out not to be analogous to the incremental donor file: that file assumes all the tables already exist, whereas if we're building everything up incrementally, we have to create the tables ourselves on the first run. A consequence is that after running the file for the first donee, two_donees_one_donor contains only that donee paired with itself for each donor, rather than that donee paired against all other donees for each donor, so donee_sim has just one row rather than [number of donees] rows. I think this also means that as we run the file more and more, the workload per run goes up.

riceissa commented 6 years ago

Current progress (on the "run separate SQL commands for one donee at a time, rather than doing it as one giant query, and if that works, then running through all donees in a loop" option from above):

At the moment it takes about 2 minutes per call of updateSimilarityDonee; I'm not sure how the run time changes as we call the procedure more times. Assuming the 2 minutes doesn't change, since we have about 18,000 donees, that is 2*18000/60 = 600 hours, which explains why it didn't finish after 4 hours last time.
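As a quick sanity check on the estimate above (the numbers come from the comment, not from fresh measurements):

```python
# Back-of-envelope runtime estimate, assuming per-call time stays constant.
minutes_per_call = 2    # observed time per updateSimilarityDonee call
num_donees = 18000      # approximate number of donees at the time
total_hours = minutes_per_call * num_donees / 60
print(total_hours)  # 600.0
```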

riceissa commented 6 years ago

More progress: see 5136c4b21bbf1aad54cef0cfcc581b286ef39c12.

By building up the donor_donee_pairs and donor_donee_pairs_2 tables incrementally, it looks like the first few calls can be made really fast. So if we only care about comparing a couple hundred donees, we can do that quickly. To make that number more precise, I'm testing this as follows.

First, I'm dropping all the similarity tables (especially the donor_donee_pairs and donor_donee_pairs_2 tables that we want to build up incrementally).

Second, I'm running the one-off script to set up the tables.

Third, I'm loading the procedure file.

And then I'm repeatedly calling the procedure like this:

```sql
call updateSimilarityDonee("GAVI Alliance");
call updateSimilarityDonee("World Health Organization");
call updateSimilarityDonee("PATH");
call updateSimilarityDonee("The Global Fund to Fight AIDS, Tuberculosis and Malaria");
call updateSimilarityDonee("The Rotary Foundation of Rotary International");
call updateSimilarityDonee("International Bank for Reconstruction and Development");
call updateSimilarityDonee("University of Washington");
call updateSimilarityDonee("PATH Vaccine Solutions");
call updateSimilarityDonee("Johns Hopkins University");
call updateSimilarityDonee("United States Fund for UNICEF");
```

i.e. calling the procedure for each donee, starting from the largest by donation amount.

The procedure prints the current time using MySQL's NOW() function, so we will be able to plot the number of calls against time and see what's going on (and that will help us pick a reasonable cutoff).
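The repeated calls could also be generated by a small script; a minimal sketch (the helper name is my own, and in practice the donee names would come from the donations table, ordered by total donation amount):

```python
def call_statements(donees):
    # Generate one CALL statement per donee name (illustrative helper,
    # not part of the repo); escape any double quotes in the name.
    return ['call updateSimilarityDonee("{}");'.format(name.replace('"', '\\"'))
            for name in donees]

for stmt in call_statements(["GAVI Alliance", "World Health Organization"]):
    print(stmt)
```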

riceissa commented 6 years ago

Here is progress so far:

[figure: plot of number of updateSimilarityDonee calls over time]

The plotting script is here. Run it like ./plot.py data, where data is the output of the calls to updateSimilarityDonee.

So it looks like about 1000 donees in the first hour (which is much better than the 30 per hour of the older method).

riceissa commented 6 years ago

More progress:

[figure: updated plot of number of updateSimilarityDonee calls over time]

The script actually crashed because one of the donee names was too long (I didn't realize the donee column is varchar(200) while donor is varchar(100)). It seems pretty clear what's going on, so I won't re-run it.

riceissa commented 5 years ago

Some things to do:

(This list was written while discussing with @vipulnaik )

riceissa commented 5 years ago

I've now implemented the similarity computation in Python: for donors at https://github.com/vipulnaik/donations/blob/e2b66172d8b6d66bbb797b25dedde32e95b39780/similarity/similarity.py and for donees at https://github.com/vipulnaik/donations/blob/e2b66172d8b6d66bbb797b25dedde32e95b39780/similarity/similarity_donee.py. (For donees I loop over all donees, because I was testing how long the whole computation would take.)

For donee similarity, I think it takes a little over an hour to compute the similarity information for each donee if we use the full donations table. (With a smaller donations table it can be much faster.) There are around 44,000 donees, so it's pretty impractical to run the whole thing.

The main advantage of Python over doing it directly in SQL is that we only need to store the (donee, total amount) lists for two donors at a time (this is for computing donor similarities), rather than storing the (donee, total amount) lists for every donor in an intermediate table. I think this means I don't run into the disk space problem I reported in https://github.com/vipulnaik/donations/issues/12#issuecomment-357563539.
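The memory point can be sketched as follows. The overlap metric here is a stand-in for illustration (I haven't checked which formula similarity.py actually uses); the key property is that each comparison only touches the two dicts being compared:

```python
from itertools import combinations

def overlap_similarity(a, b):
    """Similarity between two donors' {donee: total amount} dicts.
    Illustrative overlap metric, not necessarily the repo's formula."""
    return sum(min(a[d], b[d]) for d in set(a) & set(b))

def donor_similarities(donor_totals):
    # Only the two dicts being compared are in play at each step, so no
    # large intermediate table of all (donee, total amount) lists is needed.
    return {(x, y): overlap_similarity(donor_totals[x], donor_totals[y])
            for x, y in combinations(sorted(donor_totals), 2)}
```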

Also, the runtime seems to be linear in the number of donees processed (in the donee-similarity case), rather than following the weird sublinear shape of the graphs I posted earlier.

riceissa commented 5 years ago

Ok here is an idea: if we consider a single donee like MIRI, which other donees will have non-zero similarity with MIRI? The only way to have non-zero similarity is to have at least one donor in common. So we can look at all MIRI donors, then look at all donees of all MIRI donors. Every donee with non-zero similarity must be included in this list.

For a donee like MIRI, the size of "all donees of all donors to it" is around 500, which is much less than 44,000.

For a donee like PATH, this number is still really high (around 20,000) because its donors are multiple big foundations that have many donees.
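A minimal sketch of that filtering step (the function name and data shapes are my own, not from the repo):

```python
def candidate_donees(target, donations):
    """Donees that can have non-zero similarity with `target`: exactly
    those sharing at least one donor. `donations` is (donor, donee) pairs."""
    donors_of_target = {donor for donor, donee in donations if donee == target}
    return {donee for donor, donee in donations
            if donor in donors_of_target and donee != target}
```

For MIRI this candidate set is around 500 donees; only those need a similarity score computed.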

riceissa commented 5 years ago

Another idea (which can work in combination with the previous one) is to start the similarity computation for a donee when its page is first requested, storing the result in a SQL table. Then for subsequent requests we can use the stored info and display it on the page.
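The caching logic is essentially memoization; a minimal sketch, where `cache` stands in for the SQL results table and `compute` for the (slow) similarity routine (both names hypothetical):

```python
def donee_similarity(donee, cache, compute):
    # Compute similarity info on a donee's first page request; serve
    # the stored result on every subsequent request.
    if donee not in cache:
        cache[donee] = compute(donee)
    return cache[donee]
```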

riceissa commented 5 years ago

It seems like most donees have several thousand other donees with non-zero similarity (mostly because large foundations, and the donees of those foundations, dominate the donations data), so that's several minutes of computation for each donee (and close to an hour in the worst cases). For "EA" donees the situation is much better (maybe around a minute of computation per donee), because these donees tend not to have received grants from large foundations, so there aren't that many other donees to compare against when figuring out the similarity.