Generate a matrix of pairwise path lengths for all neuroelectro authors

stripathy commented 8 years ago

@svdavid, here's the neuroelectro spreadsheet; the relevant column name is 'Pmid': http://dev.neuroelectro.org/static/src/article_ephys_metadata_curated.csv

No need to re-download if you have the link I sent out from a couple weeks ago, the two spreadsheets will be very similar.

svdavid commented 8 years ago

Ok, some progress. First pass at the link between NE pmids and NT pids is here: http://neurotree.org/tmp/ne_nt_match_v1.txt

It's tab-delimited: the first column is the PMID, the second is the NT pid of the last author (0 means no match), and the third is the confidence (0 low, 1 high).
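A minimal sketch of reading a file in this layout, assuming there is no header row and that every line has exactly the three fields described. The function names and the sample rows are made up for illustration; they are not from the real file.

```python
import csv
import io

def parse_matches(text):
    """Parse tab-delimited lines of (PMID, NT pid, confidence)."""
    matches = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        matches.append({
            "pmid": int(row[0]),
            "pid": int(row[1]),            # 0 means no match
            "confidence": float(row[2]),   # 0 low, 1 high
        })
    return matches

def matched_pids(matches):
    """Return the distinct pids that have a match (pid 0 is excluded)."""
    return sorted({m["pid"] for m in matches if m["pid"] != 0})

# Made-up sample rows, only to show the shape of the data.
sample = "11111111\t77\t1\n22222222\t0\t0\n33333333\t135\t0\n"
rows = parse_matches(sample)
print(matched_pids(rows))
```

Parsing the confidence as a float is a hedge in case the column ever holds intermediate values.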

353 matches to some pid gives ~62000 pairs to calculate distances. This calculation is underway.
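As a quick check on the pair count, assuming the 353 matches correspond to 353 distinct pids (the thread does not say so explicitly):

```python
from math import comb

n_matched = 353  # matches reported in the comment above

unordered = comb(n_matched, 2)          # each pair counted once
ordered = n_matched * (n_matched - 1)   # (p1, p2) and (p2, p1) both counted

print(unordered, ordered)
```

The unordered count is 62128, which agrees with the "~62000 pairs" estimate; a table that stores both orderings would hold twice as many rows.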

rgerkin commented 8 years ago

@stripathy @svdavid Since the new approach I proposed for using path length vectors as features requires the path length matrix, should I wait until this is more fleshed out, or proceed with the one we have now?

stripathy commented 8 years ago

@rgerkin it's up to you - either way we'll want to do the analysis on just CA1 pyramidal cells and the analysis should be mostly the same given more cell types.

svdavid commented 8 years ago

At this point it might make sense to see if you can deal with the current data formatting and/or if there's any additional info you'd want to extract. The path matrix and pmid-pid links may be improved, but their structure should not change.

Speaking of that, the table where I'm storing the pairwise distances has one row for each (pid1,pid2) pair. It might be a little easier to export it that way, just to avoid having so many hundreds of columns, e.g.:

    mysql> select p1,p2,d from pairDist limit 20;
    +----+-----+------+
    | p1 | p2  | d    |
    +----+-----+------+
    | 77 | 135 |    5 |
    | 77 | 169 |    8 |
    | 77 | 219 |    6 |
    | 77 | 266 |    5 |
    | 77 | 272 |    6 |
    | 77 | 297 |    8 |
    | 77 | 366 |    4 |
    | 77 | 368 |    5 |
    | 77 | 404 |    5 |
    | 77 | 455 |    8 |
    | 77 | 458 |    8 |
    | 77 | 685 |    8 |
    | 77 | 691 |    4 |
    | 77 | 700 |   -1 |
    | 77 | 809 |    6 |
    | 77 | 874 |    5 |
    | 77 | 875 |    4 |
    | 77 | 892 |    5 |
    | 77 | 906 |    8 |
    | 77 | 968 |    5 |
    etc....

That work for you?

stephen
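For anyone who prefers the square layout, the row-per-pair export can be pivoted back into a matrix. This is only a sketch: `to_matrix` is a hypothetical helper, the diagonal of 0 is an assumption, and -1 is reused as the "not connected" marker for pairs that have no row.

```python
def to_matrix(rows, missing=-1):
    """Pivot (p1, p2, d) rows into a square nested-list matrix."""
    pids = sorted({p for p1, p2, _ in rows for p in (p1, p2)})
    index = {pid: i for i, pid in enumerate(pids)}
    n = len(pids)
    matrix = [[missing] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 0  # assumption: a node is at distance 0 from itself
    for p1, p2, d in rows:
        matrix[index[p1]][index[p2]] = d
        matrix[index[p2]][index[p1]] = d  # fill both orderings
    return pids, matrix

# Three rows copied from the query output above.
rows = [(77, 135, 5), (77, 169, 8), (77, 700, -1)]
pids, matrix = to_matrix(rows)
print(pids)
print(matrix[0])
```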

svdavid commented 8 years ago

Extract pairwise distance matrix through a common ancestor. For ease of use, this contains redundancies, i.e., pairs (p1,p2) and (p2,p1) have the same distance. Does this format work? Should generalize to any connection matrix.
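To make "distance through a common ancestor" concrete, here is a sketch of one way to compute it, not the code running on neurotree.org. It assumes a `parents` mapping from each pid to its trainers, treats a node as its own ancestor at distance 0, and returns a p0 of 0 for unconnected pairs; all of these are assumptions, and the toy tree is made up.

```python
from collections import deque

def ancestor_depths(parents, start):
    """BFS upward: map `start` and each of its ancestors to a hop count."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in depths:
                depths[parent] = depths[node] + 1
                queue.append(parent)
    return depths

def common_ancestor_distance(parents, p1, p2):
    """Return (d, p0) for the shortest path through a common ancestor.

    Returns (-1, 0) when the two nodes share no ancestor.
    """
    up1 = ancestor_depths(parents, p1)
    up2 = ancestor_depths(parents, p2)
    shared = set(up1) & set(up2)
    if not shared:
        return -1, 0
    # Smallest summed hop count wins; ties go to the lowest pid.
    p0 = min(shared, key=lambda a: (up1[a] + up2[a], a))
    return up1[p0] + up2[p0], p0

# Toy tree: 1 trained 2 and 3; 2 trained 4; 3 trained 5; 6 trained 7.
parents = {2: [1], 3: [1], 4: [2], 5: [3], 7: [6]}
print(common_ancestor_distance(parents, 4, 5))
print(common_ancestor_distance(parents, 4, 7))
```

Because the function is symmetric in p1 and p2, storing both orderings is redundant but harmless, as noted above.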

http://neurotree.org/beta/include/dist_mtx.php

This is currently being populated (2/18/16 pm). Please don't query repeatedly, since it's not a trivial-sized query.

Note in passing: You'll see that lower pids tend to be a lot more connected while higher pids (added more recently) are more likely not to be connected to most others.

svdavid commented 8 years ago

Another thought: For fingerprint vectors, we may not need to populate/analyze the complete N x N NE author matrix. Probably, if we found M interesting nodes (4 grandfathers or maybe a bigger set), we could just generate the N x M matrix of distances to them. This would probably provide as much useful information and would be faster to analyze and generate.
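A sketch of what the N x M fingerprint could look like, assuming the pairwise distances are already available as a dict keyed by (p1, p2). The `fingerprints` helper, the choice of hubs, and the distance values are illustrative only.

```python
def fingerprints(dist, authors, hubs, missing=-1):
    """Return {author: [distance to each hub]}, i.e. an N x M matrix by rows."""
    table = {}
    for a in authors:
        row = []
        for h in hubs:
            if a == h:
                row.append(0)  # assumption: a hub is at distance 0 from itself
            else:
                # Accept either ordering of the pair; -1 means not connected.
                row.append(dist.get((a, h), dist.get((h, a), missing)))
        table[a] = row
    return table

# Illustrative distances, not real Neurotree values.
dist = {(77, 135): 5, (77, 366): 4, (135, 366): 3}
hubs = [135, 366]
fp = fingerprints(dist, authors=[77, 135, 900], hubs=hubs)
print(fp)
```

Each author gets a vector of length M, so the cost grows with N x M rather than N x N.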

rgerkin commented 8 years ago

@svdavid I think we can discover the M interesting nodes automatically by first doing the full analysis on the full N x N matrix, and then extracting the interesting M of those N. Or am I overestimating the likelihood that the M of interest will even have papers here (maybe they trained many of the N but otherwise haven't published much in the neuroelectro CA1 time frame)?

@stripathy Right now I am enduring the grind of making all of your code work in Python 3. Mostly string encodings and such.

rgerkin commented 8 years ago

Should I abandon some of @stripathy's code in favor of http://neurotree.org/beta/include/dist_mtx.php, or should I wait on that?

stripathy commented 8 years ago

> Another thought: For fingerprint vectors, we may not need to populate/analyze the complete N x N NE author matrix. Probably, if we found M interesting nodes (4 grandfathers or maybe a bigger set), we could just generate the N x M matrix of distances to them. This would probably provide as much useful information and would be faster to analyze and generate.

Yeah - I agree. @rgerkin I think that @svdavid is proposing this mostly as a practicality thing. My guess is that it's not computationally trivial to generate the full N x N NE author matrix (for whatever reason) but it'd be much quicker just to generate the N x M matrix. @svdavid am I right in this?

> @stripathy Right now I am enduring the grind of making all of your code work in Python 3. Mostly string encodings and such.

Ah sorry! Hopefully it's not too painful. This side project seems like a good excuse to migrate over to python 3.

svdavid commented 8 years ago

Learning some interesting things about database table locks and the intricacies of running multiple processes at once on the same db. Running the NE distance matrix calc on the mirror server now, and it's going a lot faster. Also realizing how it can be really sped up if/when I get around to writing some more code.

In the meantime, take 1 of the full matrix should be done tomorrow morning.

I'm also realizing that a reduced-size fingerprint scheme may provide an interesting way of clustering the whole tree and visualizing people's training profiles. E.g., if you pick some big nodes in different fields (chemistry, physics, math, anthropology, etc.), then you can easily visualize how far their training is from each of those different areas. So I'll be thinking about ways to integrate fingerprints and active updating of them into the bigger database.

rgerkin commented 8 years ago

If you are going to pick some key nodes on which to build the reduced fingerprint, will you just take the N nodes with the most descendants (that have papers in neuroelectro)? Or the N that maximize some other metric?

svdavid commented 8 years ago

Haven't decided on what strategy to use yet, but that sounds like a good one. Possibly could do some sort of eigenvector-like reduction of the big matrix to find important nodes. The thing about just picking based on big descendant counts is that you tend to get a lot of nodes that are really close to each other.

For now, we can just stick with the big matrix, since that's worth having as a baseline for testing any reduction.

rgerkin commented 8 years ago

Possibly this would be solved by some eigenvector approach (although I don't know what that looks like in a directed graph), but another approach would be:

1. Find the node with the most descendants.
2. Remove all of those descendants from the tree.
3. Repeat step 1.

Thereby finding a lot of nodes that don't overlap much.
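The steps above can be sketched as follows, assuming a `children` mapping from each pid to the people they trained. One detail goes beyond the steps as written: the chosen node is removed along with its descendants so that it cannot be picked again. The toy tree is made up.

```python
def descendants(children, root):
    """Collect every node reachable below `root` (excluding root itself)."""
    seen, stack = set(), [root]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def pick_hubs(children, n_hubs):
    """Greedy: take the node with the most remaining descendants, prune, repeat."""
    nodes = set(children) | {c for kids in children.values() for c in kids}
    remaining = set(nodes)
    hubs = []
    while len(hubs) < n_hubs:
        best, best_desc = None, set()
        for node in sorted(remaining):  # sorted so ties go to the lowest pid
            desc = descendants(children, node) & remaining
            if len(desc) > len(best_desc):
                best, best_desc = node, desc
        if best is None:
            break  # no remaining node has any descendants
        hubs.append(best)
        remaining -= best_desc | {best}  # assumption: drop the hub itself too
    return hubs

# Toy forest with two roots (1 and 6).
children = {1: [2, 3], 2: [4, 5], 3: [], 6: [7]}
print(pick_hubs(children, 3))
```

With a single root that is an ancestor of everyone, the first pick would remove the whole tree, so in practice the candidate set might need to exclude the very top of the tree.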

rgerkin commented 8 years ago

@stripathy The notebook is working for me now after a few changes (747cf1e). I'll tackle the fingerprint construction and use in model fitting next.

svdavid commented 8 years ago

@rgerkin The most-prolific pruning option might work. Only complication there would be if one or two people were ancestors for basically everyone in the tree. The tree does get narrow toward the top.

So, yeah, we do want to find hubs with non-overlapping descendants, but not necessarily a lot of them. Maybe some sort of clustering and then pick the best exemplars from each cluster?

For an NE-specific analysis, we could reference back to the pub data, e.g., define people with the greatest average methodological differences as hubs. This might be too tautological.

Another passing thought: The hubs don't have to be in the NE pub matrix. Though maybe you've got papers from all the grandparents?

Need to sleep on it.

stripathy commented 8 years ago

> Another passing thought: The hubs don't have to be in the NE pub matrix. Though maybe you've got papers from all the grandparents?

Most of the papers indexed in NE were published after 1997, after journals switched to HTML from PDF. So NE doesn't usually have papers from the uber-grandfathers of ephys like Eccles, Sherrington, Kuffler, Hodgkin, or Huxley.

While our immediate goal is to integrate NE with NT, maybe for the purpose of indexing NT paths, or the intrinsic ephys-relevant part of NT, we'd want to include at a minimum all the NE authors + all their grandfathers going back at least 2-3 hops. Remembering back, I think relationships follow sort of an "out of Africa" type model, where if you go back far enough there are really just 10 neuroscientists that everyone trained with. But if you go back only 2-3 generations (i.e., who trained the people who published in NE), there really are specific neurosci schools, centered around a small number of people (like Sakmann, Llinas, David Prince, etc.).

svdavid commented 8 years ago

Ha... That's an interesting idea. We can find the minimum set of grandparents, i.e., ancestors no more than X hops back from the NE set, that spans the entire set. Fairly objective, and it should automatically give us the diversity we need. Need to do a little coding to take care of that, which may happen today if I can get my Cosyne poster done soon.
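One way to approach this is greedy set cover, sketched below. Note that this is a swapped-in approximation: greedy selection is not guaranteed to find the true minimum set. The `parents` mapping, the function names, and the toy data are all assumptions for illustration.

```python
def ancestors_within(parents, start, max_hops):
    """Ancestors of `start` reachable in at most `max_hops` upward steps."""
    frontier, found = {start}, set()
    for _ in range(max_hops):
        frontier = {p for node in frontier for p in parents.get(node, ())} - found
        found |= frontier
    return found

def greedy_cover(parents, authors, max_hops):
    """Repeatedly pick the ancestor that covers the most uncovered authors."""
    covers = {}
    for a in authors:
        for anc in ancestors_within(parents, a, max_hops):
            covers.setdefault(anc, set()).add(a)
    uncovered, chosen = set(authors), []
    while uncovered and covers:
        # sorted() makes ties deterministic: the lowest pid wins.
        best = max(sorted(covers), key=lambda anc: len(covers[anc] & uncovered))
        gained = covers[best] & uncovered
        if not gained:
            break  # remaining authors have no ancestor within max_hops
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered

# Toy data: 0 trained 1 and 2; 1 trained 10 and 11; 2 trained 12; 13 is isolated.
parents = {10: [1], 11: [1], 12: [2], 1: [0], 2: [0], 13: []}
authors = [10, 11, 12, 13]
print(greedy_cover(parents, authors, max_hops=1))
print(greedy_cover(parents, authors, max_hops=2))
```

Raising the hop limit shrinks the chosen set toward the top of the tree, which matches the intuition that going back far enough leaves only a handful of ancestors.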

For some reason my distance matrix calculator slowed down overnight. Annoying. But it's getting there.

svdavid commented 8 years ago

I broke down and rewrote the pairwise distance code, and now it's running at a sane speed (i.e., ~100x faster). I also noticed that it's interesting to look at who shows up as a common ancestor. I'm now recording that as "p0" in the output of http://neurotree.org/beta/include/dist_mtx.php.

If I look at the most frequent occurrences of common ancestors, I get a list of the usual suspects. Maybe these are good fingerprint hubs? This is what I see for the first 200 or so NE people:

    mysql> select p0, count(p1),people.firstname,people.lastname from pairDist left join people on p0=pid where p0>0 group by p0 order by count(p1) desc limit 30;
    +-------+-----------+-------------+---------------+
    | p0    | count(p1) | firstname   | lastname      |
    +-------+-----------+-------------+---------------+
    |   114 |     10295 | Sir John    | Eccles        |
    |   115 |      6201 | Sir Charles | Sherrington   |
    |  1713 |      2705 | John        | Langley       |
    |   172 |      2641 | Sir Michael | Foster        |
    |   151 |      2134 | Johannes    | Müller        |
    |   146 |      2034 | Hermann     | von Helmholtz |
    |   223 |      1396 | Carl        | Ludwig        |
    |   517 |      1370 | Ernst       | Weber         |
    |  3011 |      1179 | Rudolf      | Virchow       |
    |    65 |      1050 | Stephen     | Kuffler       |
    |   116 |       962 | Edgar       | Adrian        |
    |   119 |       741 | Karl        | Lashley       |
    |  6684 |       671 | Friedrich   | Goltz         |
    |   511 |       650 | Franz       | Nissl         |
    |   195 |       631 | Henry       | Bowditch      |
    |  1857 |       606 | David       | Prince        |
    |   122 |       589 | James       | Angell        |
    |   196 |       573 | Claude      | Bernard       |
    |  4339 |       544 | Otto        | Meyerhof      |
    |   135 |       489 | Bert        | Sakmann       |
    |   134 |       471 | Otto        | Creutzfeldt   |
    |  1716 |       448 | Archibald   | Hill          |
    |   188 |       432 | Philip      | Bard          |
    |   204 |       402 | John        | Fulton        |
    |   524 |       350 | Thomas      | Huxley        |
    |   206 |       349 | Harvey      | Cushing       |
    |   812 |       335 | Roger       | Nicoll        |
    | 21405 |       280 | Robert      | Bunsen        |
    |   171 |       278 | Bernard     | Katz          |
    |  1888 |       270 | Oswald      | Schmiedeberg  |
    +-------+-----------+-------------+---------------+
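For anyone working from an exported copy of the table rather than MySQL, the same tally can be done in Python. This sketch assumes each row is a (p1, p2, d, p0) tuple and mirrors the `p0>0` filter in the query; the rows shown are made up.

```python
from collections import Counter

def top_common_ancestors(rows, limit=30):
    """Count how often each p0 appears, skipping pairs with no common ancestor."""
    counts = Counter(p0 for _p1, _p2, _d, p0 in rows if p0 > 0)
    return counts.most_common(limit)

# Made-up rows in (p1, p2, d, p0) order.
rows = [(1, 2, 4, 114), (1, 3, 6, 114), (2, 3, 5, 115), (1, 9, -1, 0)]
top = top_common_ancestors(rows)
print(top)
```

Joining the pids to names would still need the people table, which is not part of the export described in this thread.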

svdavid commented 8 years ago

The matrix is complete! Of course, about half the pairs (60K/120K) are not connected (yet!)

stripathy commented 8 years ago

@svdavid I get a 500 error upon trying to load this page: http://neurotree.org/beta/include/dist_mtx.php. If it's a big file, you could just add it to the github repo.

stripathy commented 8 years ago

From looking at this: http://neurotree.org/neurotree/tree.php?pid=115&fontsize=0&pnodecount=4&cnodecount=2 , if my "Out of Africa" hypothesis is true, then Eccles and Sherrington are basically Africa.

stripathy commented 8 years ago

Thanks @svdavid for committing this: https://github.com/neuroelectro/neuroelectro_neurotree/commit/8ed9ba7af2e186843725ead77a7e47ef00ea24da. I'm closing this issue for now.

nathaliebin commented 7 years ago

Hi @svdavid,

Could you please update the distance matrix displayed here: http://neurotree.org/beta/include/dist_mtx.php with the neurotree author PIDs in this file: UniquePID.txt

We have updated the listing of authors in NeuroElectro and we noticed that not all of these authors had corresponding entries in the distance matrix output that you had previously generated.

Thanks, @nathaliebin and @stripathy

svdavid commented 7 years ago

@nathaliebin : got your request, and working on it! I've been traveling and tied up with other stuff. Should be able to get to it soon.

stripathy commented 7 years ago

hi @svdavid have you been able to update this yet?

svdavid commented 7 years ago

Ok! The table is updating. When I merge the new set of pids and previously included pids, I get 1147 distinct nodes. You might check the list that you get out of http://neurotree.org/beta/include/dist_mtx.php and make sure it includes all the pids you need. It's taking a little while, so the 1147 x 1147 distance matrix may not be completely done til late tonight.
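A small sketch of that check, assuming the needed pids have been read into a list and the matrix output into (p1, p2, d, p0) tuples. `missing_pids` is a hypothetical helper and the values are illustrative.

```python
def missing_pids(needed, rows):
    """Return the needed pids that never appear as p1 or p2 in the matrix rows."""
    present = {p for p1, p2, *_ in rows for p in (p1, p2)}
    return sorted(set(needed) - present)

# Illustrative values only.
needed = [77, 135, 999]
rows = [(77, 135, 5, 114)]
print(missing_pids(needed, rows))
```

An empty result would mean every requested pid shows up in at least one pair.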

svdavid commented 7 years ago

Now there are too many entries and dist_mtx is running out of memory. Do you have a mysql client you can use? I can give you a query to pull directly from the database.

stripathy commented 7 years ago

Yes - there's a mysql client we can use. Feel free to send me credentials to login to your database or a dump of the database that we can load up here locally.

Shreejoy Tripathy Post-Doctoral Researcher Department of Psychiatry University of British Columbia

svdavid commented 7 years ago

@nathaliebin @stripathy : OK, give this a whirl. And let me know how it goes.

To connect: host=klab.c3se0dtaabmj.us-west-2.rds.amazonaws.com user=dacuna pw=dacuna database=academictree

Then run the query: select * from pairDistNE;

p1,p2 = pid of node pair
d = distance between them through common ancestor (d=-1 means no connection)
p0 = pid of common ancestor
