an easy to use edgelist for flywire neurons

alexanderbates commented 3 years ago

I would like to build an easy to use edgelist for flywire neurons. With hemibrainr, you can access one I have made for the hemibrainr google drive:

elist = hemibrainr::hemibrain_elist()

And you get:

# Source:   lazy query [?? x 7]
# Database: sqlite 3.33.0 [/Volumes/GoogleDrive/Shared drives/hemibrainr/hemibrain_neurons/hemibrainr_data.sqlite]
        post        pre post.Label pre.Label count    norm connection       
       <dbl>      <dbl> <chr>      <chr>     <dbl>   <dbl> <chr>            
1  425790257 5813105172 dendrite   dendrite   4296 0.0339  dendrite-dendrite
2 5813105172  425790257 dendrite   dendrite   1739 0.0225  dendrite-dendrite
3  424767514 1196854070 dendrite   axon        987 0.0377  axon-dendrite    
4  424767514  393766777 dendrite   axon        951 0.0363  axon-dendrite    
5  425790257 5813022424 dendrite   axon        829 0.00654 axon-dendrite    
6  425790257 1048172314 dendrite   axon        742 0.00585 axon-dendrite

It is broken down by axon/dendrite. I could make something similar for the flywire neurons that are 'processed' on FlyEM1 each night and put it on the currently private 'hemibrain' google drive, at hemibrain/fafbsynapses.

Keep the same column names? Also add transmitter as top.nt?

jefferis commented 3 years ago

A few quick thoughts.

This is possible, but it will be quite big and will overlap heavily with the 33GB csv files that Sven is making.
I think we need to prune the long tail of weak connections / connections to small objects to make things manageable. Think of removing all connections with n<=3.
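
As a rough sketch of that pruning step, assuming the edgelist is available as an ordinary data frame with a count column. The rows below are invented for illustration, and the cutoff of 3 is simply the value suggested above:

```r
# Toy edgelist; ids and counts are invented for illustration only.
elist <- data.frame(
  pre   = c("a", "a", "b", "c", "d"),
  post  = c("b", "c", "c", "a", "a"),
  count = c(120L, 3L, 1L, 15L, 4L),
  stringsAsFactors = FALSE
)

# Drop the long tail: keep only connections with more than 3 synapses.
min_count <- 3L
pruned <- elist[elist$count > min_count, ]

# Fraction of rows that survive the cutoff.
kept_fraction <- nrow(pruned) / nrow(elist)
```

On a lazy database table, the same condition could probably be applied with dplyr::filter(count > 3) before collecting, so that the weak rows never leave the database.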

You need to think about the database table structure more carefully:

We need to be very careful about the storage mode of the ids (flywire ids must be 64 bit ints, not doubles, or there will be lost precision / missed ids).
Columns that are character probably need to be stored as factors (or as bare integers that are translated into factors) to save (a lot of) space.
count should also be an int.
Ideally norm should be stored with restricted floating point precision.
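
To make these storage points concrete, here is a small base R sketch. The two ids are hypothetical 18-digit values in the flywire range, the norm values are invented, and the bit64 part only runs if that package happens to be installed:

```r
# Two hypothetical 18-digit ids that differ only in the last digit.
id_a <- "720575940621039145"
id_b <- "720575940621039146"

# As doubles they collapse to the same value: above 2^53 not every
# integer is representable, so distinct ids can become indistinguishable.
collide_as_double <- as.numeric(id_a) == as.numeric(id_b)

# Optional: with bit64 installed, the same ids stay distinct.
if (requireNamespace("bit64", quietly = TRUE)) {
  ids64 <- bit64::as.integer64(c(id_a, id_b))
  distinct_as_int64 <- ids64[1] != ids64[2]
}

# Labels stored as small integer codes plus one lookup vector.
label_levels <- c("axon", "dendrite")
pre_label <- c("dendrite", "dendrite", "axon", "axon")
pre_code <- match(pre_label, label_levels)  # integer codes into label_levels
decoded <- label_levels[pre_code]           # translate codes back to labels

# Counts as integers, norm rounded to 3 significant digits.
count <- as.integer(c(4296, 1739, 987, 951))
norm <- signif(c(0.03391234, 0.02253456), 3)
```

Keeping ids as character is a safe fallback inside R when bit64 is not available, at the cost of more space.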

Note that many of these issues also apply to what's in hemibrainr and will probably require you to create a SQL table definition.

jefferis commented 3 years ago

Depending on what you are doing, you will likely need to index post and pre to get competitive query performance. This again means that they should really be int.
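
A minimal sketch of what such a table definition with indexes might look like in SQLite, via DBI and RSQLite (both assumed to be installed). The table name elist, the index names and the integer label codes are illustrative choices, not anything hemibrainr or fafbseg currently defines, and the inserted ids are made up:

```r
library(DBI)

# In-memory database; bigint = "integer64" asks RSQLite to return
# 64-bit integers as integer64 rather than as doubles.
con <- dbConnect(RSQLite::SQLite(), ":memory:", bigint = "integer64")

# Explicit column types: ids, label codes and counts are all INTEGER.
dbExecute(con, '
  CREATE TABLE elist (
    post         INTEGER NOT NULL,
    pre          INTEGER NOT NULL,
    "post.Label" INTEGER NOT NULL,
    "pre.Label"  INTEGER NOT NULL,
    count        INTEGER NOT NULL,
    norm         REAL
  )')

# One index per lookup direction (upstream and downstream partners).
dbExecute(con, "CREATE INDEX elist_pre_idx ON elist (pre)")
dbExecute(con, "CREATE INDEX elist_post_idx ON elist (post)")

# Insert one made-up row and read the ids back without going through doubles.
dbExecute(con, "INSERT INTO elist VALUES
  (720575940621039145, 720575940621039146, 2, 1, 12, 0.0339)")
res <- dbGetQuery(con, "SELECT pre, post FROM elist")
pre_back <- as.character(res$pre)

indexes <- dbGetQuery(
  con,
  "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'elist'"
)$name

dbDisconnect(con)
```

SQLite stores REAL as an 8-byte float, so any restriction of precision for norm would have to happen before writing, for example by rounding.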

jefferis commented 3 years ago

Next is the question of updating this table. You can expect each partner query to take about 20s on average. We cannot just run this for 10,000 neurons every night. Even 1000 will make us unpopular. You need to reduce the amount of time spent hitting flywire services. This means:

You need to avoid the flywire_leaves calls if possible, either by caching or, even more effectively, by using the static versions of the flywire synapses tables and only using flywire_leaves when the rootids are not present in that table.
For mapping of partner synapses to rootids, we can process a few hundred synapses per second, but this is going to start getting slow when we have e.g. 1000 neurons by 5000 synapses (hours), and we may make ourselves unpopular. We could avoid this if we knew which rootids have been invalidated since the last dump. I have asked them about this and it looks like it may be possible to get this info.
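
The back-of-envelope numbers above, and the idea of only hitting the live services for ids that are missing from the static dump or have been invalidated since, can be sketched as follows. The rate of 300 synapses per second is an assumed stand-in for "a few hundred", all ids are hypothetical, and no fafbseg functions are called:

```r
# Figures quoted in the comment above.
secs_per_partner_query <- 20
n_neurons_full <- 10000
full_refresh_hours <- n_neurons_full * secs_per_partner_query / 3600

# "A few hundred synapses per second": 300 is an assumed value.
synapses_per_sec <- 300
mapping_hours <- 1000 * 5000 / synapses_per_sec / 3600

# Hypothetical root ids, kept as character to avoid precision loss.
wanted      <- c("720575940621039145", "720575940621039146",
                 "720575940621039147")
in_static   <- c("720575940621039145", "720575940621039147")
invalidated <- c("720575940621039147")

# Only ids that are absent from the static dump, or present but
# invalidated since the dump, need a live query.
needs_query <- union(setdiff(wanted, in_static),
                     intersect(wanted, invalidated))
```

Under these assumptions, a full nightly refresh of 10,000 neurons would take roughly 56 hours of partner queries, and mapping 1000 neurons by 5000 synapses would take between 4 and 5 hours, which is why restricting live calls to the needs_query set matters.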

jefferis commented 1 year ago

closing as this has been superseded by Sven's dumps

natverse / fafbseg

an easy to use edgelist for flywire neurons #74