Use floWeaver for Sankey diagrams

ricklupton commented 6 years ago

Hi Konstantin,

It'd be great if you'd like to use floweaver to add a Sankey diagram to the article/examples. I had a quick go at adding it here:

https://hub.mybinder.org/user/ricklupton-pymrio_article-onl8r169/notebooks/notebook/pymrio-tutorial-for-wiod.ipynb

Not sure if that's the kind of diagram you had in mind? Happy to discuss how to change it if anything's not clear.

[Image: wiod_sankey]

konstantinstadler commented 6 years ago

Looks great. However, I need a password to access the mybinder instance.

ricklupton commented 6 years ago

Sorry, that was the link to the running mybinder instance -- here is the right link:

https://mybinder.org/v2/gh/ricklupton/pymrio_article/master?filepath=notebook%2Fpymrio-tutorial-for-wiod.ipynb

konstantinstadler commented 6 years ago

ok, I condensed the code a little (see branch sankey and the notebook there).

konstantinstadler commented 6 years ago

However, this is a special case as the flows are aggregated to region already.

The generic output of the flow matrix would be as in line 20 of the example notebook (region-sector in both columns and rows). It would be good if we could automatically make a flow chart for that.

I assume we need to put the region and sector strings together (or does floweaver handle a MultiIndex?). Region and sector would then become something like source and target labels, e.g. 'Other - AtB' and 'Other - C' in the WIOD example.

Where do you see the best place to add the import or export function? I would prefer a solution which produces Sankeys with one command. I see the following possibilities:

(a) pymrio contains a function 'make_sankey' which relies on floweaver. I'm not so keen on that, as it requires an explicit dependency.
(b) pymrio exports a file (or set of files) which floweaver can handle with one command (basically, floweaver would need to be able to import a saved pymrio extension).
(c) pymrio exports a dictionary with all the data floweaver needs to build a Sankey.
(d) floweaver accepts a pymrio.extension object as a parameter and does the rest (the reverse of (a)).
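
To make the "put the strings in region and sector together" idea concrete, here is a minimal sketch using pandas only. The toy matrix, its labels and values, and the helper name join_levels are made up for illustration, and it assumes rows are read as sources and columns as targets; the real pymrio matrix may need a different reading.

```python
import pandas as pd

# Toy stand-in for a flow matrix with (region, sector) on rows and columns.
# Labels and values are invented for illustration only.
labels = pd.MultiIndex.from_product(
    [["reg1", "reg2"], ["AtB", "C"]], names=["region", "sector"]
)
matrix = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0],
     [13.0, 14.0, 15.0, 16.0]],
    index=labels,
    columns=labels,
)


def join_levels(index, sep=" - "):
    """Collapse a MultiIndex into flat strings such as 'reg1 - AtB'."""
    return [sep.join(map(str, key)) for key in index]


# Flatten both axes (assumption: rows are sources, columns are targets),
# then melt the square matrix into one row per source-target pair.
flat = matrix.copy()
flat.index = pd.Index(join_levels(matrix.index), name="source")
flat.columns = pd.Index(join_levels(matrix.columns), name="target")
flows = flat.reset_index().melt(
    id_vars="source", var_name="target", value_name="value"
)
print(flows.head())
```

The resulting table has one 'source', 'target' and 'value' column, which is the kind of long format the examples in this thread pass to fw.weave.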

ricklupton commented 6 years ago

You're right, this wasn't really using any of the aggregation in floweaver. This is using the full data to make the same picture:

import floweaver as fw

# Convert matrix to source-target format
flows = wiod.CH4_source.D_cba.unstack(level=[0, 1])
flows.index.set_names(['r1', 's1', 'r2', 's2'], inplace=True)
flows = flows.reset_index(name='value')
flows['source'] = flows['r1'] + '-' + flows['s1']
flows['target'] = flows['r2'] + '-' + flows['s2']

regions = list(flows.r1.unique())
sectors = list(flows.s1.unique())

partition_receiving = fw.Partition.Simple('r2', regions)
partition_source = fw.Partition.Simple('r1', regions)

sdd = fw.SankeyDefinition(
    nodes={'sources': fw.ProcessGroup(list(flows.source.unique()), partition_source, title='impact in'),
           'targets': fw.ProcessGroup(list(flows.target.unique()), partition_receiving, title='consumption in')},
    bundles=[fw.Bundle('sources', 'targets')],
    ordering=[['sources'], ['targets']],
    flow_partition=partition_source)
fw.weave(sdd, flows).to_widget()

But more interesting would be to show sectors and regions together:

import floweaver as fw

# Convert flow matrix to source-target format
flows = wiod.CH4_source.D_cba.unstack(level=[0, 1])
flows.index.set_names(['r1', 's1', 'r2', 's2'], inplace=True)
flows = flows.reset_index(name='value')
flows['source'] = flows['r1'] + '-' + flows['s1']
flows['target'] = flows['r2'] + '-' + flows['s2']

regions = list(flows.r1.unique())
sector_groups = [
    ('Sector group 1', ['AtB', 'C']),
    ('Sector group 2', ['15t16', '17t18']),
    # other as "_" by default
]

sdd = fw.SankeyDefinition(
    nodes={'sources': fw.ProcessGroup(list(flows.source.unique()), fw.Partition.Simple('r1', regions), title='impact in'),
           'source_sectors': fw.Waypoint(fw.Partition.Simple('s1', sector_groups)),
           'target_sectors': fw.Waypoint(fw.Partition.Simple('s2', sector_groups)),
           'targets': fw.ProcessGroup(list(flows.target.unique()), fw.Partition.Simple('r2', regions), title='consumption in')},
    bundles=[fw.Bundle('sources', 'targets', waypoints=['source_sectors', 'target_sectors'])],
    ordering=[['sources'], ['source_sectors'], ['target_sectors'], ['targets']],
    flow_partition=fw.Partition.Simple('r1', regions))
fw.weave(sdd, flows).to_widget(width=800, height=400)

[Image: screenshot-2018-2-15 pymrio-tutorial-for-wiod 1]

I don't know what the sectors mean but I guess you can replace sector_groups with something sensible!

ricklupton commented 6 years ago

For where the import/export function should live, I think it would make sense for floweaver to accept a matrix in the form pymrio provides, as that's a perfectly reasonable and generic way to describe a graph. Then you could write something like this:

flows = fw.Dataset.from_matrix(wiod.CH4_source.D_cba)

Is there one obvious Sankey diagram you want from the MRIO data, or are there many? Just within the basic structure above, you can:

- Change the order (region 1, sector 1, sector 2, region 2) in the example above to (sector 1, region 1, region 2, sector 2) or (sector 1, sector 2) or (region 1, region 2)
- Change which attribute is used to set the colours that flow through the diagram (in the example above, it's sector 1 but could be any of the others)
- Group sectors at different levels of detail

I think it would make sense if pymrio could provide lists in the format of sector_groups for different levels of grouping -- or in some equivalent format if necessary?
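
As a rough illustration of what "lists in the format of sector_groups" could look like when derived from a sector-to-group mapping, here is a small pure-Python sketch. The function name groups_from_mapping, the mapping itself and the extra sector codes are hypothetical; nothing like this exists in pymrio as far as this thread shows.

```python
def groups_from_mapping(sector_to_group, all_sectors, other_label="Other"):
    """Build [(group_name, [sector, ...]), ...] from a sector -> group mapping.

    Sectors missing from the mapping are collected under other_label.
    """
    groups = {}
    for sector in all_sectors:
        group = sector_to_group.get(sector, other_label)
        groups.setdefault(group, []).append(sector)
    return list(groups.items())


# Made-up grouping, mirroring the placeholder groups used in this thread.
sector_to_group = {
    "AtB": "Sector group 1",
    "C": "Sector group 1",
    "15t16": "Sector group 2",
    "17t18": "Sector group 2",
}
all_sectors = ["AtB", "C", "15t16", "17t18", "19", "20"]

sector_groups = groups_from_mapping(sector_to_group, all_sectors)
print(sector_groups)
```

Different levels of detail would then just be different mappings passed to the same helper.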

Doing the above would make it simpler to define custom Sankey diagrams using pymrio data, but wouldn't get you to one command. But the fw.SankeyDefinition is very specific to the application, so I don't think it would make sense to build that into floweaver. Perhaps pymrio can have a soft dependency on floweaver, so it can do one-command Sankey diagrams if floweaver is installed, but not require it as a dependency?
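
A soft dependency of this kind is commonly done with a lazy import. The sketch below is one possible shape, not an existing pymrio function: optional_import and make_sankey are hypothetical names, and the only floweaver calls used are the fw.weave(...).to_widget() ones already shown above.

```python
import importlib


def optional_import(module_name, purpose):
    """Import a module on demand, with a helpful error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"{module_name} is required for {purpose}; "
            f"install it with 'pip install {module_name}'."
        ) from err


def make_sankey(flows, definition):
    """Hypothetical one-command helper.

    floweaver is imported only when this function is called, so the
    package can be installed and used without it.
    """
    fw = optional_import("floweaver", "Sankey diagrams")
    return fw.weave(definition, flows).to_widget()
```

With this pattern, importing the package never fails because of floweaver; only calling make_sankey without it installed raises the error.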

konstantinstadler commented 6 years ago

Adding to the code examples: the second graph looks great, but I think something is wrong. The pymrio flow matrix and its MultiIndex should be interpreted as country1-sector1 to ... country1-sector2 to ... Thus, before passing from the source to the target, all "flows" from a country should go to the sectors within that country (and vice versa for the target).

I tried to come up with my own solution, but even for the small example it more or less freezes the notebook:

import floweaver as fw

# Convert matrix to source-target format
flows = wiod.CH4_source.D_cba.unstack(level=[0, 1])
flows.index.set_names(['r1', 's1', 'r2', 's2'], inplace=True)
flows = flows.reset_index(name='value')
flows['source'] = flows['r1'] + '-' + flows['s1']
flows['target'] = flows['r2'] + '-' + flows['s2']

partition_receiving = fw.Partition.Simple('target', list(flows.target.unique()))
partition_source = fw.Partition.Simple('source', list(flows.source.unique()))

sdd = fw.SankeyDefinition(
    nodes={'sources': fw.ProcessGroup(list(flows.source.unique()), partition_source, title='impact in'),
           'targets': fw.ProcessGroup(list(flows.target.unique()), partition_receiving, title='consumption in')},
    bundles=[fw.Bundle('sources', 'targets')],
    ordering=[['sources'], ['targets']],
    flow_partition=partition_source)
fw.weave(sdd, flows).to_widget()

The figure is also too crowded to be useful. Perhaps a region-sector source and target region is just too much for a Sankey...

Regarding the implementation: yes, that makes sense. Perhaps such a from_matrix command could accept a dict specifying the titles (e.g. what is now done in nodes) and other parameters. Alternatively, I could provide a method build_floweaver_data which builds a JSON document or a dict that includes all the data for building a Sankey.
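
To sketch what such a build_floweaver_data export might return, here is a hedged example that bundles an already converted source/target/value table into a plain, JSON-serialisable dict. The dict keys, the function signature and the toy data are assumptions for illustration; neither pymrio nor floweaver defines this format.

```python
import json

import pandas as pd


def build_floweaver_data(flows, titles=None):
    """Hypothetical export: pack a long-format flow table into a plain dict."""
    return {
        "flows": flows[["source", "target", "value"]].to_dict(orient="records"),
        "sources": sorted(flows["source"].unique()),
        "targets": sorted(flows["target"].unique()),
        "titles": dict(titles or {}),
    }


# Made-up flows in source/target/value form.
flows = pd.DataFrame(
    {
        "source": ["reg1 - AtB", "reg1 - AtB", "reg2 - AtB"],
        "target": ["reg1 - C", "reg2 - C", "reg2 - C"],
        "value": [1.5, 2.5, 4.0],
    }
)

data = build_floweaver_data(
    flows, titles={"sources": "impact in", "targets": "consumption in"}
)
as_json = json.dumps(data)  # the same dict can be written to a file
restored = pd.DataFrame(data["flows"])  # and read back into a table
```

The receiving side would still need its own SankeyDefinition; this only covers the data hand-over.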

ricklupton commented 6 years ago

It'd be good to optimise this so it doesn't freeze up, but I think the main problem is it's too much detail to see what's going on anyway!

I think it'd be more useful with some more aggregated sector groups?

regions = list(flows.r1.unique())
sector_groups = [
    ('Sector group 1', ['AtB', 'C']),
    ('Sector group 2', ['15t16', '17t18']),
    ('Other', [s for s in flows.s1.unique() if s not in ['AtB', 'C', '15t16', '17t18']])
]
sector_groups_by_region = [
    ('%s / %s' % (r, k), ['%s-%s' % (r, s) for s in sectors_in_group])
    for r in regions 
    for k, sectors_in_group in sector_groups
]

sdd = fw.SankeyDefinition(
    nodes={'sources': fw.ProcessGroup(list(flows.source.unique()), fw.Partition.Simple('r1', regions), title='impact in'),
           'source_sectors': fw.Waypoint(fw.Partition.Simple('source', sector_groups_by_region)),
           'target_sectors': fw.Waypoint(fw.Partition.Simple('target', sector_groups_by_region)),
           'targets': fw.ProcessGroup(list(flows.target.unique()), fw.Partition.Simple('r2', regions), title='consumption in')},
    bundles=[fw.Bundle('sources', 'targets', waypoints=['source_sectors', 'target_sectors'])],
    ordering=[['sources'], ['source_sectors'], ['target_sectors'], ['targets']],
    flow_partition=fw.Partition.Simple('r1', regions))
fw.weave(sdd, flows).to_widget(width=800, height=400)

[Image: screenshot-2018-2-21 pymrio-tutorial-for-wiod]

Or if you don't want the two-stage grouping:

sdd = fw.SankeyDefinition(
    nodes={'sources': fw.ProcessGroup(list(flows.source.unique()), fw.Partition.Simple('source', sector_groups_by_region), title='impact in'),
           'targets': fw.ProcessGroup(list(flows.target.unique()), fw.Partition.Simple('target', sector_groups_by_region), title='consumption in')},
    bundles=[fw.Bundle('sources', 'targets')],
    ordering=[['sources'], ['targets']],
    flow_partition=fw.Partition.Simple('r1', regions))
fw.weave(sdd, flows).to_widget(width=700, height=500, margins=dict(left=200, right=200))

[Image: screenshot-2018-2-21 pymrio-tutorial-for-wiod 1]

Is that what you had in mind? Obviously my sector groups are not the right ones.

konstantinstadler / pymrio_article

Use floWeaver for Sankey diagrams #1