opentraveldata / geobases

Data services and visualization
http://opentraveldata.github.com/geobases/

A word on join clauses #10

Closed alexprengere closed 11 years ago

alexprengere commented 11 years ago

I have been working on join clauses for a few days, and I think this is coming to an end. Here is some information about it.

Join can be specified in the configuration like this (example from ori_por):

ori_por:
    subdelimiters :
        tvl_por_list     : ','
    join:
        - fields: country_code
          with  : [countries, code]
        # autojoin!
        - fields: tvl_por_list
          with  : [ori_por, iata_code]

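As you can see, a join is a list of mappings with two keys: fields, the field (or fields) of the current base to join on, and with, the external base and the field of that base to match against.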

A join is possible on the same base, and on multiple fields at once. It is even compatible with subdelimiters. The Python API has been slightly changed to seamlessly integrate this new notion into the get method:

>>> from GeoBases import GeoBase
>>> g = GeoBase('ori_por', verbose=False)
>>> # usual call, no join
>>> g.get('NCE', 'country_code')
'FR'
>>> # joined with 'countries' base
>>> g.get('NCE', 'country_code', ext_field='name')
('France',)

In the last example, we get a tuple ('France',) because the join looks up every entry of the countries base whose code matches the country_code 'FR', and there could be several.
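
To illustrate why the result is a tuple, here is a minimal sketch of a join lookup written with plain dictionaries. It does not use GeoBases internals: the toy data and the `join_get` helper are hypothetical stand-ins, assumed only for illustration.

```python
# Toy stand-ins for the two bases (hypothetical data, not the real files).
ori_por = {"NCE": {"country_code": "FR"}}
countries = {
    "FR": {"code": "FR", "name": "France"},
    "DE": {"code": "DE", "name": "Germany"},
}


def join_get(base, key, field, ext_base, ext_key_field, ext_field):
    """Return ext_field for every ext_base row whose ext_key_field equals base[key][field]."""
    value = base[key][field]
    return tuple(
        row[ext_field]
        for row in ext_base.values()
        if row[ext_key_field] == value
    )


result = join_get(ori_por, "NCE", "country_code", countries, "code", "name")
print(result)  # ('France',)
```

Since nothing forces the joined field to be unique in the external base, returning all matches as a tuple keeps the result type the same whether there is one match or several.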

What happens with subdelimiters?

>>> from GeoBases import GeoBase
>>> g = GeoBase('ori_por', verbose=False)
>>> # usual call, no join (sub-delimited, gives a tuple)
>>> g.get('MOW', 'tvl_por_list')
('BKA', 'DME', 'JQF', 'JQO', 'SVO', 'VKO', 'XRK', 'ZKD')
>>> # autojoined with 'ori_por' base
>>> g.get('MOW', 'tvl_por_list', ext_field='name')
(('Bykovo Airport',),
 ('Domodedovo International Airport',),
 ('Moscow RU Savelovsky Railway S',),
 ('Moscow RU Belorussky Railway S',),
 ('Sheremetyevo International Airport',),
 ('Vnukovo International Airport',),
 ('Moscow RU Paveletsky Rail Stn',),
 ('Moscow RU Leningradsky Rail St',))

The join has been performed on every subdelimited value, hence the tuple-of-tuples structure. I will not detail the multiple-fields join: it works the same way as above, except that matching is done on several fields at once. Bonus [tricky]: if a join is made on several fields and several of them have subdelimiters, the Cartesian product of all possible values from the different subdelimited fields is computed.
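
The two behaviours above can be sketched with plain Python. This is an illustration under assumptions, not the GeoBases implementation: the toy base, the `matches` helper and the `field_a` / `field_b` values are all hypothetical.

```python
from itertools import product

# Toy base (hypothetical data); tvl_por_list is already split on its subdelimiter.
ori_por = {
    "MOW": {"iata_code": "MOW", "name": "Moscow", "tvl_por_list": ("DME", "SVO")},
    "DME": {"iata_code": "DME", "name": "Domodedovo International Airport", "tvl_por_list": ()},
    "SVO": {"iata_code": "SVO", "name": "Sheremetyevo International Airport", "tvl_por_list": ()},
}


def matches(base, key_field, value, ext_field):
    """All ext_field values of the rows whose key_field equals value."""
    return tuple(row[ext_field] for row in base.values() if row[key_field] == value)


# Join on one subdelimited field: one inner tuple per sub-value.
joined = tuple(
    matches(ori_por, "iata_code", code, "name")
    for code in ori_por["MOW"]["tvl_por_list"]
)
print(joined)
# (('Domodedovo International Airport',), ('Sheremetyevo International Airport',))

# Join on two subdelimited fields: every combination of sub-values is a candidate key.
field_a = ("A1", "A2")
field_b = ("B1", "B2", "B3")
combos = list(product(field_a, field_b))
print(len(combos))  # 6
```

The outer tuple follows the order of the sub-values, and each inner tuple holds all the matches for one sub-value (or one combination of sub-values in the multi-field case).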

The CLI integration is done: you can specify a join clause next to the header names with -i. For now this is useless (except for debugging), since you cannot specify external fields on get calls from the CLI.

Example with a one-column file read from stdin:

$ printf 'ORY\nCDG\n' | GeoBase -i " " 'origin{ori_por:iata_code}'

I plan to modify the map visualization to integrate these changes. The goal is some cleverness when the data has no geocode of its own but has fields joined to other bases that do: the visualization should adapt to those objects.

For example, the previous shell command run with --map does not display anything today. But since ORY and CDG are joined to ori_por, we could get their geocodes there and draw the objects in a specific way, depending on the topology (indeed, when you perform the join you may get tuples of geocodes on different fields).

Please keep in mind that these changes are on the develop branch, and I may change everything twice before it's released, so do not rely on those examples for production stuff.

alexprengere commented 11 years ago

I just added the possibility to specify external fields on get calls from CLI, using the syntax field:external_field.

This makes the following commands possible:

$ GeoBase PAR -s tvl_por_list tvl_por_list:name
tvl_por_list             ('BVA', 'CDG', 'JDP', ...
tvl_por_list:name        (('Beauvais-Tilles',), ('Paris Charles de Gaulle', ...

More interestingly, this can be combined with the CLI join clauses described above, using the syntax header{join_base:join_field}, which makes it very easy to extract data from external bases. Example with a one-column file:

$ cat data.csv
ORY
CDG
$ cat data.csv | GeoBase -i " " 'origin{ori_por:iata_code}' -s origin origin:name  
origin                  CDG                                     ORY
origin:name             ('Paris - Charles-de-Gaulle',)          ('Paris-Orly',)

This is available on the develop branch or in the latest GeoBasesDev package.
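
For readers wondering how the two syntaxes break down, here is a small sketch that splits header{join_base:join_field} and field:external_field into their parts. The regular expression and the `parse_header` / `parse_show` helpers are hypothetical; they only mirror the syntax shown above and are not taken from the GeoBases code.

```python
import re

# Hypothetical pattern for "header" or "header{join_base:join_field}".
HEADER_RE = re.compile(
    r"^(?P<header>[^{}]+)(?:\{(?P<base>[^:{}]+):(?P<field>[^{}]+)\})?$"
)


def parse_header(spec):
    """Split a header spec into (header, join_base, join_field); join parts may be None."""
    m = HEADER_RE.match(spec)
    if m is None:
        raise ValueError("bad header spec: %r" % spec)
    return m.group("header"), m.group("base"), m.group("field")


def parse_show(spec):
    """Split "field:external_field" into (field, external_field or None)."""
    field, _, ext = spec.partition(":")
    return field, ext or None


print(parse_header("origin{ori_por:iata_code}"))  # ('origin', 'ori_por', 'iata_code')
print(parse_header("origin"))                     # ('origin', None, None)
print(parse_show("origin:name"))                  # ('origin', 'name')
print(parse_show("origin"))                       # ('origin', None)
```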

alexprengere commented 11 years ago

I pushed some commits (mainly 2064d8d02c3fc814c53ee91c2d9a8e216c0b762f) to the develop branch. They handle the join visualization on a map. If the GeoBase object has no geocode support, it looks for geocode support on any joined field, then performs a Cartesian product of the different joined values.

Long story short, this can draw lines:

$ cat data.csv
NCE PAR
LYS BOD
$ cat data.csv | GeoBase -i " " "origin{ori_por:iata_code}/destin{ori_por:iata_code}" --map

The UI is still a bit beta, but the main idea is out there.
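
The geocode fallback described in this comment can be sketched roughly as follows. Everything here is an assumption made for illustration: the coordinates are approximate toy values, the `joined` mapping is not how ori_por actually models PAR, and `resolve` / `segments` are hypothetical helpers rather than GeoBases functions.

```python
from itertools import product

# Toy geocodes keyed by IATA code (hypothetical, approximate values).
geocodes = {
    "NCE": (43.66, 7.22),
    "LYS": (45.73, 5.08),
    "BOD": (44.83, -0.72),
    "CDG": (49.01, 2.55),
    "ORY": (48.72, 2.38),
}
# A code without its own geocode may join to several rows (toy mapping).
joined = {"PAR": ("CDG", "ORY")}


def resolve(code):
    """Geocodes reachable from a code: its own if known, else those of the rows it joins to."""
    if code in geocodes:
        return (geocodes[code],)
    return tuple(geocodes[c] for c in joined.get(code, ()) if c in geocodes)


def segments(row):
    """Cartesian product of the geocodes found for each field of a row."""
    return list(product(*(resolve(code) for code in row)))


rows = [("NCE", "PAR"), ("LYS", "BOD")]
for row in rows:
    print(row, segments(row))
```

In this toy setup the first row yields two segments, one per geocode joined to PAR, while the second row yields a single origin-to-destination segment.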