opentraveldata / geobases

Data services and visualization
http://opentraveldata.github.com/geobases/

zip/bzip support #6

Closed alexprengere closed 11 years ago

alexprengere commented 11 years ago

In the same way we added support for HTTP URLs in Sources.yaml, we could allow compressed files and archives to be described directly in the configuration file.

This would lighten maintenance, since we could store files such as the GeoNames dumps directly in their native format rather than as an uncompressed copy.

This would require updating the reader (easy) and updating the CheckDataUpdates.sh monitoring script to display diffs between the compressed files. This way, we could have versions of the package with no embedded data, only remote data fetched when necessary.

The format in the Sources.yaml file still has to be determined, because one archive can contain several files. Currently I am thinking about something like this:

source:
    paths :
        - local/source/file
        - https://remote/source/failover
        - [local/archive.zip file_in_archive]
        - [https://remote/source/failover.zip file_in_archive]

alexprengere commented 11 years ago

Actually, the above format would not work. Since paths is already either a list of sources or a string describing one source, we cannot also allow a list as the description of a single source.

The following would remove the confusion over whether we are talking about one path or several:

source1:
    paths : local/source/file

source2:
    paths :
        - local/source/file
        - https://remote/source/failover

source3:
    paths :
        - archive:
              local/archive.zip
              file_in_archive
        - archive:
              https://remote/source/failover.zip
              file_in_archive

alexprengere commented 11 years ago

I implemented it in recent commits on the develop branch (such as e28316b3b2a0513616e5bbe18c57b21d0de014a4 and 3edd07617822696432e6b99517d03307f9188f95), with this syntax:

source:
    paths:
        - archive : Por/GeoNames/MC.zip
          file    : MC.txt
        - archive : 'http://download.geonames.org/export/dump/MC.zip'
          file    : MC.txt

Zip archives are stored in site-packages like any other data source. When an archive is used, it is uncompressed and the uncompressed file is kept next to the archive (not in the local directory, as is done for remote sources). On later uses, data is read from the uncompressed file, unless the data has been updated and the archive is more recent than the uncompressed file (detected with os.stat).
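The caching rule described above can be sketched as follows. This is only an illustration, not the code from the commits: the function name `extract_if_stale` is hypothetical, and it assumes the extracted file sits in the same directory as the archive and that modification times are the freshness criterion.

```python
import os
import zipfile


def extract_if_stale(archive_path, member):
    """Return the path of `member`, extracted next to `archive_path`.

    Hypothetical helper: the member is (re-)extracted only when the
    extracted copy is missing or older than the archive, comparing
    modification times obtained with os.stat.
    """
    target_dir = os.path.dirname(os.path.abspath(archive_path))
    target = os.path.join(target_dir, member)

    if os.path.exists(target):
        if os.stat(target).st_mtime >= os.stat(archive_path).st_mtime:
            return target  # extracted copy is at least as recent as the archive

    # Missing or outdated copy: extract (overwrites any existing file).
    with zipfile.ZipFile(archive_path) as archive:
        archive.extract(member, target_dir)
    return target
```

Calling `extract_if_stale("Por/GeoNames/MC.zip", "MC.txt")` twice in a row would extract only once, until the archive is replaced by a newer one.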

The maintenance script now only updates files, except for the feature codes, where some processing is done. It should still be improved to display diffs between archives.
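One possible way to display a diff between two versions of an archive is to compare the same member file inside each. The sketch below is in Python rather than in the shell of CheckDataUpdates.sh, and the function name `diff_archive_member` and the UTF-8 assumption are mine, not part of the project.

```python
import difflib
import zipfile


def diff_archive_member(old_archive, new_archive, member):
    """Return unified-diff lines for `member` between two zip archives.

    Hypothetical helper; assumes the member is UTF-8 encoded text.
    """
    def read_lines(path):
        with zipfile.ZipFile(path) as archive:
            text = archive.read(member).decode("utf-8")
        return text.splitlines(keepends=True)

    return list(difflib.unified_diff(
        read_lines(old_archive),
        read_lines(new_archive),
        fromfile=f"{old_archive}:{member}",
        tofile=f"{new_archive}:{member}",
    ))
```

An empty result means the member is identical in both archives, so the script would have nothing to report.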

alexprengere commented 11 years ago

I changed the convention slightly on the develop branch. It is now:

source:
    paths:
        - file : Por/GeoNames/MC.zip
          extract: MC.txt
        - file : 'http://download.geonames.org/export/dump/MC.zip'
          extract: MC.txt
        - file : any/other/file.txt
        - another/file.txt

The change is that all paths elements are converted to dictionaries after parsing, and each has a file attribute. This allows some refactoring of common operations. For example, to download a paths element we no longer have to check whether it describes an archive or a simple file; we just take its file attribute.
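A minimal sketch of that normalization step, operating on the already-parsed YAML structure. The function name `normalize_paths` is hypothetical, and accepting a single dictionary outside a list is an assumption that the thread does not confirm.

```python
def normalize_paths(paths):
    """Convert a parsed `paths` value into a list of dicts with a 'file' key."""
    # A single source may be given as a plain string (as in source1 above).
    # Accepting a lone dict here is an assumption, not confirmed behavior.
    if isinstance(paths, (str, dict)):
        paths = [paths]

    normalized = []
    for element in paths:
        if isinstance(element, str):
            element = {"file": element}  # simple file or URL
        elif "file" not in element:
            raise ValueError(f"paths element has no 'file' key: {element!r}")
        normalized.append(dict(element))  # copy, leave the input untouched
    return normalized


paths = [
    {"file": "Por/GeoNames/MC.zip", "extract": "MC.txt"},
    {"file": "any/other/file.txt"},
    "another/file.txt",
]
for element in normalize_paths(paths):
    # Common operations only need the 'file' key, archive or not.
    print(element["file"])
```

After this step, only the code that actually opens the data needs to look at the optional extract key.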