Closed alexprengere closed 11 years ago
Actually, the above format would not work: since paths is already either a list of sources or a string describing one source, we cannot also allow a list as the description of a single source.
This would remove the are-we-talking-about-one-path-or-several confusion:
source1:
    paths : local/source/file
source2:
    paths :
        - local/source/file
        - https://remote/source/failover
source3:
    paths :
        - archive:
            local/archive.zip
            file_in_archive
        - archive:
            https://remote/source/failover.zip
            file_in_archive
I implemented it with this syntax in recent commits on the develop branch (such as e28316b3b2a0513616e5bbe18c57b21d0de014a4 and 3edd07617822696432e6b99517d03307f9188f95):
source:
    paths:
        - archive : Por/GeoNames/MC.zip
          file    : MC.txt
        - archive : 'http://download.geonames.org/export/dump/MC.zip'
          file    : MC.txt
Zip archives are stored in the site-packages directory like any other data source. When an archive is used, it is uncompressed and the uncompressed file is kept (next to the archive, not in the local directory as is done for remote sources). On later uses, data is read from the uncompressed file, unless the data has been updated and the archive is more recent than the uncompressed file (detected with os.stat).
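The caching rule described above can be sketched roughly as follows. This is only an illustration, not the code from the develop branch: the function name extract_if_stale is hypothetical, and comparing st_mtime is an assumption about which os.stat field is used.

```python
import os
import zipfile


def extract_if_stale(archive_path, member):
    """Return the path of `member`, extracted next to `archive_path`.

    Hypothetical helper: the member is (re-)extracted only when no
    extracted copy exists, or when the archive is more recent than
    the extracted copy.
    """
    dest_dir = os.path.dirname(os.path.abspath(archive_path))
    target = os.path.join(dest_dir, member)

    if os.path.exists(target):
        # Assumption: "more recent" is decided by modification time.
        if os.stat(archive_path).st_mtime <= os.stat(target).st_mtime:
            return target  # the extracted copy is still current

    with zipfile.ZipFile(archive_path) as zf:
        # Extraction overwrites any older copy of the member.
        zf.extract(member, path=dest_dir)
    return target
```

With this rule, replacing Por/GeoNames/MC.zip with a newer download would trigger a fresh extraction of MC.txt on the next use, while repeated uses of an unchanged archive would read the already extracted file.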
The maintenance script now just updates files, except for the feature codes, where some processing is done. It should now be improved to display diffs between archives.
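One possible way to display such a diff, sketched in Python rather than in the shell of the existing script: read the same member from the old and the new archive and compare them line by line. The function name diff_member and the UTF-8 default are assumptions made for the example.

```python
import difflib
import zipfile


def diff_member(old_archive, new_archive, member, encoding="utf-8"):
    """Return unified-diff lines for `member` between two zip archives.

    Hypothetical helper; the result is empty when the member is identical.
    """
    def read_lines(archive):
        with zipfile.ZipFile(archive) as zf:
            text = zf.read(member).decode(encoding)
        return text.splitlines(keepends=True)

    return list(difflib.unified_diff(
        read_lines(old_archive),
        read_lines(new_archive),
        fromfile="%s:%s" % (old_archive, member),
        tofile="%s:%s" % (new_archive, member),
    ))
```

Printing "".join(diff_member("old/MC.zip", "new/MC.zip", "MC.txt")) would then show the changed lines, where the two archive paths are placeholders.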
I changed the convention a bit on the develop branch. It is now:
source:
    paths:
        - file   : Por/GeoNames/MC.zip
          extract: MC.txt
        - file   : 'http://download.geonames.org/export/dump/MC.zip'
          extract: MC.txt
        - file   : any/other/file.txt
        - another/file.txt
The change is that all paths elements are converted to dictionaries after parsing, and each has a file attribute.
This allows some refactoring of common operations. For example, we no longer have to check whether a paths element describes an archive or a simple file in order to download it; we just take the file attribute of the paths element.
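A minimal sketch of that normalization step, assuming the YAML has already been parsed into Python strings, lists and dictionaries. The function name normalize_paths is hypothetical and is not taken from the develop branch.

```python
def normalize_paths(paths):
    """Convert a parsed `paths` value into a list of dicts with a 'file' key.

    Hypothetical helper: accepts a single string or a list whose
    elements are either strings or dictionaries.
    """
    if isinstance(paths, str):
        paths = [paths]  # a single source given as a plain string

    normalized = []
    for element in paths:
        if isinstance(element, str):
            element = {"file": element}
        elif "file" not in element:
            raise ValueError("paths element has no 'file' key: %r" % (element,))
        normalized.append(dict(element))  # copy, leave the input untouched
    return normalized


paths = normalize_paths([
    {"file": "Por/GeoNames/MC.zip", "extract": "MC.txt"},
    {"file": "any/other/file.txt"},
    "another/file.txt",
])
# Every element can now be handled the same way, archive or not.
files = [element["file"] for element in paths]
```

After this step, code that downloads or locates a source only needs element["file"], and the optional extract key is looked at only where an archive has to be opened.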
In the same way that we added http URL support in Sources.yaml, we could allow compressed files/archives to be described directly in the configuration file. This would make maintenance lighter, since we would directly store, for example, the GeoNames files in their native format rather than an uncompressed version.
This would require updating the reader (easy) and updating the CheckDataUpdates.sh monitoring script to display diffs between the compressed files. This way, we could have versions of the package with no embedded data, only remote data fetched when necessary. The format still has to be determined in the Sources.yaml file, because one archive can contain several files. Currently I am thinking about something like this: