inbo / n2khab

R package with preprocessing functions and standard reference data for Flemish Natura 2000 (N2K) habitat (HAB) analyses
https://inbo.github.io/n2khab
GNU General Public License v3.0

Handling multiple data source versions at the same time #113

Open florisvdh opened 3 years ago

florisvdh commented 3 years ago

A side comment (see it as something to keep in mind if we add this type of version control to n2khab): am I the only one who keeps several old versions of the maps locally? My file structure looks like this (it is probably not how it should be; this is a quick fix since there is no version control yet):

"10_raw/habitatmap/habitatmap.dbf"                  
"10_raw/habitatmap/habitatmap.lyr"                  
"10_raw/habitatmap/habitatmap.prj"                  
"10_raw/habitatmap/habitatmap.shp"                  
"10_raw/habitatmap/habitatmap.shx"                  
"10_raw/habitatmap/v2020_20201211.txt"              
"10_raw/habitatmap/versies/v2018/habitatmap.dbf"    
"10_raw/habitatmap/versies/v2018/habitatmap.lyr"    
"10_raw/habitatmap/versies/v2018/habitatmap.prj"    
"10_raw/habitatmap/versies/v2018/habitatmap.shp"    
"10_raw/habitatmap/versies/v2018/habitatmap.shx"    
"10_raw/habitatmap/versies/v2018/v2018_20190205.txt"

This is perfectly compatible with the read_ functions in n2khab (but not with this generate function, which does not matter: we only need to run it once and it is not a job for the standard user).

_Originally posted by @cecileherr in https://github.com/inbo/n2khab-preprocessing/pull/50#discussion_r576740409_

florisvdh commented 3 years ago

The n2khab functions currently take a default subdirectory which is data source specific. It is named after the data source code name, regardless of the data source version. For now this is the only approach that offers a reproducible directory structure across code from different users, i.e. by following the n2khab_data directory structure documented in package vignette 020 and initialized by fileman_folders().

> This is perfectly compatible with the read_ functions in n2khab

Indeed, the user can change the function's file argument to input version-specific paths, although from that point on we lose reproducibility between collaborators, as long as no general directory structure is agreed on or enforced for handling multiple versions within one n2khab_data folder. So currently one still needs to do some local, manual file handling and use the file argument to support multiple versions next to each other.
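As an illustration of that manual handling, here is a minimal base R sketch that builds a version-specific path following the ad hoc `versies/<version>` layout quoted above. The helper `versioned_file()` and its arguments are hypothetical and not part of n2khab. Whether a given read_ function expects the folder or the file itself in its file argument is not covered here.

```r
# Hypothetical helper (not part of n2khab): build the path to a data source
# file, either in the default subdirectory or in a locally archived version
# folder that follows the ad hoc "versies/<version>" layout.
versioned_file <- function(data_root,
                           version = NULL,
                           source = "habitatmap",
                           filename = "habitatmap.shp") {
  base <- file.path(data_root, "10_raw", source)
  if (is.null(version)) {
    # No version requested: use the default subdirectory
    return(file.path(base, filename))
  }
  file.path(base, "versies", version, filename)
}

versioned_file("n2khab_data")
versioned_file("n2khab_data", version = "v2018")
```

The resulting string could then be handed to the file argument of a read_ function. Because the `versies/` convention is purely local, a collaborator with a different layout would need to adapt the helper, which is exactly the reproducibility problem described above.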

The general idea, for now, is that in production scripts only one version will be used, and this version is put in the default subdirectory (as in your example).

File and version handling is planned to be taken care of further by n2khab, but concurrent handling of multiple versions has not been discussed yet. Essentially, functions will check whether the version specified by the user (and to be used) is indeed present in the default subdirectory; when the user did not specify a version, this check should refer to the latest version of the data source.

Handling multiple versions side by side, while not breaking the current approach (to maintain backward compatibility with existing scripts), could indeed happen by using extra version-specific folders. They could be put either inside the data source subdirectory (cf. your example) [1], next to it by using version codes in the folder name [2], or in a dedicated folder at a higher level [3]. In all cases the official version codes should be used, to conform with version names at Zenodo (which can be retrieved to include in the n2khab package). Once such a directory structure is agreed on and documented, we could readily adapt functions to also support version-specific subdirectories.

E.g. a version-checking function could also search for other locally available versions, in designated default locations, if the requested version is not present in the regular default subdirectory.
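A rough base R sketch of such a fallback lookup is given below. It is only an illustration: how n2khab would actually detect the version held by a folder is not settled in this thread, so the sketch assumes a one-line marker file named `VERSION` in each folder. The marker file, both function names and the version code `habitatmap_2018` are hypothetical.

```r
# ASSUMPTION: each folder holds a one-line marker file "VERSION" with the
# official version code. This detection mechanism is hypothetical.
folder_version <- function(dir) {
  marker <- file.path(dir, "VERSION")
  if (!file.exists(marker)) return(NA_character_)
  trimws(readLines(marker, n = 1L, warn = FALSE))
}

# Return the first folder holding the requested version: the default
# subdirectory is tried first, then the designated fallback locations.
find_version_dir <- function(default_dir, fallback_dirs, version) {
  for (dir in c(default_dir, fallback_dirs)) {
    if (identical(folder_version(dir), version)) return(dir)
  }
  stop("Version '", version, "' not found locally.", call. = FALSE)
}

# Usage in a temporary folder, with illustrative version codes
root <- file.path(tempdir(), "n2khab_data", "10_raw")
default_dir <- file.path(root, "habitatmap")
fixed_dir <- file.path(root, "_versions", "habitatmap", "habitatmap_2020")
dir.create(default_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(fixed_dir, recursive = TRUE, showWarnings = FALSE)
writeLines("habitatmap_2018", file.path(default_dir, "VERSION"))
writeLines("habitatmap_2020", file.path(fixed_dir, "VERSION"))

# The default folder holds another version, so the fallback folder is returned
find_version_dir(default_dir, fixed_dir, "habitatmap_2020")
```

Trying the default subdirectory first keeps existing scripts working unchanged. The version-specific folders only come into play when the default location does not hold the requested version.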

Examples of possible directory structures that we could choose:

[1]

n2khab_data/
    ├── 10_raw/
    │     └── habitatmap/  # (*)
    │           └── versions/
    │                 └── habitatmap_2020/  # (**)
    └── 20_processed/
          └── habitatmap_stdized/ # (*)
                └── versions/
                      └── habitatmap_stdized_2020_v1/  # (**)

[2]

n2khab_data/
    ├── 10_raw/
    │     ├── habitatmap/ # (*)
    │     └── habitatmap_2020/ # (**)
    └── 20_processed/
          ├── habitatmap_stdized/ # (*)
          └── habitatmap_stdized_2020_v1/ # (**)

[3]

n2khab_data/
    ├── 10_raw/
    │     ├── _versions/
    │     │      └── habitatmap/
    │     │             └── habitatmap_2020/ # (**)
    │     └── habitatmap/ # (*)
    └── 20_processed/
          ├── _versions/
          │      └── habitatmap_stdized/
          │             └── habitatmap_stdized_2020_v1/ # (**)
          └── habitatmap_stdized/ # (*)

(*) has data source files of some version - used by default

(**) has data source files of a fixed version - will be used if it matches the requested version and if the latter is not available in the default location
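To compare the options [1], [2] and [3] concretely, the location of a fixed-version folder under each layout can be expressed as a small base R function. This is a sketch only: `version_dir()` and its arguments are hypothetical names, and no such function exists in n2khab.

```r
# Hypothetical helper: folder of a fixed version (**) under each of the
# three proposed layouts. `stage` is e.g. "10_raw" or "20_processed".
version_dir <- function(data_root, stage, source, version, layout = 3L) {
  stage_dir <- file.path(data_root, stage)
  switch(
    as.character(layout),
    # [1] inside the data source subdirectory
    "1" = file.path(stage_dir, source, "versions", version),
    # [2] next to the data source subdirectory, named by version code
    "2" = file.path(stage_dir, version),
    # [3] in a dedicated folder at a higher level
    "3" = file.path(stage_dir, "_versions", source, version),
    stop("`layout` must be 1, 2 or 3.", call. = FALSE)
  )
}

# Paths for habitatmap_2020 under each layout
vapply(
  1:3,
  function(i) version_dir("n2khab_data", "10_raw", "habitatmap",
                          "habitatmap_2020", layout = i),
  character(1)
)
```

In all three cases the default folder (*) stays at `file.path(data_root, stage, source)`, so only the lookup of fixed versions differs between the options.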

All (and more) could be supported, but we must choose one. I like [3] best, because it separates the versioned folders most clearly from the current structure, making clear that both options are available. I'd rather not mix data files with version-specific folders as in [1]: the default data source folder should just contain a chosen version of a data source (other contents may appear as if they were part of the data source as well). Further suggestions are welcome, of course.