Open florisvdh opened 3 years ago
The `n2khab` functions currently take a default subdirectory which is data source specific. It is named after the data source code name, regardless of the data source version. For now this is the only approach which offers reproducible usage of the directory structure between code from different users, i.e. by following the `n2khab_data` directory structure documented by package vignette 020 and initialized by `fileman_folders()`.
This is perfectly compatible with the `read_` functions in `n2khab`. Indeed the user can change the function's `file` argument to input version-specific paths, although from that point we lose reproducibility between collaborators as long as no general directory structure is agreed or enforced for handling multiple versions within one `n2khab_data` folder. So currently one still needs to do some local, manual file handling and use the `file` argument to support multiple versions next to each other.
The general idea, for now, is that in production scripts only one version will be used, and this version is put in the default subdirectory (as in your example).
File and version handling is planned to be further taken care of by `n2khab`, but concurrent handling of multiple versions is still undiscussed. Essentially, functions will check whether the version specified by the user (and to be used) is indeed present in the default subdirectory; when the user did not specify a version, this check should refer to the latest version of the data source.
Handling multiple versions side by side, while not breaking the current approach (to maintain backward compatibility with existing scripts), could indeed happen by using extra version-specific folders. They could be put either inside the data source subdirectory (cf. in your example) [1], next to it by using version codes in the folder name [2], or in a dedicated folder at a higher level [3]. In all cases the official version codes should be used, to conform with version names at Zenodo (which can be retrieved to include in the `n2khab` package). Once such a directory structure is agreed and documented, we could readily adapt functions to also support version-specific subdirectories.
E.g. a version-checking function could also search for other locally available versions, in designated default locations, if the requested version is not present in the regular default subdirectory.
Examples of possible directory structure that we could choose:
[1]

    n2khab_data/
    ├── 10_raw/
    │   └── habitatmap/                           # (*)
    │       └── versions/
    │           └── habitatmap_2020/              # (**)
    └── 20_processed/
        └── habitatmap_stdized/                   # (*)
            └── versions/
                └── habitatmap_stdized_2020_v1/   # (**)

[2]

    n2khab_data/
    ├── 10_raw/
    │   ├── habitatmap/                       # (*)
    │   └── habitatmap_2020/                  # (**)
    └── 20_processed/
        ├── habitatmap_stdized/               # (*)
        └── habitatmap_stdized_2020_v1/       # (**)

[3]

    n2khab_data/
    ├── 10_raw/
    │   ├── _versions/
    │   │   └── habitatmap/
    │   │       └── habitatmap_2020/              # (**)
    │   └── habitatmap/                           # (*)
    └── 20_processed/
        ├── _versions/
        │   └── habitatmap_stdized/
        │       └── habitatmap_stdized_2020_v1/   # (**)
        └── habitatmap_stdized/                   # (*)

(*) has data source files of some version; used by default

(**) has data source files of a fixed version; will be used if it matches the requested version and if the latter is not available in the default location
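To make the lookup rule in the legend concrete, here is a minimal base R sketch. It is not part of `n2khab`: the function names `versioned_dir()` and `resolve_source_dir()`, their arguments, and the `default_version` input are hypothetical. In particular, how the version present in the default folder would be detected is not settled in this thread, so it is simply passed in. The version codes are only illustrative.

```r
# Hypothetical helper: path of the version-specific folder for one of the
# three candidate layouts (1, 2 or 3, matching [1], [2] and [3]).
versioned_dir <- function(root, stage, source, version_code, layout = 3) {
  switch(
    as.character(layout),
    "1" = file.path(root, stage, source, "versions", version_code),
    "2" = file.path(root, stage, version_code),
    "3" = file.path(root, stage, "_versions", source, version_code),
    stop("layout must be 1, 2 or 3")
  )
}

# Hypothetical resolver following the (*) / (**) legend:
# use the default folder if it holds the requested version, otherwise fall
# back to a matching version-specific folder.
# ASSUMPTION: `default_version` (the version currently stored in the default
# folder) is supplied by the caller; its detection is left open here.
resolve_source_dir <- function(root, stage, source, version_code,
                               default_version, layout = 3) {
  default_dir <- file.path(root, stage, source)
  if (identical(version_code, default_version) && dir.exists(default_dir)) {
    return(default_dir)
  }
  fallback <- versioned_dir(root, stage, source, version_code, layout)
  if (dir.exists(fallback)) {
    return(fallback)
  }
  stop("Version '", version_code, "' of '", source, "' not found locally.")
}

# Usage on a throwaway directory tree that mimics layout [3]
root <- file.path(tempfile(), "n2khab_data")
dir.create(file.path(root, "10_raw", "habitatmap"), recursive = TRUE)
dir.create(
  file.path(root, "10_raw", "_versions", "habitatmap", "habitatmap_2020"),
  recursive = TRUE
)

# The default folder is assumed to hold another version, so the
# version-specific folder under `_versions/` is returned.
resolve_source_dir(root, "10_raw", "habitatmap", "habitatmap_2020",
                   default_version = "habitatmap_2018")
```

Because the layout is a single argument, switching between [1], [2] and [3] only changes where the fallback is searched; the default location, and hence existing scripts, stay untouched.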
All (and more) could be supported, but we must choose one. I like [3] best, because it keeps versions most separate from the current structure, making clear that both options are available. I'd rather not mix data files with version-specific folders as in [1]. The default data source folder should just contain a chosen version of a data source (other contents may appear as if they were part of the data source as well). Further suggestions are welcome of course.
_Originally posted by @cecileherr in https://github.com/inbo/n2khab-preprocessing/pull/50#discussion_r576740409_