talend-spatial / workspace-metadata-crawler

Automatic geospatial data inventory with Talend Spatial

Metadata / identifier strategy / Avoid building the same identifier #1

Closed fxprunayre closed 9 years ago

fxprunayre commented 11 years ago

The current strategy is to build the metadata identifier from the file name. If two files have the same name, they produce the same record identifier. This was done on purpose, so that a metadata record can be updated on a future scan.

In some cases there will be duplicate file names, and another strategy should be used.

Options: 1) Build on the file path (but if the file location changes, a new identifier will be generated) 2) ?

mcoudert commented 11 years ago

For some raster products, the file name is always the same, but the parent folder identifies the product. Could that be an alternative for such products?
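To make the parent-folder idea concrete, here is a minimal sketch of an identifier seed built from the parent folder name plus the file name. The class and method names (`ParentFolderSeed`, `seedFor`) are hypothetical and are not part of the crawler; the example paths are made up for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not taken from the crawler code base.
final class ParentFolderSeed {

    // Builds "<parent folder name>/<file name>" so that files with the same
    // name in different product folders get different seeds.
    // Assumes the path has a file name component (i.e. it is not a root).
    static String seedFor(Path file) {
        Path parent = file.getParent();
        Path parentName = (parent == null) ? null : parent.getFileName();
        String prefix = (parentName == null) ? "" : parentName.toString() + "/";
        return prefix + file.getFileName();
    }

    public static void main(String[] args) {
        // Made-up example paths: same file name, different product folders.
        System.out.println(seedFor(Paths.get("data", "productA", "image.tif")));
        System.out.println(seedFor(Paths.get("data", "productB", "image.tif")));
    }
}
```

Such a seed stays stable if the whole product folder is moved, but it changes if the folder itself is renamed.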

sppigot commented 10 years ago

Yep - currently the UUID of a file is generated from the SHA of the filename. It would be nice if the SHA included the date of modification as well (since some files will have the same name). Currently the title is the filename, but the directory path often includes useful metadata, so the directory path (minus the context data_dir) would be a better title for the metadata record.
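A rough sketch of both suggestions, under assumptions: the crawler's actual hashing code is not shown in this thread, so the SHA-1 truncation below is only one plausible way to turn a digest into a UUID, and the names `RecordIdSketch`, `idFor` and `titleFor` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

// Hypothetical sketch, not the crawler's implementation.
final class RecordIdSketch {

    // Assumption: derive the UUID from the first 16 bytes of a SHA-1 digest.
    // The result is deterministic but carries no RFC 4122 version bits.
    static UUID uuidFromSha1(String seed) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(seed.getBytes(StandardCharsets.UTF_8));
            ByteBuffer buf = ByteBuffer.wrap(digest, 0, 16);
            return new UUID(buf.getLong(), buf.getLong());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // Seed made of the file name plus its modification time in milliseconds.
    static UUID idFor(String fileName, long lastModifiedMillis) {
        return uuidFromSha1(fileName + "|" + lastModifiedMillis);
    }

    // Title = path of the file relative to the data directory.
    static String titleFor(Path dataDir, Path file) {
        return dataDir.relativize(file).toString().replace('\\', '/');
    }

    public static void main(String[] args) {
        // Made-up values for illustration.
        System.out.println(idFor("image.tif", 1400000000000L));
        System.out.println(titleFor(Paths.get("/data"),
                Paths.get("/data/raster/productA/image.tif")));
    }
}
```

One trade-off to keep in mind: once the modification date is part of the seed, editing a file produces a new identifier, so a rescan would no longer update the existing record for that file.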

Ultimately it might be better to have the crawler running as a thread that monitors the context data_dir using Java 7 watchdir and:

fxprunayre commented 9 years ago

Add an option to configure the strategy between the following 2 options:

The setting name is uuidCreatedFromFilePath (if true, which is the default, the strategy is based on the file path).
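As an illustration only, the switch could look roughly like the sketch below. It assumes the alternative to the file path is the file name (the original strategy described at the top of this issue), and it uses `UUID.nameUUIDFromBytes` (MD5-based) as a stand-in for whatever hashing the crawler really applies. Only the setting name `uuidCreatedFromFilePath` comes from this thread; `UuidStrategySketch` and `recordUuid` are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

// Hypothetical sketch, not the crawler's implementation.
final class UuidStrategySketch {

    // uuidCreatedFromFilePath = true  -> seed is the full file path
    // uuidCreatedFromFilePath = false -> seed is the file name only (assumed)
    static UUID recordUuid(Path file, boolean uuidCreatedFromFilePath) {
        String seed = uuidCreatedFromFilePath
                ? file.toString()
                : file.getFileName().toString();
        // Stand-in hashing: name-based (MD5) UUID from the JDK.
        return UUID.nameUUIDFromBytes(seed.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Made-up paths: same file name in two product folders.
        Path a = Paths.get("data", "productA", "image.tif");
        Path b = Paths.get("data", "productB", "image.tif");
        System.out.println(recordUuid(a, true).equals(recordUuid(b, true)));
        System.out.println(recordUuid(a, false).equals(recordUuid(b, false)));
    }
}
```

With the path-based setting the two files get distinct identifiers; with the name-based setting they collide, which is the case this issue set out to avoid.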

Eg.

fxprunayre commented 9 years ago

Closing, even though a more advanced strategy could be adopted.