steffenfritz / FileTrove

FileTrove indexes files and creates metadata from them.
https://filetrove.fritz.wtf
GNU Affero General Public License v3.0

[CHANGE] Align the Data Model with Wikidata Items like "file system object" #30

Closed dla-kramski closed 4 months ago

dla-kramski commented 6 months ago

Wikidata is playing an increasingly important role in digital preservation.

FileTrove should align its data model with this de facto standard (see https://www.wikidata.org/wiki/Q37787110 and related pages):

file system object (Q37787110)
    part of:
        file system (Q174989)
    has characteristic:
        path (Q817765)
            has part(s):
                path separator (Q64826685)
                filename (Q1144928)
                    has part(s):
                        filename extension (Q186157)

This could be implemented with the following changes:

Session table

Files and directories tables

This information could also be obtained afterwards by parsing the existing "filename" column, but ftrove has it all to hand at run time and can easily record it. The filename without the path may be particularly useful for tracking files with the same name across several sessions.

steffenfritz commented 6 months ago

This is a bigger change, but it makes sense and will be implemented. Added as a milestone for v1.0.0.

steffenfritz commented 5 months ago

Regarding the file system: Ross adds a +1 in #43 (I just don't want to lose it when that issue is closed).

steffenfritz commented 4 months ago

@dla-kramski I am not sure about the extension of a directory. As there is no definition of it and it really is just part of the name, how would you work with it? How would you define the boundaries, e.g. would "." separate the dirname and the extension? How should we handle a dirname that has more than one (arbitrarily chosen) separator? I understand that there are extensions like "SYSTEM" and such, but these are more semantics for the user and do not serve a technical purpose like MIME connotations.

So, I'd like to not add dirname extensions.

steffenfritz commented 4 months ago

Added or, where applicable, aligned with Wikidata

ToDo / To be discussed

steffenfritz commented 4 months ago

Go's filepath.Ext() is not working as expected:

filename: .hiddenfile
filepath: ../../testdata/.hiddenfile
filenameextension: .hiddenfile

The doc (https://pkg.go.dev/path/filepath#Ext) says: "Ext returns the file name extension used by path. The extension is the suffix beginning at the final dot in the final element of path; it is empty if there is no dot."

As the example shows, this is not a perfect approach, as every hidden file on Linux/Unix without a dot extension will have its own name as its extension. There must also be a condition that the dot is not the first character of the file name.

steffenfritz commented 4 months ago

https://github.com/golang/go/issues/66814

steffenfritz commented 4 months ago

golang/go#66814

As Go will not change filepath.Ext(), I added a workaround.
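The workaround itself is not shown in this thread. The following is only a sketch of one way to apply the condition described above (a dot in first position does not start an extension). The helper name extIgnoringLeadingDot is hypothetical and is not taken from the FileTrove code base.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// extIgnoringLeadingDot is a hypothetical helper for illustration.
// It behaves like filepath.Ext, except that it returns "" when the
// only dot in the file name is the leading one (a Unix hidden file).
func extIgnoringLeadingDot(path string) string {
	base := filepath.Base(path)
	ext := filepath.Ext(base)
	// If the "extension" is the whole name, the final dot is the
	// first character, so there is no real extension.
	if ext == base {
		return ""
	}
	return ext
}

func main() {
	for _, p := range []string{
		"../../testdata/.hiddenfile", // hidden file, no extension
		"../../testdata/.hidden.txt", // hidden file with an extension
		"../../testdata/report.pdf",  // regular file
		"../../testdata/README",      // no dot at all
	} {
		fmt.Printf("%-28s stdlib=%q sketch=%q\n",
			p, filepath.Ext(p), extIgnoringLeadingDot(p))
	}
}
```

The sketch only covers the hidden-file case discussed here. Names such as "file." still yield "." as with filepath.Ext, and the special element ".." is not handled separately.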

steffenfritz commented 4 months ago

Regarding filesystem:

Determining the file system automatically would require significant changes. In addition, ftrove might then have to be run as root/SYSTEM, which is not desirable. I am considering introducing a flag that allows users to specify the file system manually.
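To make the idea of a user-supplied value concrete, here is a minimal sketch using Go's standard flag package. The flag name -filesystem, the helper parseFileSystem and the "unknown" fallback are all assumptions for illustration. No such flag is confirmed in this thread, and the sketch does not describe ftrove's actual command line.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseFileSystem reads a hypothetical -filesystem flag from args.
// It returns an empty string when the user did not supply a value.
func parseFileSystem(args []string) (string, error) {
	fs := flag.NewFlagSet("ftrove-sketch", flag.ContinueOnError)
	name := fs.String("filesystem", "",
		"file system of the indexed volume, supplied by the user (e.g. ext4, NTFS)")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *name, nil
}

func main() {
	name, err := parseFileSystem(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	if name == "" {
		name = "unknown" // nothing is detected automatically
	}
	fmt.Println("file system recorded for this session:", name)
}
```

Taking the value from the user avoids any privileged system call, at the cost of trusting whatever string is supplied.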

dla-kramski commented 4 months ago

> @dla-kramski I am not sure about the extension of a directory. As there is no definition of it and it really is just part of the name, how would you work with it? How would you define the boundaries, e.g. would "." separate the dirname and the extension? How should we handle a dirname that has more than one (arbitrarily chosen) separator? I understand that there are extensions like "SYSTEM" and such, but these are more semantics for the user and do not serve a technical purpose like MIME connotations.
>
> So, I'd like to not add dirname extensions.

On second thought, I'm inclined to agree.

steffenfritz commented 4 months ago