ubc-systopia / Indaleko

Indaleko Project
GNU Affero General Public License v3.0

Reviewing Path and URI Creation in the Windows Indexer #20

Open hadisinaee opened 7 months ago

hadisinaee commented 7 months ago

When I was rewriting the indexer for the Mac machine, I reused build_stat_dict from the Windows indexer. However, the result of indexing a test folder was not as expected. The following shows those results. The first command displays the output of the tree command in my test directory. The second prints two fields from the indexer output: Path and URI. As shown, the URI field is incorrect (ignore the jsonl file, because it was created after the indexing and so does not reflect the indexing results). For example, the first line suggests that temp is a subfolder of temp, which is not correct.

[Screenshot, 2024-01-20: tree output of the test directory alongside the Path and URI fields printed from the indexer output]

I'm not sure whether it also produces correct indexing data on Windows. If it does, we can close this issue. I just want to bring this case to our attention.

I fixed this in my pull request #19:

https://github.com/ubc-systopia/Indaleko/blob/3b00720d41f0554496036bdddffe559baf1245e0/IndalekoMacLocalIndexer.py#L89-L90

fsgeek commented 7 months ago

URI construction is one area in which the Windows indexer differs, since I use volume GUID based URIs instead of path based ones. The disadvantage of path based URIs is that they depend upon the same mount point being used. That's the norm for some volumes, but for removable storage it's not a good assumption.

        stat_dict['Name'] = name
        stat_dict['Path'] = root
        # Recompute the volume GUID URI only when the drive letter changes;
        # last_drive/last_uri cache the result across iterations of the walk.
        if last_drive != os.path.splitdrive(root)[0][0].upper():
            last_drive = os.path.splitdrive(root)[0][0].upper()
            last_uri = self.convert_windows_path_to_guid_uri(root)
            assert last_uri.startswith('\\\\?\\Volume{'), \
                f'last_uri {last_uri} does not start with \\\\?\\Volume{{'
        # URI = volume GUID prefix + drive-relative directory path + file name.
        stat_dict['URI'] = os.path.join(last_uri, os.path.splitdrive(root)[1], name)

This is the equivalent path construction in Windows. It is quite different, but ignoring the bits dealing with drive letter mapping, it looks equivalent to me. The complicated "last URI" logic deals with the drive letter, and os.path.splitdrive on Windows separates a path into a (drive, tail) pair, e.g. ('C:', '\\temp\\file.txt').
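
For reference, here is a minimal sketch of how a helper like convert_windows_path_to_guid_uri could be built on the Win32 API (GetVolumeNameForVolumeMountPointW); this is my own illustration, not the project's implementation, and it only runs on Windows:

import ctypes
import os

def drive_to_volume_guid_uri(path: str) -> str:
    r'''Map a Windows path (e.g. r'C:\temp') to the volume GUID path of
    its drive, of the form \\?\Volume{...}\ . Sketch only: the function
    name and error handling are assumptions, not project code.'''
    mount_point = os.path.splitdrive(path)[0] + '\\'   # API wants 'C:\' form
    buf = ctypes.create_unicode_buffer(50)             # 49-char GUID path + NUL
    if not ctypes.windll.kernel32.GetVolumeNameForVolumeMountPointW(
            mount_point, buf, len(buf)):
        raise ctypes.WinError()
    return buf.value                                   # e.g. '\\?\Volume{...}\'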

fsgeek commented 7 months ago

Validating the index output should just be a matter of trying to open the files with the URI (though I haven't tried the URI on Windows through the Win32 API, it will work via the native OS API, assuming it is valid). We could do that in the ingester, but it would slow it down.

For the ingester, we'd need to define what "validate" means here - are we validating the data that gets uploaded into the database? If so, what's an efficient way of doing this?

I ask this question because "validating" has two obvious cases that may be at issue:

(1) ensuring that everything that could be indexed is indexed.
(2) ensuring that what is indexed contains correct metadata.

For example, with (1) I had some files that couldn't be indexed because they weren't accessible without privileges. I'm fine with that because if I can't access them with my credentials, they aren't really of interest to me. I logged those failures but I didn't review them all.

For (2) the challenge seems to be that we are indexing a dynamic dataset, so we don't want to flag things that have legitimately changed as failures, for example. It would also need to be very efficient.

One thought I had is that if we build a local ingester that does the checksum computations, it would have to open the file to read its data and compute the checksums. That's a pretty decent validation that the metadata from the indexer is, in fact, valid, since an invalid entry should lead to a failure when opening the file.
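
As a rough illustration of that idea (a sketch, not project code), the checksum pass doubles as a validation pass because an invalid or stale record fails at open:

import hashlib

def checksum_and_validate(uri: str, chunk_size: int = 1 << 20):
    '''Compute SHA-256 for the file at uri; treat an unopenable entry
    as an invalid indexer record. Sketch only, not project code.'''
    digest = hashlib.sha256()
    try:
        with open(uri, 'rb') as reader:
            while chunk := reader.read(chunk_size):
                digest.update(chunk)
    except OSError as error:
        return None, error           # open failed: the record is suspect
    return digest.hexdigest(), None  # open succeeded: URI (still) resolves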

fsgeek commented 7 months ago

I just checked. It turns out the volume GUID URIs do work (with a sample size of one):

import os

def main():
    '''
    Test this URI to see if I can open it via the Python file open function
    (probably not, will require native API?). The URI comes from an indexer
    output record:
    "URI": "\\\\?\\Volume{e069ddd9-51ad-400c-bccd-d5433aed7ea7}\\old-system-image-2023-09-05\\Program Files\\Adobe\\Acrobat DC\\Acrobat\\plug_ins\\PaperCapture\\iDRS15\\OCRResources\\bas.ilex", "Indexer": "0793b4d5-e549-4cb6-8177-020a738b66b7", "Volume GUID": "e069ddd9-51ad-400c-bccd-d5433aed7ea7"}
    '''
    test_file = r'\\?\Volume{e069ddd9-51ad-400c-bccd-d5433aed7ea7}\old-system-image-2023-09-05\Program Files\Adobe\Acrobat DC\Acrobat\plug_ins\PaperCapture\iDRS15\OCRResources\bas.ilex'
    if os.path.exists(test_file):
        print('File exists')
        with open(test_file, 'rb') as reader:
            data = reader.read()
            print('Data length is: ', len(data))

if __name__ == '__main__':
    main()

The path name came from one of the indexer output files. When I ran the test for this I got:

PS C:\Users\TonyMason\source\repos\indaleko-test> python .\scratch.py
File exists
Data length is:  505172
PS C:\Users\TonyMason\source\repos\indaleko-test>

This is encouraging and should make building this for Windows more straightforward.

hadisinaee commented 7 months ago

> Validating the index output should just be a matter of trying to open the files with the URI (though I haven't tried the URI on Windows through the Win32 API, it will work via the native OS API, assuming it is valid). We could do that in the ingester, but it would slow it down.

Ah, I see. I thought ingesters might be running on systems other than the one the indexer is running on. If the ingesters can run on the same system as the indexers, then we could do it. While it may slow down the entire system, we can sample the data and test just a few entries to ensure those files exist. Since we are using os.walk, testing a few of the indexed entries should be sufficient.
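
Something along these lines (a sketch, not project code; it assumes the indexer's .jsonl output with one JSON object per line and a 'URI' field, as in the record shown earlier):

import json
import os
import random

def sample_uri_check(jsonl_path: str, sample_size: int = 100) -> list:
    '''Randomly sample records from the indexer output and report any
    whose URI no longer resolves on this machine.'''
    with open(jsonl_path, 'r', encoding='utf-8') as reader:
        records = [json.loads(line) for line in reader if line.strip()]
    sample = random.sample(records, min(sample_size, len(records)))
    return [r['URI'] for r in sample if not os.path.exists(r['URI'])]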

> For the ingester, we'd need to define what "validate" means here - are we validating the data that gets uploaded into the database? If so, what's an efficient way of doing this? For (2) the challenge seems to be that we are indexing a dynamic dataset, so we don't want to flag things that have legitimately changed as failures, for example. It would also need to be very efficient.

My suggestion is to collect validation checks while indexing, for the ingester to examine after ingestion. For example, we can randomly pick a folder and count the number of files and folders inside it (those we could index, not the ones we couldn't). Then, after the ingestion phase, we query the database to see if they are all present. We could apply the same approach to relationships (in both directions). We don't need to repeat this process for all files or folders, just for a few; something like the sketch below.
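
A sketch of what I mean (the names and the count_children callable are hypothetical, not project code):

import os
import random

def collect_folder_checks(root: str, sample_rate: float = 0.01) -> list:
    '''While walking the tree, record expected child counts for a random
    sample of folders so they can be verified after ingestion.'''
    checks = []
    for dirpath, dirnames, filenames in os.walk(root):
        if random.random() < sample_rate:
            checks.append({'Path': dirpath,
                           'dirs': len(dirnames),
                           'files': len(filenames)})
    return checks

def verify_folder_checks(checks: list, count_children) -> list:
    '''count_children(path) is a hypothetical callable that asks the
    database how many ingested children a folder has; returns the
    checks that don't match.'''
    return [c for c in checks
            if count_children(c['Path']) != c['dirs'] + c['files']]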

> One thought I had is that if we build a local ingester that does the checksum computations, it would have to open the file to read its data and compute the checksums. That's a pretty decent validation that the metadata from the indexer is, in fact, valid, since an invalid entry should lead to a failure when opening the file.

Certainly, that could be another test. We can perform it on a subset of files, assuming the ingester has access to the files (running on the same machine).

fsgeek commented 7 months ago

> Ah, I see. I thought ingesters might be running on systems other than the one the indexer is running on. If the ingesters can run on the same system as the indexers, then we could do it. While it may slow down the entire system, we can sample the data and test just a few entries to ensure those files exist. Since we are using os.walk, testing a few of the indexed entries should be sufficient.

The local ingester can't really run anywhere else, since right now it adds the machine configuration to the indexed metadata. I agree that this isn't a restriction for all ingesters, but any that need to interact with the original files will need to be local to a machine that has access to them.

This will become interesting for cloud ingesters, because some of the operations I can do locally (like computing checksums on files, or doing semantic data extraction) are not so practical when the files all have to be fetched from remote storage.

> My suggestion is to collect validation checks while indexing, for the ingester to examine after ingestion. For example, we can randomly pick a folder and count the number of files and folders inside it (those we could index, not the ones we couldn't). Then, after the ingestion phase, we query the database to see if they are all present. We could apply the same approach to relationships (in both directions). We don't need to repeat this process for all files or folders, just for a few.

What I had missed was your concern that the edge collection is somehow broken. I haven't really done anything with the edge collection yet; I just wanted to make sure I had a model for gathering and storing that information. Even the searches that I did haven't involved looking through relationships yet.

Of course, sample validation will ensure that the basic logic works, but it won't necessarily detect more subtle or complex issues should they arise. For example, I don't expect to ever reconstruct a root-to-branch traversal (or vice versa, since we have the inverse), because the index isn't a file system and the hierarchical structure isn't really that interesting. In the end it might be good to have both a simple test mechanism to detect when we've broken something and a more robust model for validating that the database is consistent.
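
As one example of a more robust check, the inverse relationship gives us something to verify: every containment edge in one direction should have its mirror in the other. A sketch (the edge shape and names here are my assumptions, not the project's schema):

def check_inverse_edges(contains, contained_by):
    '''Verify that every (parent, child) edge has the matching
    (child, parent) edge in the inverse collection. Edges are modeled
    here as iterables of (from_id, to_id) pairs.'''
    forward = set(contains)
    inverse = {(dst, src) for (src, dst) in contained_by}
    missing_inverse = forward - inverse   # contains with no contained-by
    missing_forward = inverse - forward   # contained-by with no contains
    return missing_inverse, missing_forward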