Improve and rename persistentID to trackID

lfcnassif commented 2 years ago

This is an ID that doesn't change between different runs used when resuming processing to skip already processed items. It is built hashing different concatenated IDs: path, idInDataSource (eg. sleuthID, ad1ID, ufdrID), subitemId, parentContainerPersistentID.

Current name isn't intuitive. Although it is not unique across different cases (it is not an UUID), changing it to globalID seems more user friendly. Any other suggestion?

hauck-jvsh commented 2 years ago

Just let me if you change this name, as it is used in the SARD project.

lfcnassif commented 2 years ago

Just saw we are not using the datasource UUID in the computation. If we include it, this ID would really work like an UUID and will be unique across cases, that would be great. Renaming to globalID then will totally make sense.

lfcnassif commented 2 years ago

parentContainerPersistentID

Just an explanation why this is used. We can have 2 different files with this same path (one allocated and other deleted): /root/a/b.zip/c/d.txt /root/a/b.zip/c/d.txt

The two d.txt files have no idInDatasource (they are subitems from zip, they aren't allocated) and their subitemId could be equal if they were extracted from 2 different b.zip files (one allocated and other deleted). So b.zip globalID (that uses b.zip IdInDatasource) must be used in the computation of d.txt globalID

lfcnassif commented 2 years ago

Just saw we are not using the datasource UUID in the computation. If we include it, this ID would really work like an UUID and will be unique across cases, that would be great.

Thinking better about this, do we need an "UUID" for items in iped? Would it be useful in multicases? This change would make #918 very difficult, I doubt users will know/remember they need to specify the evidence UUID when re-processing cases to import old bookmarks later. Maybe this change to include the evidence UUID in globalID computation can be made just into ElasticSearchTask, what do you think @hauck-jvsh?

hauck-jvsh commented 2 years ago

I think it could be used only in the elastic ID, I think you could also maintain the persistentID in elastic just not as _id field which must be unique.

lfcnassif commented 2 years ago

I think you could also maintain the persistentID in elastic just not as _id field which must be unique.

This was done to make the --continue option work when resuming a processing to ElasticSearch instead of having to delete a remote index and start the processing from beginning again. I think including the evidence UUID into persistentID/globalID computation should be enough to avoid _id conflicts between elastic cases.

lfcnassif commented 2 years ago

@hauck-jvsh what field are you using to store bookmarks into in elastic, _id?

edit: I mean, to correlate bookmarks to items?

hauck-jvsh commented 2 years ago

Currently I'm using the _id just to find the item and then set a new metadata with the bookmark in the item.

lfcnassif commented 2 years ago

@hauck-jvsh, I changed the attribute names persistentId->globalId, parentPersistentId->parentGlobalId, parentContainerPersistentId->containerGlobalId, and also ElasticSearchTask contentPersistentId -> contentGlobalId to follow the new naming convention.

hauck-jvsh commented 2 years ago

After that commit an error is occurring when processing cases, see the log file attached. IPED-2022-01-21-10-25-00.log

lfcnassif commented 2 years ago

Thanks @hauck-jvsh, I'll take a look.

Actually I'm still not convinced about the new globalID attribute name, since it could repeat across cases without including the evidenceUUID in the computation. As you said, we can create a real UUID for items for possible future use in a new attribute (maybe using the globalID name), I like this idea.

But about persistentID renaming, I thought about more options: fixedID, constantID, constID. What do you think? @tc-wleite do have any suggestion?

wladimirleite commented 2 years ago

After that commit an error is occurring when processing cases, see the log file attached. IPED-2022-01-21-10-25-00.log

Processing an E01 image worked, but when I tried to process a folder, got a similar exception here.

wladimirleite commented 2 years ago

But about persistentID renaming, I thought about more options: fixedID, constantID, constID. What do you think? @tc-wleite do have any suggestion?

I was following the discussions around this issue, but I am not sure what would be the best option.

lfcnassif commented 2 years ago

Processing an E01 image worked, but when I tried to process a folder, got a similar exception here.

edited: ok I'll revert that commit until I remember the full details when parentIdInDataSource is needed to compute persistentID, actually I would like to remove that property, that property makes things more complex, but I don't remember why it was implemented at first...

lfcnassif commented 2 years ago

I was following the discussions around this issue, but I am not sure what would be the best option.

Apart from naming, do you think this property should be unique across cases (not needed by iped today) or should repeat when processing the same evidence (so #918 would be easy to implement)?

wladimirleite commented 2 years ago

Well, #918 would be an useful feature. So both properties (unique across cases and repeatable) seem interesting, but is it simple to be implemented?! If it is not that complex, and doesn't add too much overhead (memory/ processing), it seems a good idea, even if these properties are not necessary right now.

lfcnassif commented 2 years ago

They are simple to implement, we could have both of them, let's do it. So, we need just to define the new name for persistentID (it's not friendly).

wladimirleite commented 2 years ago

So, we need just to define the new name for persistentID (it's not friendly).

Wouldn't it be unique across cases now? (so globalId would make sense)

lfcnassif commented 2 years ago

Wouldn't it be unique across cases now? (so globalId would make sense)

No, I didn't change the computation logic, that's why I'm thinking to revert globalID name or to rename to something else...

lfcnassif commented 2 years ago

For now, I'm inclined to constantID

lfcnassif commented 2 years ago

Another option: longID

lfcnassif commented 2 years ago

Ruback's suggestion: trackID or trackingID, I liked it.

hauck-jvsh commented 2 years ago

So If I understand correctly, there will be two properties, globalId and trackingId?

lfcnassif commented 2 years ago

So If I understand correctly, there will be two properties, globalId and trackingId?

I'm going to rename the original persistentId to trackId. We can add a globalID, it is simple.

I had to do a number of commits to completely avoid and protect against using the item ID before it could be possibly changed/updated. This could happen when subitems are committed without their parents, pointing to an inexistent parentID. When parents are processed again, their ID could be updated to the previous value, keeping the correct parent-child relationships. And item IDs were being used in a number of places before this checking and possible ID updating logic...

Other approaches would be:

updating parentIDs of committed subitems: currently this would be a costly operation without #24 where content will be indexed separately from properties. This option could be revisited later after #24;
deleting from index all items pointing to non existent parents and processing all of them again. This would avoid this confusing ID updating logic, but would take longer when resuming a processing. I decided to keep current behavior and resume as fast as possible;

lfcnassif commented 2 years ago

The scenario above could also happen with default file system items, files inside a directory could be committed and their parent folder not, depending on threading schedule.

So, all items must have a trackID and parentTrackID to allow checking the parent-child relationships, maybe parentId (int) could be completely replaced by parentTrackId in all places in the future, maybe this would make ID updating not needed anymore.

hauck-jvsh commented 2 years ago

I like this approach, it will simply the process a lot. The only draw back is the length of the trackID.

lfcnassif commented 2 years ago

There is also the multivalued parentIds property (different from parentId) with all item parents, used to allow fast filtering on file tree. I don't like it very much, because it is redundant and duplicates information. I didn't tested yet the file tree performance with a recursive solution using just the parentId field, maybe keeping it on memory would not have a large performance impact...

hauck-jvsh commented 2 years ago

There is also the multivalued parentIds property (different from parentId) with all item parents, used to allow fast filtering on file tree

I also use it to allow filtering using the file tree in the web interface.

lfcnassif commented 2 years ago

Have you tested an implementation with just parentId, right? Did it have a noticeable performance impact?

hauck-jvsh commented 2 years ago

Have you tested an implementation with just parentId, right? Did it have a noticeable performance impact?

I couldn't make the searches, because I have to filter items that has in their parentdIds the ids of the selected items.

lfcnassif commented 2 years ago

I couldn't make the searches, because I have to filter items that has in their parentdIds the ids of the selected items.

I see, this would need some recursive search, possibly Elastic doesn't have a support for that, but we could try to implement this inside iped...

sepinf-inc / IPED

Improve and rename persistentID to trackID #784