Closed lfcnassif closed 2 years ago
Just let me if you change this name, as it is used in the SARD project.
Just saw we are not using the datasource UUID in the computation. If we include it, this ID would really work like an UUID and will be unique across cases, that would be great. Renaming to globalID then will totally make sense.
parentContainerPersistentID
Just an explanation why this is used. We can have 2 different files with this same path (one allocated and other deleted): /root/a/b.zip/c/d.txt /root/a/b.zip/c/d.txt
The two d.txt files have no idInDatasource (they are subitems from zip, they aren't allocated) and their subitemId could be equal if they were extracted from 2 different b.zip files (one allocated and other deleted). So b.zip globalID (that uses b.zip IdInDatasource) must be used in the computation of d.txt globalID
Just saw we are not using the datasource UUID in the computation. If we include it, this ID would really work like an UUID and will be unique across cases, that would be great.
Thinking better about this, do we need an "UUID" for items in iped? Would it be useful in multicases? This change would make #918 very difficult, I doubt users will know/remember they need to specify the evidence UUID when re-processing cases to import old bookmarks later. Maybe this change to include the evidence UUID in globalID computation can be made just into ElasticSearchTask, what do you think @hauck-jvsh?
I think it could be used only in the elastic ID, I think you could also maintain the persistentID in elastic just not as _id field which must be unique.
I think you could also maintain the persistentID in elastic just not as _id field which must be unique.
This was done to make the --continue option work when resuming a processing to ElasticSearch instead of having to delete a remote index and start the processing from beginning again. I think including the evidence UUID into persistentID/globalID computation should be enough to avoid _id conflicts between elastic cases.
@hauck-jvsh what field are you using to store bookmarks into in elastic, _id?
edit: I mean, to correlate bookmarks to items?
Currently I'm using the _id just to find the item and then set a new metadata with the bookmark in the item.
@hauck-jvsh, I changed the attribute names persistentId->globalId, parentPersistentId->parentGlobalId, parentContainerPersistentId->containerGlobalId, and also ElasticSearchTask contentPersistentId -> contentGlobalId to follow the new naming convention.
After that commit an error is occurring when processing cases, see the log file attached. IPED-2022-01-21-10-25-00.log
Thanks @hauck-jvsh, I'll take a look.
Actually I'm still not convinced about the new globalID attribute name, since it could repeat across cases without including the evidenceUUID in the computation. As you said, we can create a real UUID for items for possible future use in a new attribute (maybe using the globalID name), I like this idea.
But about persistentID renaming, I thought about more options: fixedID, constantID, constID. What do you think? @tc-wleite do have any suggestion?
After that commit an error is occurring when processing cases, see the log file attached. IPED-2022-01-21-10-25-00.log
Processing an E01 image worked, but when I tried to process a folder, got a similar exception here.
But about persistentID renaming, I thought about more options: fixedID, constantID, constID. What do you think? @tc-wleite do have any suggestion?
I was following the discussions around this issue, but I am not sure what would be the best option.
Processing an E01 image worked, but when I tried to process a folder, got a similar exception here.
edited: ok I'll revert that commit until I remember the full details when parentIdInDataSource is needed to compute persistentID, actually I would like to remove that property, that property makes things more complex, but I don't remember why it was implemented at first...
I was following the discussions around this issue, but I am not sure what would be the best option.
Apart from naming, do you think this property should be unique across cases (not needed by iped today) or should repeat when processing the same evidence (so #918 would be easy to implement)?
Well, #918 would be an useful feature. So both properties (unique across cases and repeatable) seem interesting, but is it simple to be implemented?! If it is not that complex, and doesn't add too much overhead (memory/ processing), it seems a good idea, even if these properties are not necessary right now.
They are simple to implement, we could have both of them, let's do it. So, we need just to define the new name for persistentID (it's not friendly).
So, we need just to define the new name for persistentID (it's not friendly).
Wouldn't it be unique across cases now? (so globalId would make sense)
Wouldn't it be unique across cases now? (so globalId would make sense)
No, I didn't change the computation logic, that's why I'm thinking to revert globalID name or to rename to something else...
For now, I'm inclined to constantID
Another option: longID
Ruback's suggestion: trackID or trackingID, I liked it.
So If I understand correctly, there will be two properties, globalId and trackingId?
So If I understand correctly, there will be two properties, globalId and trackingId?
I'm going to rename the original persistentId to trackId. We can add a globalID, it is simple.
I had to do a number of commits to completely avoid and protect against using the item ID before it could be possibly changed/updated. This could happen when subitems are committed without their parents, pointing to an inexistent parentID. When parents are processed again, their ID could be updated to the previous value, keeping the correct parent-child relationships. And item IDs were being used in a number of places before this checking and possible ID updating logic...
Other approaches would be:
The scenario above could also happen with default file system items, files inside a directory could be committed and their parent folder not, depending on threading schedule.
So, all items must have a trackID and parentTrackID to allow checking the parent-child relationships, maybe parentId (int) could be completely replaced by parentTrackId in all places in the future, maybe this would make ID updating not needed anymore.
I like this approach, it will simply the process a lot. The only draw back is the length of the trackID.
There is also the multivalued parentIds property (different from parentId) with all item parents, used to allow fast filtering on file tree. I don't like it very much, because it is redundant and duplicates information. I didn't tested yet the file tree performance with a recursive solution using just the parentId field, maybe keeping it on memory would not have a large performance impact...
There is also the multivalued parentIds property (different from parentId) with all item parents, used to allow fast filtering on file tree
I also use it to allow filtering using the file tree in the web interface.
Have you tested an implementation with just parentId, right? Did it have a noticeable performance impact?
Have you tested an implementation with just parentId, right? Did it have a noticeable performance impact?
I couldn't make the searches, because I have to filter items that has in their parentdIds the ids of the selected items.
I couldn't make the searches, because I have to filter items that has in their parentdIds the ids of the selected items.
I see, this would need some recursive search, possibly Elastic doesn't have a support for that, but we could try to implement this inside iped...
This is an ID that doesn't change between different runs used when resuming processing to skip already processed items. It is built hashing different concatenated IDs: path, idInDataSource (eg. sleuthID, ad1ID, ufdrID), subitemId, parentContainerPersistentID.
Current name isn't intuitive. Although it is not unique across different cases (it is not an UUID), changing it to globalID seems more user friendly. Any other suggestion?