Closed manuelma closed 4 months ago
Here's a modified proposal. We can do pretty much the way you're proposing and it will probably help in many cases, but I think it still has problems. Below the chart, there are a list of options to deal with the conflict situation. I still think adding a datetime would be helpful in conflict resolution because we could default to choosing the latest version of the same item. (User could of course be able to choose the behaviour, but we don't need to implement that right now - we just need something that works better).
flowchart TD
A[A: Fetch a new item from the DB] --> B{"Is there a mapped-item
with the same unique key?"}
B -->|Yes| D
D{"Is the id of the DB-item same
as the id of the mapped-item?"} -->|Yes| Z[Z: Keep mapped-item]
D -->|No| E{"mapped-item has been
added in this session?"}
E -->|Yes| F["F: Would benefit from
user interaction*"] --> Z
E -->|No| H["H: Does it matter that
the item has not been
added in this session?
Would benefit from
user interaction*"] --> Z
B -->|No| J{"Is there a mapped-item with the id
of the fetched DB-item"}
J --> |Yes| N["N: Change the mapped-item id
to the next free id"]
J --> |No| L["L: Add the feched item
to memory"]
'* There are some options: 1) Keep it simple: keep the mapped-item, drop the DB item. This will in some cases destroy data. 2) Make a new unique key and create a new item. 3) Check if the item contents are the same. If they are, the mapped-item could be either dropped (replaced by the item from DB) or it could replace the item from DB (re-committed). If the contents are different, then the user should be informed so the user can choose. 4) Add datetime information to db and memory that can be used to make a better informed automatic choice (e.g. take the latest).
Even that leftmost yes - yes path would benefit from the datetime - it could be a further check that the items are truly the same.
Thank you very much for this @jkiviluo ! Although, my algorithm was more intended for the code reviewer (maybe @soininen ?) so they know where the different things come from - and the high-level reasoning. I don't think your proposal is "implementable", it misses some key aspects of the code, notably the fact that an id-mismatch is not a conflict and needs to be solved internally. It doesn't make sense asking the user which id they prefer to use because they shouldn't know about ids in the first place.
Some more comments below (I added letters to the boxes in your diagram to refer to them more easily):
I think you raise some good points on conflict resolution though, and we should consider them in another PR.
Edit: To clarify, when I said that "an id-mismatch is not a conflict and needs to be solved internally" I was a bit loose. It is not seen as a conflict in this PR, because this is just focused on fixing internal conflicts due to items changing their ids. But on top of the id-mismatch or even if no id-mismatch, there could be user-data conflicts due to e.g. classes changing their description or parameters changing their values, and those need to be fixed and do benefit from user interaction as @jkiviluo was pointing out, but we will do it in another PR. (For now, we always favour the item in memory.)
@jkiviluo was thinking of these user-data conflicts when drawing his flowchart so it can be very useful when we start that work.
I have not yet reviewed the code, so maybe this point is moot. Do we need to communicate freeing the database id to client code (e.g. Database editor) somehow? We invented the TempId
so we did not need to deal with that and that is why I invented the negative ids in #333 in the first place.
I have not yet reviewed the code, so maybe this point is moot. Do we need to communicate freeing the database id to client code (e.g. Database editor) somehow? We invented the
TempId
so we did not need to deal with that and that is why I invented the negative ids in #333 in the first place.
Thank you @soininen - freeing the DB-id means popping any item that has it, change its id to a TempId, and then reinjecting it. I am hoping that works, but I don't know what you mean by communicating freeing the id to the client... sounds important, maybe you can expand?
Thank you @soininen - freeing the DB-id means popping any item that has it, change its id to a TempId, and then reinjecting it. I am hoping that works, but I don't know what you mean by communicating freeing the id to the client... sounds important, maybe you can expand?
As you know, SpineDatabaseManager.get_item()
relies on getting items by their id. So if the id changes in DatabaseMapping
, then we need a way to tell that to the manager, otherwise it will return the wrong (or maybe even non-existent) item. If we clear the models and undo stacks on refresh and reload all data with new ids, then this is not an issue, of course.
Thank you @soininen - freeing the DB-id means popping any item that has it, change its id to a TempId, and then reinjecting it. I am hoping that works, but I don't know what you mean by communicating freeing the id to the client... sounds important, maybe you can expand?
As you know,
SpineDatabaseManager.get_item()
relies on getting items by their id. So if the id changes inDatabaseMapping
, then we need a way to tell that to the manager, otherwise it will return the wrong (or maybe even non-existent) item. If we clear the models and undo stacks on refresh and reload all data with new ids, then this is not an issue, of course.
Ah, right, we store the id in the Qt model, so we cannot just change them without consequences. Thanks for the explanation, I think that is indeed a problem here.
I just realized we may run into the issue of conflicting ids also when fetching more, not only when refreshing. We need to be able to deal with that as well. Perhaps DatabaseMapping would just refuse to fetch more if there are new commits?
I just realized we may run into the issue of conflicting ids also when fetching more, not only when refreshing. We need to be able to deal with that as well. Perhaps DatabaseMapping would just refuse to fetch more if there are new commits?
Good point. I am trying to make it work with fetch more too, let's see if I manage. Otherwise I definitely agree we might need to give up at some point, there are many difficult things here.
To spar this PR a bit I put the following test into TestDatabaseMapping
:
def test_fetching_entities_after_external_change_has_renamed_their_classes(self):
with TemporaryDirectory() as temp_dir:
url = "sqlite:///" + os.path.join(temp_dir, "db.sqlite")
with DatabaseMapping(url, create=True) as db_map:
self._assert_success(db_map.add_entity_class_item(name="Widget"))
self._assert_success(db_map.add_entity_class_item(name="Gadget"))
self._assert_success(db_map.add_entity_item(entity_class_name="Widget", name="smart_watch"))
widget = db_map.get_entity_item(entity_class_name="Widget", name="smart_watch")
self.assertEqual(widget["name"], "smart_watch")
db_map.commit_session("Add initial data.")
with DatabaseMapping(url) as shadow_db_map:
widget_class = shadow_db_map.get_entity_class_item(name="Widget")
widget_class.update(name="NotAWidget")
gadget_class = shadow_db_map.get_entity_class_item(name="Gadget")
gadget_class.update(name="Widget")
widget_class.update(name="Gadget")
shadow_db_map.commit_session("Swap Widget and Gadget to cause mayhem.")
db_map.refresh_session()
db_map.fetch_all("entity")
gadget = db_map.get_entity_item(entity_class_name="Gadget", name="smart_watch")
self.assertEqual(gadget["name"], "smart_watch")
In the test, shadow_db_map moves smart_watch from one entity class to another one by swapping the class names. The test fails at the last assert because this change is not seen by db_map. Is this the expected outcome? Should we be able to deal with such situations?
To spar this PR a bit I put the following test into
TestDatabaseMapping
:def test_fetching_entities_after_external_change_has_renamed_their_classes(self): with TemporaryDirectory() as temp_dir: url = "sqlite:///" + os.path.join(temp_dir, "db.sqlite") with DatabaseMapping(url, create=True) as db_map: self._assert_success(db_map.add_entity_class_item(name="Widget")) self._assert_success(db_map.add_entity_class_item(name="Gadget")) self._assert_success(db_map.add_entity_item(entity_class_name="Widget", name="smart_watch")) widget = db_map.get_entity_item(entity_class_name="Widget", name="smart_watch") self.assertEqual(widget["name"], "smart_watch") db_map.commit_session("Add initial data.") with DatabaseMapping(url) as shadow_db_map: widget_class = shadow_db_map.get_entity_class_item(name="Widget") widget_class.update(name="NotAWidget") gadget_class = shadow_db_map.get_entity_class_item(name="Gadget") gadget_class.update(name="Widget") widget_class.update(name="Gadget") shadow_db_map.commit_session("Swap Widget and Gadget to cause mayhem.") db_map.refresh_session() db_map.fetch_all("entity") gadget = db_map.get_entity_item(entity_class_name="Gadget", name="smart_watch") self.assertEqual(gadget["name"], "smart_watch")
In the test, shadow_db_map moves smart_watch from one entity class to another one by swapping the class names. The test fails at the last assert because this change is not seen by db_map. Is this the expected outcome? Should we be able to deal with such situations?
Nice one, thank you @soininen - I believe the above is equivalent to Widget and Gadget swaping their ids. Yes, we should be able to deal with it - I will add the test to the PR and try to make it work.
@soininen if you have more ideas for tests please let me know. I managed to make the above test work by force-fetching references whenever the DB has external commits. This means that fetching entities will fetch the classes in that case, and then we are able to figure out that Widget and Gadget have swapped.
if you have more ideas for tests please let me know
Sure! I wrote some tests for #333. I can try salvaging some of those to see how they run in this branch later today.
if you have more ideas for tests please let me know
Sure! I wrote some tests for #333. I can try salvaging some of those to see how they run in this branch later today.
Sounds great.
Attention: 8 lines
in your changes are missing coverage. Please review.
Comparison is base (
a4313d5
) 77.46% compared to head (0f33669
) 77.56%. Report is 1 commits behind head on 0.8-dev.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
I copied the tests I deemed relevant to a new branch tests_for_fix_conflicts
making a few modifications to get them compatible with the fix_conflicts
branch. Feel free to merge tests_for_fix_conflicts
into fix_conflicts
.
Even though the tests were originally written for a conflict resolution with different philosophy than what we have here, I believe they are still beneficial. Most of the tests fail in fix_conflicts
. Below are some of my quick observations:
test_restoring_entity_whose_db_id_has_been_replaced_by_external_db_modification
In this test "other_entity" appears twice in all_items
which breaks uniqueness constraint.
test_cunning_ways_to_make_external_changes
Only one of the parameter values added by shadow_db_map
ends up as mapped item in db_map
test_adding_same_parameters_values_to_different_entities_externally
Similarly to test_cunning_ways_to_make_external_changes, the value added by shadow_db_map
does not show up in db_map
.
test_update_entity_metadata_externally, test_update_parameter_value_metadata_externally, test_update_entity_alternative_externally, test_update_superclass_subclass_externally
The two items at the end of the tests have the same ids.
test_committing_changed_purged_entity_has_been_overwritten_by_external_change
Looks like only one entity gets committed to the database while two were expected.
That's gas @soininen - thank you so much. I will merge and fix asap.
I forgot to mention: some of the ids in the tests may be incorrect. #333 used negative ids so I just removed the minus signs but it was not possible to verify the correctness if the tests failed before the asserts that include ids.
I like how this is looking now, @soininen's tests are passing after some adaptations but I think they still test what they were intended to test.
I ended up replacing all DB ids with TempIds in the end, much like @soininen did in the reverted PR, but I kept the TempId
class because I find it convenient. Although negative ids plus a lookup dictionary from temp_id to db_id would have worked the same. This allows clients to store ids (as Spine-Toolbox does plenty) without them ever becoming invalid.
One improvement (I think) is that now we replace the db-ids by TempIds in referees at item validation time, not at construction time, which probably prevents the KeyErrors we were seeing with the reverted PR.
Id replacement is also more integrated and doesn't require the additional _id_fields
class attribute for MappedItem
(it uses the existing _external_fields
plus _references
attributes).
Maybe another advantage of this compared to the reverted PR is that we do less. We do not try and solve user-data conflicts, but just internal id-conflicts. After we merge this, I would like to try and resurrect the reverted code for conflict resolution that was very elegant. (Actually, did the reverted PR attempt to solve user-data conflicts at all?)
I want to spend some time using spinetoolbox with a complex project I have before merging, to gain more confidence. Any additonal feedback is more than welcome @jkiviluo @soininen
One drawback with the TempId
is that you cannot construct them "at will" since the _next_id
counter is increased automatically at each initialization. Client code needs to resort to tricks if it wants to construct an id and use that to refer to a mapped id, see TestSpineDBFetcher._import_data()
for an example. Because of this, I prefer the negative ids and their 'less integrated' approach.
However, I am unsure if that is really as big problem as I make it seem. Toolbox uses the ids because historical reasons and I think it is better for more modern code to use unique keys to access the mapped data.
Anyway, great work with the conflict resolution! Hopefully no too big issue will rise in Toolbox testing!
Very nice progress!
Just a comment on id's: I don't think they are there just for historical reasons - they are more efficient (both speed and space) when the items are repeated in multiple tables (like entities in parameter_value table). Or maybe I'm just missing the context here...
Or maybe I'm just missing the context here...
No, you're right. Lookup by id is faster.
Or maybe I'm just missing the context here...
No, you're right. Lookup by id is faster.
Totally right, we are not just coping with having ids in the DB, we do want to have them.
I found another issue: say I purge a DB, commit, then another client adds some of the same items I just purged. With the current 'strategy' where changes in the mapping always prevail, I will never see the items added by the other client because I removed them and my changes 'win'.
But I believe I should see the items, because they were added back after I committed - so they are newer.
So I propose we 'tweak' the strategy a little bit, so that newer changes win. In other words, if my changes are uncommitted, then they always win because they eventually belong to a future commit. But if they are committed, the changes I fetch from the DB win, because they necessarily belong to a commit that happened after mine.
Fixes 'id conflicts' between in-memory mapping and new contents of the DB. This is not about resolving conflicts on user-data, that should be done in another PR.
The core idea is, if a DB-item has the same unique key as a mapped-item, then they also need to have the same id. Conversely, if a DB-item has the same id as a mapped-item, then they need to have the same unique keys. The id to trust and respect will always be the most recent from the DB.
Algorithm
The Free DB-item id bit consists in making sure that no mapped-item has the same id as the DB-item. If that's the case, we need to decide what to do with the clashing mapped-item, as we cannot keep it with its current id. This is the only point that requires conflict resolution*. At the moment we 'rescue' the mapped-item, which means we make it pending for addition with a new id - but we could use a different strategy or let the user choose between a number of strategies.
@soininen @jkiviluo you could start reviewing this if you will. I still need to polish things and especially improve unit-tests. One of the old tests is not passing. I have added two new tests but need to finalize one of them and add more to test all the different cases.
NOTE: In element I was proposing a different idea which consisted in using a tuple (last-commit-id, db-id) instead of the db-id, to avoid ambiguities, but that doesn't work. We need to resolve ambiguities and the above algorithm works much better to that end.
NOTE2: One important detail for the implementation is, we need to be able to find mapped items by unique key even if they have been removed from the memory-mapping. For this reason, the
_MappedTable._ids_by_unique_key_value
dictionary now also includes removed items, and we check if the items in it are removed or not depending on what we are using them for.Checklist before merging