Open ostafen opened 2 years ago
Hey! I've thought about this and I absolutely see the value of not damaging the database file when the process dies during a write operation. But I also see a cost in that creating a temporary file may not be something that application authors have considered when writing their applications. E.g. I can imagine this leading to some sort of weird permissions error where the running process can't create new files in the directory of the database file.
As a compromise, my thought was that this could be a new storage backend that is included with TinyDB, named something like SafeJSONStorage for example. This would also include documentation that describes the upsides and downsides of both this storage and the regular JSONStorage (similar to yaml.safe_load vs yaml.load).
What do you think?
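For illustration, here is a rough sketch of what such a backend could look like. The class name, the mkstemp-based temp-file handling, and the read()/write(data) interface are assumptions made to keep the sketch self-contained (tinydb itself is not imported), not TinyDB's actual code:

```python
import json
import os
import tempfile


class SafeJSONStorage:
    """Hypothetical storage backend: like JSONStorage, but each write goes
    to a temporary file that is then atomically renamed over the db file."""

    def __init__(self, path):
        self.path = path

    def read(self):
        # A missing or empty file means "no data yet".
        if not os.path.exists(self.path):
            return None
        with open(self.path) as f:
            content = f.read()
        return json.loads(content) if content else None

    def write(self, data):
        # Create the temp file in the same directory as the database file,
        # so os.replace() is a rename on the same filesystem (atomic on
        # POSIX, and also atomic on Windows since Python 3.3's os.replace).
        dirname = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=dirname)
        try:
            with os.fdopen(fd, "w") as f:
                json.dump(data, f)
                f.flush()
                os.fsync(f.fileno())  # ensure the bytes reach the disk
            os.replace(tmp_path, self.path)
        except BaseException:
            os.unlink(tmp_path)  # don't leave a stray temp file behind
            raise
```

A crash before os.replace() leaves the old database file untouched; a crash after it leaves the fully written new version in place, so the reader never sees a half-written file.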
Hi Markus, thanks for your reply. Regarding the idea of using two storages: I think crash recovery is something that every persistent database, even the simplest, should deal with. I can't imagine any application that persists data on disk where damaging the entire content of the database would be acceptable. If that is acceptable for your application, then your application could probably just store its data in memory. So I think it's better to integrate crash recovery into the existing JsonStorage.
Moreover, I think that application authors who are really concerned about performance probably wouldn't choose TinyDB in the first place: given that the library focuses on simplicity, the performance overhead is quite tolerable.
However, if you want to optimize this aspect, a second, more performant solution is possible: the double-write technique. It avoids creating a temporary file on each write by maintaining two files at all times, a main db file and a temporary db file. Each time you have to write the database to disk, you first truncate the temporary file to zero length and write the entire database to it. At that point you are sure that, if a crash happens, you can restore the database from the temporary file, so you can proceed to write the data to the main db file as well. This means that, on application startup, you have to check that the main db file is not damaged. If it is, you restore a valid version of the database from the temp file (which is guaranteed not to be damaged); otherwise you proceed normally. I hope I have explained myself clearly.
Let me know what you think about this :-)
EDIT: This is not related to this issue, but what do you think of storing each collection in a separate JSON file? It could improve performance a lot without affecting the simplicity of the library too much.
Thanks for your input, @ostafen! I haven't forgotten about this and will reply in the next few days! 🙂
Thanks again for your input, @ostafen!
I’ll admit that I’m really torn about this. Let me explain:
On the one hand, I absolutely agree that software should strive towards reliability even in challenging circumstances like application crashes or power outages. If I was to write TinyDB from scratch a second time, I'd seriously consider shipping the default JsonStorage
with atomic writes.
But on the other hand, what I’m most worried about with a change like this is introducing subtle and hard-to-catch bugs. My main concern is that making TinyDB write a temporary file in the directory of the database file might break users’ assumption that TinyDB only tries to access the file it is told to access. Maybe I’m a little paranoid here, but I’ve had to debug issues that arose from libraries changing stuff like this quietly in a minor or patch release. These experiences made me cautious about changes that do not directly lead to errors but rather invalidate assumptions about how the library works under the hood and thus lead to subtle breakage where other programs rely on these assumptions.
All that being said, I’m open to having my mind changed. I’d really appreciate input from other users and people from the community on whether I’m being too cautious here and we should just implement these changes, or whether we should go for a compromise like the SafeJsonStorage
I described that fixes the data corruption problem but makes clear that the assumptions about how it works have to change.
What do you think of storing each collection in a separate JSON file? It could improve performance a lot without affecting the simplicity of the library too much.
I think this would break user assumptions even more than writing a temporary file, although I’d love to see a community-based extension that adds a storage backend like this!
@msiemens, I don't really see why a user should make assumptions about the database layout on disk. Apart from being an internal detail of the library, I don't see how a user could complain about making the engine more robust. As a matter of fact, users only make assumptions about APIs. Moreover, as far as I know, the rename solution is a standard strategy for safely saving a file to disk.
However, I understand your point of view, especially if you are not familiar with this solution, and I don't want to influence your opinion about this. If you change your mind, or simply want to integrate this as a separate Storage, let me know :-)
I don't really see why a user should make assumptions about the database layout on disk.
Actually, I remember people asking questions about the database layout because folks wanted to process TinyDB database files in external programs 😉 Although unfortunately I can't find those discussions right now.
But to the original question: I think I suffered some sort of analysis paralysis with this decision. Looking at this again I think we should go ahead with merging your PR and then document this feature properly (in the release notes and in the documentation).
I got stuck with thinking about this because TinyDB as a project has way (way!) more users than I ever expected or dreamed of, and sometimes I'm a little scared about inadvertently breaking stuff for some people. If the PyPI stats are to be believed, TinyDB has had more than 8 million downloads in total (although a lot will come from PyPI mirrors and CI pipelines). At a scale like this I'm sure that a lot of really rare edge cases will trigger through sheer volume alone 🙈 But I think if we inform users in the release notes they'll be able to figure out what's happening if they run into issues 🙂
You are right to be careful with this kind of thing. Anyway, as far as I'm concerned, this is a standard technique: I've seen it used in a lot of projects (for example, several databases use it to safely save metadata files).
Moreover, notice that we are not changing the way TinyDB stores files on disk, because we are only introducing an additional temporary file that is used just before writing to the real file. As a consequence, any user who wants to access the TinyDB .json file will not have to make any assumption they would not have made before: they will find their data in the .json file as before, without ever having to know that the temporary file exists.
On the contrary, we are giving those users the additional guarantee that they will never read incomplete files, which can actually happen in the case of a crash.
If you want to avoid disturbing users, you could introduce this in a new major version, or make it optional in a minor version with a warning and then make it the default in the next major version.
Hi, I noticed that JsonStorage uses the write method to rewrite the entire file. However, if a crash occurs while the file is being written, the db could be damaged. Why not write to a temporary file and rename it? You can see what I mean in PR #468 :)