replikativ / konserve

A clojuresque key-value/document store protocol with core.async.
Eclipse Public License 1.0
300 stars 25 forks source link

Returns only nil on windows 10 system #5

Closed kordano closed 7 years ago

kordano commented 7 years ago

Currently testing on Windows 10 a simple fs-store using [io.replikativ/konserve "0.4.6"]:

(def store (<?? S (new-fs-store "/tmp/some_store")))
=> #'user/store
store
=>
#konserve.filestore.FileSystemStore{:folder "/tmp/some_store",
                                    :serializer #konserve.serializers.FressianSerializer{},
                                    :read-handlers #atom[{} 0x5452bb],
                                    :write-handlers #atom[{} 0x80124],
                                    :locks #atom[{} 0x32c145],
                                    :config {:fsync true}}
(<?? S (k/assoc-in store [:a] 1))
=> [nil 1]
(<?? S (k/get-in store [:a]))
=> nil
whilo commented 7 years ago

Can you inspect the files in the folder before and after the assoc-in? There should be exactly one file for the key :a containing the value 1 in fressian encoding, probably two bytes long or so. Also is there some error in the server-error buffer? konserve internally does not use the supervisor, so it can be used with plain core.async. Maybe some error slipped in there (but it shouldn't)?

whilo commented 7 years ago

Also note that the current version is 0.4.8.

whilo commented 7 years ago

Btw. is your path valid? On Windows you have to use a drive letter, I think. It is still a bug that konserve doesn't throw an error then.

oahner commented 7 years ago

I ran into this issue on Windows 8 as well and did some digging; looks like File.renameTo fails, leaving the store file's filename with a trailing '.new'. In the example above, if I manually rename the file before running (<?? S (k/get-in store [:a])), it properly returns 1.

whilo commented 7 years ago

@oahner Thanks a lot. I have pushed a new version to master, can you try whether the Files.move() fix works?

oahner commented 7 years ago

That was quick! Unfortunately no, Files.move is also unable to move an open file on Windows. Looks like the FileOutputStream.close call will need to happen first, but then we lose the last fsync.

Would closing, moving, re-opening, fsyncing and re-closing work? Or is the ATOMIC_MOVE flag on Files/move equivalent to fsync?

whilo commented 7 years ago

Probably. I will try that hopefully at the weekend. The code should be straightforward to hack though, if you can spare some time ;).

oahner commented 7 years ago

Well, I took a crack at it and, while the code is certainly straightforward, figuring out what APIs provide durability guarantees in a cross-platform way really is quite a headache.

As far as I can tell from the openjdk source, Files.move eventually calls MoveFileEx but there's no way to pass it the MOVEFILE_WRITE_THROUGH flag that would make it flush to disk before returning. I haven't found another Java API that allows this but I suppose it could be done directly through JNI if there are no other alternatives.

Unfortunately, the above research uncovered a few issues with fsync, namely, that fsync'ing the file is not enough and the folder needs to be fsync'd as well (fact that konserve seems aware of since there's a commented out fsync to the folder in the update-in implementation).

I'm still trying to make sense of this but I may not be able to continue until next week, so here's a few interesting links for future reference (fair warning, the rabbit hole is deep):

whilo commented 7 years ago

Yes, I haven't come to do the fsync properly for the directory yet, as you mention it is far from being clear what you need to do to properly ensure durability. Yes the filesystem rabbit hole is deep and I think they are a mess as an interface.

I have no experience with JNI yet and don't have a windows machine, but @kordano might be able to check. @oahner If you have ideas beyond file moving for atomicity, then we could also change the implementation.

whilo commented 7 years ago

@oahner And thanks for diving into this! I hope we can nail it down somehow.

eslachance commented 7 years ago

Will add this back in in case it is actually relevant: I attempted to use this from a Ubuntu VM, which was accessing the Windows Host's drive through a mount. I thought that was clever, but I had exactly the same issue with the .new not being removed and get-in returning nil.

I'm unable to test an NTFS drive directly from Linux without being in a VM at the moment, so I'm not sure if this means it's an NTFS issue or still just a Windows issue (virtualbox being a Windows software after all). The specificity of hardware interactions from within a VM is a bit beyond me but I thought I'd let you know. Perhaps someone can test with an NTFS partition?

whilo commented 7 years ago

@oahner @eslachance I have added fsync'ing for the directory and close the file before moving now. @kordano has verified that it works on Windows now and it is as 0.4.10 on clojars. Please point out any remaining issues. The file-store also does non-blocking reading now finally, so you also get more efficiency :).

whilo commented 7 years ago

I'll close the issue for now. Please open a new one, if you are still affected!