mwotton / judy

Other
18 stars 4 forks source link

Unsafe IO for keys, elems, toList #2

Closed arthurl closed 8 years ago

arthurl commented 8 years ago

unsafeInterleaveIO in the functions keys, elems and toList is actually unsafe.

>>> j <- J.new :: IO (J.JudyL Int)
>>> J.insert 0 10 j
>>> l <- J.keys j
>>> J.insert 1 11 j
>>> l
[0,...(garbage output)

I'm not sure how to solve this problem. I've just added some warnings to the documentation in this pull request.

The only thing I can come up with is to ask the user to provide a function of type [(k, v)] -> IO b and fully evaluating b, so that all lazy evaluation can be done while holding the mutex lock. However, it sounds terribly inelegant. Moreover, one has to manually ensure that the array is not referenced in that function, otherwise deadlock will result.

Having said all that, please don't remove toList; serialising the array to file would be a nightmare.

Arthur

mwotton commented 8 years ago

Thanks! Reckon you can upload it as a failing test?

On Tue, 16 Feb 2016 8:09 am Arthur notifications@github.com wrote:

unsafeInterleaveIO in the functions keys, elems and toList is actually unsafe.

j <- J.new :: IO (J.JudyL Int) J.insert 0 10 j l <- J.keys j J.insert 1 11 j l [0,...(garbage output)

I'm not sure how to solve this problem. I've just added some warnings to the documentation in this pull request.

The only thing I can come up with is to ask the user to provide a function of type [(k, v)] -> IO b and fully evaluating b, so that all lazy evaluation can be done while holding the mutex lock. However, it sounds terribly inelegant. Moreover, one has to manually ensure that the array is not referenced in that function, otherwise deadlock will result.

Having said all that, please don't remove toList; serialising the array to file would be a nightmare.

Arthur

You can view, comment on, or merge this pull request online at:

https://github.com/mwotton/judy/pull/2 Commit Summary

  • Warn unsafe IO for keys, elems, toList in docs

File Changes

Patch Links:

— Reply to this email directly or view it on GitHub https://github.com/mwotton/judy/pull/2.

ketil-malde commented 8 years ago

Would it be an option/solution to provide 'freeze' and 'thaw' operations, like for MArrays? With toList etc working on pure versions. A bit more involved than one would want, I suppose.

mwotton commented 8 years ago

possibly. i'll have to have a look at it later, I don't yet really understand the failure mode.

On Tue, Feb 16, 2016 at 9:50 AM ketil-malde notifications@github.com wrote:

Would it be an option/solution to provide 'freeze' and 'thaw' operations, like for MArrays? With toList etc working on pure versions. A bit more involved than one would want, I suppose.

— Reply to this email directly or view it on GitHub https://github.com/mwotton/judy/pull/2#issuecomment-184711194.

ketil-malde commented 8 years ago

I don't yet really understand the failure mode.

I may be misunderstanding you here, but the problem appears to me to be that you (or rather, arthurl) start a lazy extraction of values, but then modify the underlying Judy array. Presumably the lazy chunk holds pointers into Judy-land that get invalidated by the update, and chaos ensues when the extraction continues to evaluate.

The way I use it (in my kmx tool, btw: [http://biohaskell.org/Applications/kmc]()), I build the index entirely, and then write it out using toList - so I think that's a perfectly safe use.

Another way to solve this might be to have the lazy extraction hold on to the mutex until done, but that would probably lead to deadlocks or other problems.

arthurl commented 8 years ago

Thanks! Reckon you can upload it as a failing test?

Done!

arthurl commented 8 years ago

I may be misunderstanding you here, but the problem appears to me to be that you (or rather, arthurl) start a lazy extraction of values, but then modify the underlying Judy array. Presumably the lazy chunk holds pointers into Judy-land that get invalidated by the update, and chaos ensues when the extraction continues to evaluate.

Yes, that's what I thought too; the main issue being that the old value of the pointer (i.e. r in the code) to the Judy array is used when reading the next value despite it being "floated out" by unsafeInterleaveIO.

arthurl commented 8 years ago

Would it be an option/solution to provide 'freeze' and 'thaw' operations, like for MArrays? With toList etc working on pure versions. A bit more involved than one would want, I suppose.

While that would solve the immediate problem, I'm not sure I would agree that it is the way forward. After all, one of the primary uses of lazy lists isn't that anyone actually wants a linked list, but rather to exploit list fusion to allow various reduction (i.e. fold) operations to run efficiently in constant space. From that perspective, copying the Judy array is counterproductive, although it is certainly true that copying the array is still cheaper than actually having the entire linked list evaluated.

Just my two cents; not that I have any good suggestions.

ketil-malde commented 8 years ago

From that perspective, copying the Judy array is counterproductive, although it is certainly true that copying the array is still cheaper than actually having the entire linked list evaluated.

You don't need to copy it, you can have unsafeFreeze, like for arrays - but then the onus is on the user to ensure that nothing else is modifying the data structure elsewhere (and the 'unsafe' in the name kinda draws attention to that).

But if you want toList to be a safe and lazy generator, I don't think you have any other choice than to make the underlying data structure immutable. (Or at least, some sort of copy-on-write that only shares the unchanged bits.)

-k

If I haven't seen further, it is by standing in the footprints of giants

arthurl commented 8 years ago

but then the onus is on the user to ensure that nothing else is modifying the data structure elsewhere (and the 'unsafe' in the name kinda draws attention to that).

True, but in that case isn't it easier to just rename keys to keysUnsafe and plaster huge warning signs in the docs?

I'm starting from the assumption that a solution should make it literally impossible, at the type level, for anyone using the library to get such an error. (Copying the array doesn't really count as a solution.) Still, as you say, I don't see a way. Then again, I'm not that experienced.

mwotton commented 8 years ago

options:

  1. doc as unsafe
  2. thawing/freezing (but as Arthur mentioned, a copy is a bit counterproductive)
  3. ST monad to prevent escape (safe but a bit awkward)

Thoughts? You two seem to be using the library considerably more heavily than I am.

ketil-malde commented 8 years ago
  1. doc as unsafe

Certainly.

  1. thawing/freezing (but as Arthur mentioned, a copy is a bit counterproductive)

I'm not going to say it's the only safe option, but it's the one I (think) I understand. Copy is only needed if you want to safely have both a lazy traversal, and allow new updates.

unsafeFreeze :: JudyL -> IO JudyImmutable usafeFreeze = return . JudyIm -- look, ma, no copy!

toList :: JudyImmutable -> [(Word,Word)] -- doesn't really need to be -- in IO either, does it?

'freeze' would need to copy, but at least it would give you the safe option. It's clunky, true, but it is also more explicit about safety, adheres to tradition (inasmuch anybody still uses arrays, and not just vectors).

  1. ST monad to prevent escape (safe but a bit awkward)

Would this make a difference from the current implementation (in IO)? At least for arrays, you still have freezing and thawing of STArray.

ST might be a good idea anyway, as you can create safe, pure code using Judy arrays. But in practice, IO is okay.

Thoughts? You two seem to be using the library considerably more heavily than I am.

Only very limited bits of it. It's no big deal, it works great for me as it is - but I'm slightly uncomfortable about the inherent unsafety.

-k

If I haven't seen further, it is by standing in the footprints of giants

mwotton commented 8 years ago

the ST approach would require pretty serious type changes - all the updates at the very least would have to be ST-only, and JudyL would have to be inaccessible outside the ST context. I think you can make it safe and pure. I will say I don't really have the bandwidth to take this on right now.

unsafeFreeze does seem sensible - at least then you have explicitly accepted that something partial might happen if you hold it wrongly.

arthurl commented 8 years ago

doc as unsafe

Yes definitely, and —

unsafeFreeze does seem sensible - at least then you have explicitly accepted that something partial might happen if you hold it wrongly.

I see this option as doing exactly the same as the first, because in principle this changes nothing, and depends on the user to ensure that unsafeInterleaveIO doesn't float beyond the next modification of the Judy array.

I originally thought that it would be simpler to just rename keys to keysUnsafe if you really wanted the word 'unsafe' in the function name, but now I think I agree with ketil-malde, simply because it's easier for a fresh user to see unsafeFreeze and understand the above than to try explain what exactly constitutes safe usage of the function. In short, a documentation consideration.

the ST approach would require pretty serious type changes - all the updates at the very least would have to be ST-only, and JudyL would have to be inaccessible outside the ST context. I think you can make it safe and pure. I will say I don't really have the bandwidth to take this on right now.

A pure interface would be really awesome, but like ketil-malde, IO is in practise alright too.

Actually, this sounds like an interesting problem, although I don't think I have the technical expertise for this at this point. If you have time, perhaps you could do a small brain dump of where to start, for the benefit of myself and whoever next comes along? (e.g. a one or two liner on relevant articles / other libraries with similar implementations / rough sketch of the expected semantics / ideas / whatever)

mwotton commented 8 years ago

On Wed, 17 Feb 2016 9:41 am Arthur notifications@github.com wrote:

doc as unsafe

Yes definitely, and —

unsafeFreeze does seem sensible - at least then you have explicitly accepted that something partial might happen if you hold it wrongly.

I see this option as doing exactly the same as the first, because in principle this changes nothing, and depends on the user to ensure that unsafeInterleaveIO doesn't float beyond the next modification of the Judy array.

It's a bit different, because you'd also provide 'freeze :: JudyL -> IO JudyImmutable' which would copy, and keys would only work on JudyImmutable. Basically, no surprises unless you call unsafe-tagged functions. (Well, it might take more memory than you expect, but no crash.)

the ST approach would require pretty serious type changes - all the updates at the very least would have to be ST-only, and JudyL would have to be inaccessible outside the ST context. I think you can make it safe and pure. I will say I don't really have the bandwidth to take this on right now.

A pure interface would be really awesome, but like ketil-malde, IO is in practise alright too.

Actually, this sounds like an interesting problem, although I don't think I have the technical expertise for this at this point. If you have time, perhaps you could do a small brain dump of where to start, for the benefit of myself and whoever next comes along? (e.g. a one or two liner on relevant articles / other libraries with similar implementations / rough sketch of the expected semantics / ideas / whatever)

Have a look at STArray. model is basically the same, we could do worse than copy the semantics. There might be an ST Vector too? unsure, haven't looked.

so, it's pretty common practice to have a low-level, C-faithful API and a higher-level safe one. It may be worth keeping this one simple (using the freeze and unsafeFreeze calls), then writing a new judy-pure library on top.

The basics are that the ST monad lets you initialise with an instance of something, but won't let it escape. You could use this property to make sure no judy array ever escapes to the outside world: this means there are no future array update operations possible, which I think means that the unsafeInterleaveIO becomes safe, as it's only unsafe in the presence of further mutation.

Ketil, can you sanity-check my reasoning please?

cheers mark

ketil-malde commented 8 years ago

The basics are that the ST monad lets you initialise with an instance of something, but won't let it escape.

As far as I understand, this is correct. I don't understand much of how it works internally, but I've used it a couple of times, and I just think of it as a cut-down IO that allows references and mutability, without any of the other IO stuff. So you use

runST

instead of

unsafePerformIO

and at least you know that there isn't any crap going on involving files or launching missiles, or something like that. The modern solution might be to use something from the monad transformers camp, but I'm very rusty on all things Haskell these days.

I don't think ST is used much, so not sure I'd make it a priority, and I'm not at all convinced is can be used to interface to C code.

Ketil, can you sanity-check my reasoning please?

As far as I can tell, you're spot on. (But that's admittedly not very far. Footprints of giants, and all that).

-k

If I haven't seen further, it is by standing in the footprints of giants

mwotton commented 8 years ago

@ketil-malde i can't see why you couldn't use it to interface to C, if you've got access to arbitrary IO stuff :) just shifts the burden of proof to the library author.

ketil-malde commented 8 years ago

@ketil-malde i can't see why you couldn't use it to interface to C, if you've got access to arbitrary IO stuff :) just shifts the burden of proof to the library author.

In ST, you only have access to IO via unsafePerformIO, I believe. So, yes, you can do it if you feel confident, but it sounds rather dangerous, and difficult to be sure you have tested all the possible cases. Just let it live in IO, I'd say. (I tried to Google it, but couldn't find any other C-interface libs using ST.)

-k

If I haven't seen further, it is by standing in the footprints of giants

mwotton commented 8 years ago

That's what quickcheck's for :) anyway, it seems to be working now. I ended up changing the extra test cases to fit the new types, @arthurl, but they were helpful for tracking it down, thank you.

mwotton commented 8 years ago

i'll call this one 0.3.0, as it breaks keys/elems.