mila-iqia / fuel

A data pipeline framework for machine learning
MIT License
867 stars 268 forks source link

Resuming blocks training with manually opened H5PYDataset fails #343

Closed dwf closed 8 years ago

dwf commented 8 years ago

"No open file handle".

rizar commented 8 years ago

Can you please provide a bit more information? What does "manually opened" mean in this context?

dwf commented 8 years ago

Sorry, by that I mean "not a wrapper class". But in fact it has more to do with load_in_memory True vs. False. If it's False, the file handle never gets opened.

There's a simple solution in the form of making _out_of_memory_open() just open a file-handle is the object doesn't have one. Is there a more general solution?

rizar commented 8 years ago

Hmm... looks like I have already experienced this issue. Just a sanity check, are you using the latest Fuel?

On 13 April 2016 at 14:03, David Warde-Farley notifications@github.com wrote:

Sorry, by that I mean "not a wrapper class". But in fact it has more to do with load_in_memory True vs. False. If it's False, the file handle never gets opened.

There's a simple solution in the form of making _out_of_memory_open() just open a file-handle is the object doesn't have one. Is there a more general solution?

— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/mila-udem/fuel/issues/343#issuecomment-209571035

dwf commented 8 years ago

Yep.

On Thu, Apr 14, 2016 at 1:25 PM, Dzmitry Bahdanau notifications@github.com wrote:

Hmm... looks like I have already experienced this issue. Just a sanity check, are you using the latest Fuel?

On 13 April 2016 at 14:03, David Warde-Farley notifications@github.com wrote:

Sorry, by that I mean "not a wrapper class". But in fact it has more to do with load_in_memory True vs. False. If it's False, the file handle never gets opened.

There's a simple solution in the form of making _out_of_memory_open() just open a file-handle is the object doesn't have one. Is there a more general solution?

— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/mila-udem/fuel/issues/343#issuecomment-209571035

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/mila-udem/fuel/issues/343#issuecomment-210062422

rizar commented 8 years ago

OK, I reproduced the issue locally. I think it is kind of weird that we pickled an opened dataset but then it is not open after unpickling.

dwf commented 8 years ago

Well, you can't pickle a filehandle, and opening immediately upon unpickling is dangerous since if the file doesn't exist, you now can't unpickle your model.

I think an acceptable solution is to make it behave the same as when load_in_memory=True (this should really be the case if we are aiming to make the in-memory/out-of-memory distinction transparent), and auto-open the file handle if one is not open when get_data is called.

On Thu, Apr 14, 2016 at 1:42 PM, Dzmitry Bahdanau notifications@github.com wrote:

OK, I reproduced the issue locally. I think it is kind of weird that we pickled an opened dataset but then it is not opened after unpickling.

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/mila-udem/fuel/issues/343#issuecomment-210071526

rizar commented 8 years ago

I refreshed my understanding of how H5PyDataset work, and for the record I put a here a slightly more verbose account of the issue. The @do_not_pickle_attributes decorator is responsible for calling load when any of the attributes that are its arguments is requested. The load method is supposed to bring the dataset back in a fully functional condition. It does it correctly only for in-memory datasets.

I think the real cause of the issue is that def open is a no-op for the in-memory case and at the same time is very important for the out-of-memory case. I agree that we should unify both cases. My proposal is:

This is different from your solution in a way that it addresses more the quirks of the current in-memory implementation. As far as I understand, if I create an H5PyDataset and just request its subsets, this will load everything in memory, because load will be called. I think this behavior should be changed.

dwf commented 8 years ago

I think yours sounds like the "right" solution but I don't have time to figure out how to implement it right now so I've sent a PR with my patch so that others can at least use this hacky fix.