Open colganwi opened 2 months ago
Thanks for reporting - this does look like a real issue related to the authentication layer. The model downloading component implements a simple cache, but the model extraction component doesn't, so I think the .tar.gz is being extracted on every run. That can definitely cause issues if one process is reading the extracted files while another is overwriting them with a newly extracted copy.
I suspect the most straightforward fix would be to add caching to the model extraction piece as well.
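A minimal sketch of what caching the extraction step could look like, assuming the archive is a .tar.gz and the extracted model lives in its own directory. The helper name `extract_once` and its arguments are hypothetical and not part of deepcell. The idea is to skip extraction when the target directory already exists, and otherwise to extract into a private temporary directory and rename it into place, so a concurrent reader never sees a half-written model:

```python
import os
import shutil
import tarfile
import tempfile
from pathlib import Path


def extract_once(archive_path, dest_dir):
    """Extract archive_path into dest_dir unless an earlier run already did.

    Hypothetical helper for illustration; not part of deepcell.
    """
    dest = Path(dest_dir)
    if dest.is_dir():
        # Cache hit: a previous run finished extracting, so only read.
        return dest

    dest.parent.mkdir(parents=True, exist_ok=True)
    # Extract into a private directory next to the target so that the
    # final rename stays on one filesystem.
    tmp = Path(tempfile.mkdtemp(dir=dest.parent, prefix=dest.name + ".tmp-"))
    try:
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(tmp)
        try:
            # Publish the fully extracted directory in a single step.
            os.rename(tmp, dest)
        except OSError:
            # Another process published its copy first; use that one.
            if not dest.is_dir():
                raise
    finally:
        # Remove the temporary directory if it was not renamed into place.
        if tmp.exists():
            shutil.rmtree(tmp, ignore_errors=True)
    return dest
```

With something like this, parallel jobs that start at the same time may each extract once, but none of them overwrites files that another process is already reading, and later runs only read from the cache.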
Thanks for looking into this. Let me know when you have a patch. For now, I'm able to work around it by not running too many jobs in parallel and using the try-catch above.
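For anyone else hitting this before a patch lands, a retry wrapper along these lines is one way to express that kind of workaround. This is only a sketch, not the try-catch referred to above: `load_with_retry` and its parameters are hypothetical names, and the loader is passed in as a callable so the wrapper does not depend on any particular library.

```python
import random
import time


def load_with_retry(loader, attempts=5, base_delay=1.0, exceptions=(Exception,)):
    """Call loader() until it succeeds or attempts run out.

    Hypothetical helper for illustration; not part of deepcell.
    """
    for attempt in range(attempts):
        try:
            return loader()
        except exceptions:
            if attempt == attempts - 1:
                # Out of retries: surface the last error to the caller.
                raise
            # Exponential backoff plus jitter so that parallel jobs
            # do not all retry at the same moment.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

Assuming Mesmer is imported with `from deepcell.applications import Mesmer`, usage would be something like `app = load_with_retry(Mesmer)`. This only hides the race rather than fixing it, so it is a stopgap until the extraction step is cached.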
Describe the bug It seems like Mesmer reads the TF SavedModel in write mode, which means that multiple processes cannot load Mesmer simultaneously. This results in flaky instantiation when running Mesmer in parallel on an HPC cluster.
To Reproduce Run the code below with >20 cores. If one core is currently loading Mesmer, other cores will throw
Read less bytes than requested
or a number of other errors.
Code:
Running:
Error:
Expected behavior Instantiating Mesmer should be reliable and should not involve any file locks or write operations.