labrad / servers

LabRAD servers

Qubit Server seems to intermittently not work #303

Open zchen088 opened 8 years ago

zchen088 commented 8 years ago

On Hercules (which is running the gmon experiment), every once in a while the qubit server appears to hang. The symptom is that when we run a scan, nothing happens - no error, no data saved to the datavault, the session just appears to be running the scan method indefinitely. We think it's a problem with the qubit server because when we restart it, scans seem to work again. I haven't seen this behavior on any other computers yet.

DanielSank commented 8 years ago

Any basic diagnostic info from the OS?

zchen088 commented 8 years ago

The qubit server didn't appear to be hogging a lot of memory or CPU. It is one of the older computers we have.

DanielSank commented 8 years ago

I was just wondering whether the process was idle or spinning the CPU. I'm not familiar with Scala debugging techniques, but if you guys can do a bit of research, maybe you'll find a debugger that lets you see where the program is when it hangs.
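One common way to see where a JVM process is when it hangs is a thread dump. The following is a minimal sketch, not code from the qubit server: the class name `HangProbe` is hypothetical, and it assumes the check runs inside the process being inspected. It uses the standard `java.lang.management` API to list each thread's state and stack, and to count threads deadlocked on object monitors. External tools such as `jstack` or VisualVM show the same kind of information without code changes.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper for illustration; not part of the qubit server.
final class HangProbe {

    /** Returns a text dump of all live threads: name, state, and stack frames. */
    static String dumpThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false, false: skip locked-monitor and synchronizer details.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null) {
                continue; // defensive: skip a thread that has already gone away
            }
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    /** Number of threads deadlocked on object monitors (0 if none). */
    static int deadlockedThreadCount() {
        long[] ids = ManagementFactory.getThreadMXBean().findMonitorDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) {
        System.out.print(dumpThreads());
        System.out.println("deadlocked threads: " + deadlockedThreadCount());
    }
}
```

In a dump like this, a thread that stays in `BLOCKED` or `WAITING` at the same frame across several dumps is a likely place to start looking; a hang with all threads idle points elsewhere, for example at a request that never arrived.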

maffoo commented 8 years ago

If you want to look at the Scala process, try VisualVM: https://visualvm.java.net/

However, I would guess that it's the DAC boards timing out. Are you running the latest version of the ghz fpga server? Does it log any timeouts when the slowdown happens?

DanielSank commented 8 years ago

Would restarting the qubit server make the problem go away if it's board timeouts? I guess we'll find out.

pomalley commented 8 years ago

Restarting the qubit server causes the boards to get pinged for their build numbers. It's possible (or at least conceivable) that this could get the boards working a bit longer. But I agree it's unlikely.

zchen088 commented 8 years ago

Doesn't appear to be a boards problem - I can bring up the boards while the data taking is hung. When I try the echo setting on the qubit sequencer, it also hangs without any error messages. Anecdotally, it seems related to multiple people trying to take data at once.

zchen088 commented 8 years ago

The qubit server was on version 0.6.2, and I've now updated to 0.7.0. I now get errors like this: Error: (0) java.lang.OutOfMemoryError: Java heap space [payload=None]
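For context, `java.lang.OutOfMemoryError: Java heap space` means the JVM could not satisfy an allocation within its maximum heap size (the limit set by the standard `-Xmx` option). A minimal sketch of how heap usage could be sampled from inside a JVM process follows. The `HeapReport` class is hypothetical and not part of the qubit server; it uses only `java.lang.Runtime`.

```java
// Hypothetical helper for illustration; not part of the qubit server.
final class HeapReport {

    /** Bytes currently in use on the heap (allocated minus free). */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Maximum heap the JVM will attempt to use (Long.MAX_VALUE if unlimited). */
    static long maxBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Used heap as a fraction of the maximum. */
    static double usedFraction() {
        return (double) usedBytes() / (double) maxBytes();
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        System.out.println("heap used: " + usedBytes() / mib + " MiB of "
                + maxBytes() / mib + " MiB max");
    }
}
```

Logging these numbers periodically would show whether usage climbs steadily over many runs, which would suggest a leak, or jumps on a particular request, which would suggest one oversized allocation.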

DanielSank commented 8 years ago

o_O

Does that happen on startup or under some other condition?

zchen088 commented 8 years ago

More info: the qubit sequencer often fails when I switch to a new dataset folder/registry wrapper.

DanielSank commented 8 years ago

That's interesting. @maffoo does the qubit sequencer cache data related to each run's configuration? I'm surprised this would eat enough memory to matter either way though...

pomalley commented 8 years ago

It does store the data, yes, because each experiment is a series of calls (initialize, upload SRAM, upload JT, etc). I think it should get cleared each time you re-initialize at the beginning of an experiment, but I suppose there might be a leak in there.

Also, this wouldn't really explain it because as far as the sequencer is concerned there's no difference between the first run of a new dataset and run n+1 where you just increment the delay time (for example). Uh, right?

maffoo commented 8 years ago

@DanielSank, @pomalley, yes, all data associated with a run of the qubit server is stored until the context expires or, more often, is reinitialized for a new run. This is not a lot of data (on the scale of JVM memory use), and typically we only use ~10 contexts per user and just cycle through them, so this would not use up increasing amounts of memory unless people are bypassing the context management in pyle, or opening up lots of new pyle sessions and keeping them all open, or something like that.
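The memory argument above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ContextStore` and its methods are hypothetical names, not the qubit server's actual classes, and the stored strings stand in for whatever a run uploads. It shows that cycling through a fixed pool of contexts keeps the number of stored runs bounded, while using a fresh context for every run without expiring old ones grows without limit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of per-context run storage; names are illustrative only.
final class ContextStore {
    private final Map<Integer, List<String>> runData = new HashMap<>();

    /** Start a new run in this context, discarding what the previous run stored. */
    void initialize(int contextId) {
        runData.put(contextId, new ArrayList<>());
    }

    /** Record one piece of data for the current run in this context. */
    void store(int contextId, String item) {
        List<String> items = runData.get(contextId);
        if (items == null) {
            throw new IllegalStateException("context " + contextId + " not initialized");
        }
        items.add(item);
    }

    /** Drop everything held for an expired context. */
    void expire(int contextId) {
        runData.remove(contextId);
    }

    int liveContexts() {
        return runData.size();
    }

    /** Simulate `runs` runs that cycle through `poolSize` contexts; return contexts still held. */
    static int liveContextsAfter(int runs, int poolSize) {
        ContextStore store = new ContextStore();
        for (int run = 0; run < runs; run++) {
            int ctx = run % poolSize;   // reuse contexts round-robin
            store.initialize(ctx);      // reinitializing clears the old run's data
            store.store(ctx, "SRAM data for run " + run);
        }
        return store.liveContexts();
    }

    public static void main(String[] args) {
        System.out.println("pool of 10:      " + liveContextsAfter(1000, 10));
        System.out.println("fresh every run: " + liveContextsAfter(1000, 1000));
    }
}
```

With a pool of 10 the store never holds more than 10 runs, however many scans are taken. Setting the pool size equal to the number of runs models a client that opens a new context each time and never expires it, which is the kind of usage that could grow memory.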

I have VisualVM running on Hercules monitoring the qubit server, so if it hangs again we can hopefully get a better idea why that happened. So far there haven't been any hangs since starting VisualVM (watched pots and all that).