DAC Build 8 synchronization problem

patzinak commented 8 years ago

When we run two Build 8 DACs in a daisy-chain configuration, we see a recurring, hard-to-catch synchronization problem. It shows up as a 4 ns slip between two waveforms generated by two physically different boards. We can run the boards for a day without any issues, but the next day the slip may start appearing every few minutes. Sometimes the slip disappears after a while. We have never seen the slips accumulate. Such a desynchronization event may occur anywhere from a couple of minutes to many hours after board bring-up. Running the bring-up script always solves the issue for some time. We use the up-to-date LabRAD (Scala-based)/Ethernet (Scala-based) and GHz FPGA servers. We use Build 8 because we have to read the timing data.

Please compare the intersection region of the red and yellow pulses in these pictures: [image: 2015-12-30 17 46 43] [image: 2015-12-30 17 51 26]

We started doing a full automatic board bring-up whenever we see any of the board PLLs lose lock, and this seems to partially solve the issue (we are still in the process of verifying this). However, we noticed that the PLLs can sometimes unlock on every run, or every other run, of the measurement sequence, while at other times we can do measurements for many hours without any unlocking events. The behavior can change even when nothing has been physically touched. Most of the time the PLLs unlock together, but sometimes only one of the boards unlocks. The PLLs clearly unlock more often than the slip occurs, but there seems to be a positive correlation.

@amopremcak

DanielSank commented 8 years ago

I can't give you a particularly sound explanation of why this is, but I'd take 10 to 1 odds that this is a problem with either the power supply or the 10 MHz reference.

Check the 10 MHz reference cables. Are they tight on both ends? Is there visible damage to the cables? What happens if you put the two channels of the reference on the oscilloscope? Are they aligned? Is there jitter? What model reference are you using? What model distribution amplifier are you using (or is it built into the reference)?
Check the power supply. Are all the connections tight? Pull on the individual wires hard. They should not come out. Take a photo of the connection between the power cable and the back of the power supply and post it here. Are you using metal shoes to connect the wires to the back plane of the supply or do you have bare braided wire clamped into those clampy things?
How many boards are on the same power supply? If the cables are long you can drop enough voltage over the cables to undervolt the boards and then you see all kinds of fun intermittent problems.
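To put rough numbers on the cable-drop point above, here is a minimal Ohm's-law sketch. The cable length, load current, and rail voltage are made-up illustration values, not measurements from this setup. The resistance per metre is the approximate figure for 18 AWG copper, and the factor of two assumes the current returns through an equally long wire.

```python
# Approximate resistance of 18 AWG copper wire, in ohms per metre.
OHMS_PER_M_18AWG = 0.021

def cable_drop(current_a, length_m, ohms_per_m=OHMS_PER_M_18AWG):
    """Voltage lost in a supply cable.

    Assumes the current flows out and back through wires of equal
    length, hence the factor of 2.
    """
    return current_a * ohms_per_m * 2 * length_m

def voltage_at_board(v_supply, current_a, length_m):
    """Voltage left at the board after the cable drop."""
    return v_supply - cable_drop(current_a, length_m)

# Hypothetical example: 3 A drawn through 2 m of cable from a 3.3 V rail.
print(f"drop: {cable_drop(3.0, 2.0):.3f} V")
print(f"at board: {voltage_at_board(3.3, 3.0, 2.0):.3f} V")
```

With several boards on one supply the currents add, so the drop grows with each board sharing the same cable run.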

DanielSank commented 8 years ago

Pinging @ejeffrey and @JulianSKelly for ideas.

ejeffrey commented 8 years ago

For 4 ns time slips I would first check the daisy chain cable length. If you have cables that are 6-12 inches longer and shorter, try those. If the second board very occasionally triggers 4 ns later than you expect, use a shorter cable; if it is occasionally early, use a longer cable. If you are unsure, just try both options.

If you don't have UTP cables of the necessary length, you can alternatively try shortening or lengthening the 10 MHz clock cable, which will have the same effect. For example, if you have an 8 inch BNC cable, you can hook it inline with a barrel connector on either the master or slave board and see if the problem goes away.
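As a rough guide to how much delay a given length of cable adds, here is a minimal sketch. The velocity factor of 0.66 is an assumed typical value for solid-polyethylene coax; the actual cables in a given rack may differ, so treat the numbers as order-of-magnitude only.

```python
C_M_PER_NS = 0.299792458  # speed of light in vacuum, metres per nanosecond
INCH_M = 0.0254           # metres per inch

def cable_delay_ns(length_in, velocity_factor=0.66):
    """Propagation delay of a cable of the given length in inches.

    velocity_factor is the signal speed as a fraction of c (assumed value).
    """
    return length_in * INCH_M / (velocity_factor * C_M_PER_NS)

for inches in (6, 8, 12):
    print(f"{inches} in -> {cable_delay_ns(inches):.2f} ns")
```

Under this assumption a foot of cable shifts an edge by roughly 1.5 ns, which is less than the 4 ns slip itself but may be enough to move a trigger edge away from a marginal timing boundary.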

We have noticed that the SRS FRS725 10 MHz clock source can have up to 7 nanoseconds channel skew between outputs, which can cause this type of problem. This can also be really annoying if you are trying to debug something and check all the connections, but then accidentally switch ports on the 10 MHz reference. If you are using a 10 MHz reference that has skew between channels, label each output so you always use the same one for each board.

Finally do check the power supply voltages. Actually test this first because it is so easy. Measure with a voltmeter right at the turrets on each board and make sure the voltage is correct. This can cause intermittent problems as a sagging voltage supply can cause the logic thresholds or propagation delays to change slightly, leading to timing variation, which would explain the intermittent behavior if your cable delays are right on the edge.

I honestly don't have any good explanation of why running the bringup script would change anything, but this is my standard starting point for diagnosing timing problems.

Evan

ejeffrey commented 8 years ago

One more thing, Dan asked about how you were connecting wires to the power supply. The terminals he is talking about are like these:

http://www.mouser.com/ProductDetail/Panduit/PV18-P47-MY/?qs=sGAEpiMZZMsg%252bdojMTmmLBEr%252bcarA25d

You crimp those onto your normal stranded wires and then insert them into the screw terminal. This is much more reliable and secure than bare or tinned wire. If you are not using these, I highly recommend installing them. It may not be related to the problem you are having now, but it will dramatically reduce the odds of a connection getting loose over time and leading to future problems.

patzinak commented 8 years ago

Daniel, Evan, thanks a lot for your feedback!

As for the power, we do not use the crimp terminals on the wires. This is now on our to-do list. As for the cables, I personally fixed a couple issues with them a few months ago and they are as good as they can be given the connector type/wire gauges/place for the stress relief/etc. I do not like them (the difference in lengths, wire gauges, the connector) but they should definitely do the job.

The back of the power supply panel looks like this. The sockets are chosen based on the anecdotal reliability evidence... In my experience, apart from one obviously broken socket on this particular power supply, they all work equally well. [image: 2016-01-10 21 20 38]

We have two DACs and one ADC connected to the power supply. The DAC voltages measured at the turrets with a trusted DMM, dropping the last digit, are:

 master      slave
-5.632 V    -5.644 V
 5.475 V     5.458 V
 3.755 V     3.700 V
 3.024 V     3.000 V
 1.617 V     1.605 V
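For a quick side-by-side comparison of the two boards, the readings above can be tabulated as master-minus-slave differences. This is only a sketch for reading the table; it says nothing about what the acceptable tolerance is.

```python
# (master, slave) turret voltages in volts, copied from the table above.
rails = [
    (-5.632, -5.644),
    (5.475, 5.458),
    (3.755, 3.700),
    (3.024, 3.000),
    (1.617, 1.605),
]

def mismatch_mv(master, slave):
    """Difference in magnitude, master minus slave, in millivolts."""
    return (abs(master) - abs(slave)) * 1000.0

mismatches = [mismatch_mv(m, s) for m, s in rails]
for (m, s), d in zip(rails, mismatches):
    print(f"{m:+.3f} V vs {s:+.3f} V: {d:+.0f} mV")

# Rail with the largest board-to-board difference.
largest = max(mismatches, key=abs)
print(f"largest mismatch: {largest:+.0f} mV")
```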

The power supply panel reads: [image: 2016-01-10 22 45 29]

Unquestionably, the current reading for the fourth power supply is invalid. As for the voltage readings, they are stable within +/-0.01 V.

The voltages are well above the numbers given in GHzDAC_DataSheet.doc. The numbers for the +5.5 V and +1.65 V power supplies are lower than the values given on the power supply Wiki page, but by less than 0.05 V. I do not understand everything perfectly, but the voltages seem to be within the limits of the power regulators/drivers listed in PartsGHzDAC.pdf.

Does anything about the power look suspicious enough that it should be fixed immediately?

patzinak commented 8 years ago

As for the length of the Ethernet cables, in the documentation I saw the following statement: "The length of the interface cables may be calibrated for maximally stable operation by using a separate FPGA test program." If this code is openly distributed, where can we find it/how can we use it? Also, I am still a bit confused whether the optimal daisy-chain cable length should be the same for all racks/setups or it is expected to vary because the boards are not physically identical.

amopremcak commented 8 years ago

I've been trying to pinpoint the source of this issue on a rack different from the one that @patzinak saw this issue on in the first place. Apart from two different DACs, the setup is using a different 10 MHz reference clock (SRS SIM 940 10 MHz Rubidium Frequency Standard). The clock signals arriving at the DACs are 7 dBm.

Since the timing slip noticed by @patzinak occurred on a time scale of hours after the initial DAC bringup, I set up some overnight runs in an attempt to reproduce this on my rack. Both boards were configured to output square waves at the same time. Without querying the boards for any information, I was unable to produce a timing slip > 4 ns over the course of 12 hours (which corresponds to 10000 runs). Let us refer to channels A and B on board 1 (2) by A1 (A2) & B1 (B2). The data below shows the time delay between two channels on the same board (blue) as well as two channels on different boards (green): [image: notimeslip]

I then modified the script to query the PLL state of each board before outputting a square wave, and saw some interesting behavior. The data can be seen below: [image] Only the green data set contains these ~50 ns time slips, which corresponds to a time slip between two different boards. A representative voltage time series for channels A1 & A2 corresponding to one of these 50 ns slips is given by [image: voltagetimeslip], whereas a representative time series for channels A1 & A2 corresponding to no slipping is given by [image: voltagetimeseriesnoslip].

We ordered the terminals recommended by @ejeffrey but they have not yet arrived. With regards to what @DanielSank suggested, the 10 MHz cables are securely fastened. There is no visible damage on any of my cables. The clock signals are aligned and the jitter is extremely small, undetectable due to the noise floor of our scope. I can sample the power but I am not sure what to look for other than maybe stability over long time scales. Any thoughts?
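For readers who want to reproduce this kind of measurement, here is a minimal sketch of how a board-to-board delay could be extracted from two sampled traces and how outlier runs could be flagged. It is not the script used for the data above; the threshold-crossing method, the function names, and the 2 ns tolerance are all assumptions chosen for illustration.

```python
def rising_edge_time(times, volts, threshold):
    """Time of the first upward threshold crossing, linearly interpolated."""
    for i in range(1, len(volts)):
        v0, v1 = volts[i - 1], volts[i]
        if v0 < threshold <= v1:
            frac = (threshold - v0) / (v1 - v0)
            return times[i - 1] + frac * (times[i] - times[i - 1])
    raise ValueError("no rising edge found")

def board_delay(times, trace_a, trace_b, threshold):
    """Delay of trace_b relative to trace_a, in the units of `times`."""
    return (rising_edge_time(times, trace_b, threshold)
            - rising_edge_time(times, trace_a, threshold))

def find_slips(delays_ns, tolerance_ns=2.0):
    """Indices of runs whose delay differs from the median by more than tolerance_ns."""
    ordered = sorted(delays_ns)
    median = ordered[len(ordered) // 2]
    return [i for i, d in enumerate(delays_ns) if abs(d - median) > tolerance_ns]

# Made-up example: one run out of five shows a large jump.
print(find_slips([3.0, 3.1, 2.9, 53.0, 3.0]))
```

Comparing against the median rather than the first run keeps a single bad run from defining the baseline.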

DanielSank commented 8 years ago

When you say "query the PLL state", what exactly did you do?

amopremcak commented 8 years ago

Using the ghz fpga server, I select a DAC using the select_device setting, then I call the pll_query setting.
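In script form, that sequence might look like the sketch below. It assumes a pylabrad-style server object on which the two settings named above are exposed as methods; how that object is obtained, the board names, and the meaning of whatever `pll_query` returns are not specified here, so the result is stored untouched.

```python
def query_plls(fpga_server, board_names):
    """Select each board in turn and record the raw pll_query result.

    fpga_server is assumed to expose select_device(name) and pll_query()
    as methods; the return value of pll_query is passed through as-is.
    """
    results = {}
    for name in board_names:
        fpga_server.select_device(name)
        results[name] = fpga_server.pll_query()
    return results
```

Calling this once per run, just before the square-wave output, matches the order of operations described above.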

DanielSank commented 8 years ago

@ejeffrey can you think of a way that unreliable daisy chain connections might cause the few-ns time slips? The 50 ns slips after querying the PLL are totally bewildering to me.

The one thing I can think of is @jwenner's famous phases scan thing, so perhaps he can comment.

amopremcak commented 8 years ago

Also, I've noticed that with fixed registry settings and cable lengths, I've been able to produce two different delays between the boards: in one case a steady delay of roughly 3 ns (like the data shown above) and in the other case a steady delay of roughly 20 ns. With PLL querying, I've been able to get a ~50 ns timing slip in both cases, in other words from 3 ns to (3 + 50) ns (above), and from 20 ns to (20 + 50) ns. Is there a preferred daisy chain cable length? Also, is there a preferred length when going from the switch to each of the boards?

DanielSank commented 8 years ago

As for the length of the Ethernet cables, in the documentation I saw the following statement: "The length of the interface cables may be calibrated for maximally stable operation by using a separate FPGA test program." If this code is openly distributed, where can we find it/how can we use it?

I think that's referring to the bringup scripts. @jrainbo I thought we had a repo with that stuff but I don't see it now in our group's github page.

Also, I am still a bit confused whether the optimal daisy-chain cable length should be the same for all racks/setups or it is expected to vary because the boards are not physically identical.

We find that if your 10 MHz references are aligned and you use the same length clock cables for each board, then the daisy chain cables are all the same length. How long are your daisy chain cables?

jwenner commented 8 years ago

@DanielSank, https://github.com/martinisgroup/ghzdac8/tree/master/fpga/ghzdac/GHzDACHardwareTests (private repo). The desired FPGA code is DACtest2.

As for my scanPhases code, it is meant to correct jumps of 1 ns between the two channels of the same DAC, which change from one DAC bringup to the next. Hence, I don't think that's the issue here. Note that I never characterized long- (or even medium-) term stability of the timings.

I agree with @ejeffrey that the best bet is to try changing the length of the daisy chain cables. The ideal lengths seem to be either 2' or 3' depending on the FPGA *.pof file used. In particular, I think that some versions of Builds 7/8 require 2' and other versions require 3' due to a difference in the Quartus version used to compile the FPGA code. (And yes, there are different versions with the same build number.) I seem to recall that, so long as the same FPGA firmware is used on all boards in the rack and subject to the conditions @DanielSank listed above, the daisy cables should all be the same length. However, mix versions of Build 8 (or Build 7), and you may have to use different lengths within the rack.

jwenner commented 8 years ago

(As a note for someone - e.g., @ejeffrey, @DanielSank, @jrainbo - to ask John) How do the PLLs trigger off the 10MHz clock? Do they trigger off just the rising or the falling edge, or do they trigger whenever the 10MHz clock passes through 0? If the latter, this could be a source of 50ns jumps.
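The arithmetic behind this question, as a short sketch: a 10 MHz reference has a 100 ns period, so consecutive zero crossings (rising and falling) are 50 ns apart, which matches the size of the observed jumps. Whether the PLLs actually behave this way is the open question; the code only shows the numbers.

```python
def period_ns(freq_hz):
    """Period of a periodic signal in nanoseconds."""
    return 1e9 / freq_hz

ref_period = period_ns(10e6)       # full period of the 10 MHz reference
zero_crossing_spacing = ref_period / 2  # rising and falling crossings alternate

print(f"period: {ref_period:.0f} ns")
print(f"zero-crossing spacing: {zero_crossing_spacing:.0f} ns")
```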

jwenner commented 8 years ago

So far as I know, there is no preferred length for the Ethernet cable from the switch to the board. So far as I know, this is used strictly for communication with the computers, not for timing.

amopremcak commented 8 years ago

@DanielSank, the daisy chain cables for my rack are 24 inches long. I can characterize the jitter of our reference clock signal tonight just to make sure that we are indeed stable over the time scales shown in the data above. All of our DACs are Build 8. Perhaps we can get our hands on the latest version of Build 8 and reflash all of our DACs so that we have uniformity across setups and so that we can fix an Ethernet cable length for all of our racks.

patzinak commented 8 years ago

We find that if your 10 MHz references are aligned and you use the same length clock cables for each board, then the daisy chain cables are all the same length. How long are your daisy chain cables?

We use 24" cables on the rack that had the original issue. I tried 25" ones (somehow we got a set of these in our lab) and 3' ones. The 25" cables didn't make any apparent difference. The boards did not like the 3' cables at all.

Just a couple of clarifications: none of the PLL queries in @amopremcak's tests indicates that a PLL is unlocked, even when the ~50 ns slips occur. The other issue he sees is that some registry settings (or, I guess, the start_delay FPGA server setting) are not reliable, in the sense that if you power cycle the boards then the delay between the boards can change.

amopremcak commented 8 years ago

I gathered some more data on the registry settings leading to inconsistent delays between boards. The procedure is as follows:

1) Power cycle the electronics, start the appropriate servers for communication with the boards, run the DAC bring-up script, and output square waves on channels A1, B1 (of board 1) and A2, B2 (of board 2). With the start delay between boards in the registry given by ("DAC 11",6),("DAC 12",6), I sampled each of the outputs and obtained the following plot [image: powercycleplusstartup], which shows an approximately 20 ns delay between board 1 and board 2.

2) I then set the start delay to ("DAC 11",6),("DAC 12",2) in order to minimize the delay between board 1 and board 2. Without doing anything else, I re-run my script for outputting square waves, sample the waveforms, and find [image: adjustregistrysettingstominimizedelay], which shows a ~2 ns delay between boards. So far so good.

3) I then re-run the DAC bring-up script with the registry start delays fixed at ("DAC 11",6),("DAC 12",2), and I get [image: rerunbringupafteradjustedregistrysettings], which is identical to the plot above.

4) Then I close down all LabRAD servers in order to close communication with the boards, restart the appropriate servers, run DAC bring-up with the exact same registry settings, and I get [image: closecommunicationrestartserversplusbringup], which shows an entirely new delay, with board 2 (slave) preceding board 1 (master) by ~20 ns. Also notice that A1 and B1 are slightly out of sync, by ~2 ns.

5) I then re-run bring-up once more to see if this problem goes away: [image: rerunbringupaftercommunicationclosureplusinitialbr]. This fixes the synchronization problem between channels A1 and B1 but leaves the delay between boards 1 and 2 as above. Power cycles hereafter lead to the same delay (as shown above) between boards 1 and 2, insofar as I can tell. The small delay between channels A1 and B1 occurs sometimes but not always.

I have seen these issues with 2' and 3' daisy chain cables between my boards. Has anyone seen this type of behavior before?

DanielSank commented 8 years ago

Paging @jwenner. I think we ought to send the scan phases thing and see what we see.

jwenner commented 8 years ago

@amopremcak, while I don't know about the large 20 ns jumps, I suspect that my ScanPhases code can help with the ~1-2 ns difference between A1 and B1. Instructions for dealing with this can be found at https://matrix-reloaded.physics.ucsb.edu/twiki/do/view/Electronics/ChangingPhases (see our group website under Electronics for user name + password). However, I'll need to send you the scripts to use along with the FPGA code to try; is there an email address I can send these to?

By the way, whenever you do DAC bringup you should make sure that you aren't getting FIFO or BIST failures. If you use the standard bringup script (https://github.com/martinisgroup/servers/blob/master/GHz_DAC_bringup.py), it will tell you when there are failures. If you are directly calling dac_bringup in the ghz_fpga server, when you get the long list out at the end, make sure that FIFO Success and BIST Success are True for both DAC A and DAC B.
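If you want to automate that check, a minimal sketch is below. The layout of the `dac_bringup` output is not described here, so the code assumes you have already converted it into a dictionary of per-DAC flags; only the two flag names come from the paragraph above, and everything else (function name, dictionary layout, DAC labels) is hypothetical.

```python
REQUIRED_FLAGS = ("FIFO Success", "BIST Success")

def bringup_failures(results):
    """List (dac, flag) pairs that are missing or not True.

    results is assumed to look like {"DAC A": {"FIFO Success": True, ...}, ...};
    this layout is an assumption, not the server's documented format.
    """
    return [
        (dac, flag)
        for dac, flags in results.items()
        for flag in REQUIRED_FLAGS
        if not flags.get(flag, False)
    ]

# Made-up example in which DAC B fails its BIST.
example = {
    "DAC A": {"FIFO Success": True, "BIST Success": True},
    "DAC B": {"FIFO Success": True, "BIST Success": False},
}
print(bringup_failures(example))
```

An empty list means both checks passed on every DAC; a missing flag is treated as a failure so that a malformed result does not pass silently.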

amopremcak commented 8 years ago

My email address is opremcak@wisc.edu. The DAC bringup script I am running is the same as the one given by the link above but this is handy to know for future reference. I am going to reflash both of my DACs with the same firmware version tomorrow and I will experiment with different daisy chain cable lengths (2' and 3'). Perhaps both of my DACs are running on different versions of build #8. I am still waiting on the terminals you guys recommended for my power supply to arrive. Thanks for all of your help thus far @jwenner, @DanielSank, and @ejeffrey.

jwenner commented 8 years ago

@amopremcak, I'm sorry about the delay in sending you the ScanPhases and FPGA code; I was updating the TWiki instructions so they don't refer to out-of-date versions of ScanPhases and PrintPhases.

I just emailed the ScanPhases, PrintPhases, and FPGA code to you. Could you please let me know if you get it? Note that, due to the number of FPGA SOF/POF files, it is over 10MB. If you don't get the email, I can break it up into multiple smaller emails.

Feel free to let me know if you have questions.

amopremcak commented 8 years ago

Hey Jim,

No worries on the delay in sending these files. Updating the wiki instructions will be very helpful for me as I continue to work with these boards. I'll be sure to ask if I have any questions. Thanks for your help.

-Alex

jwenner commented 8 years ago

@amopremcak, so have you gotten the email with the files yet, or do I need to resend it?

amopremcak commented 8 years ago

Yes the email included the files @jwenner. Thanks again.

DanielSank commented 3 years ago

I've deleted a message posted here that was spam. Discussion here should be limited to DAC Build 8 synchronization and not e.g. requests for access to the Martinis lab's Wiki.

labrad / servers

DAC Build 8 synchronization problem #300