psi46 / pxar

Life is too short for perfection

decoding errors @ bump bonding test #238

Closed minano closed 9 years ago

minano commented 9 years ago

Using pxar-1.6.2 and sw 3.4:

[12:23:35.733]     INFO:    PixTestAlive::aliveTest() ntrig = 10, vcal = 200 (ctrlreg = 0)
[12:23:35.733]     INFO:    ----------------------------------------------------------------------
[12:23:36.544]     INFO: Test took 811ms.
[12:23:36.568]     INFO: PixTestAlive::aliveTest() done
[12:23:36.568]     INFO: number of dead pixels (per ROC):     0
[12:23:36.590]     INFO:    ----------------------------------------------------------------------
[12:23:36.590]     INFO:    PixTestAlive::maskTest() ntrig = 10, vcal = 200 (ctrlreg = 0)
[12:23:36.590]     INFO:    ----------------------------------------------------------------------
[12:23:37.340]     INFO: Test took 749ms.
[12:23:37.374]     INFO: PixTestAlive::maskTest() done
[12:23:37.374]     INFO: number of mask-defect pixels (per ROC):     0
[12:23:37.404]     INFO:    ----------------------------------------------------------------------
[12:23:37.404]     INFO:    PixTestAlive::addressDecodingTest() ntrig = 10, vcal = 200 (ctrlreg = 0)
[12:23:37.404]     INFO:    ----------------------------------------------------------------------
[12:23:38.207]     INFO: Test took 803ms.
[12:23:38.238]     INFO: PixTestAlive::addressDecodingTest() done
[12:23:38.254]     INFO: number of address-decoding pixels (per ROC):     0
[12:23:38.269]     INFO: PixTestAlive::doTest() done 
[12:23:46.965]     INFO: ######################################################################
[12:23:46.965]     INFO: PixTestBBMap::doTest() Ntrig = 5, VcalS = 200 (high range)
[12:23:46.965]     INFO: ######################################################################
[12:23:46.966]     INFO: ---> dac: VthrComp name: calSMap ntrig: 5 dacrange: 0 .. 170 hits flags = 2 (plus default)
[12:24:03.492]  WARNING: USBInterface: Read(): data not ready (got 0b of 1b) after 15000ms yet! Will wait for up to 150000ms
[12:24:17.704] CRITICAL: <api.cc/getDecoderErrorCount:L2115> A total of 8 pixels could not be decoded in this DAQ readout.
[12:24:17.704]     INFO: Test took 30721ms.
[12:24:18.015]     INFO: PixTestBBMap::doTest() done with 8 decoding errors: 
[12:24:18.015]     INFO: number of dead bumps (per ROC): 
[12:24:18.015]     INFO: separation cut       (per ROC): 
----------------------------------------------------------------------------------------------------------------------

The test is not showing any plot. I was using the same DAC parameters that worked with an earlier version of pxar.
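For anyone reading a log like the one above, it can help to strip it down to the WARNING and CRITICAL lines first. The following is only a minimal sketch: it assumes the `[HH:MM:SS.mmm] LEVEL: message` layout seen in this log, and the function name `problems` is made up for illustration (it is not part of pxar).

```python
import re

# Layout assumed from the log above: "[HH:MM:SS.mmm] LEVEL: message"
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2}\.\d{3})\]\s+(\w+):\s*(.*)$")


def problems(log_text):
    """Return (timestamp, level, message) for WARNING and CRITICAL lines."""
    found = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group(2) in ("WARNING", "CRITICAL"):
            found.append(match.groups())
    return found


# Four lines copied from the log in this report.
SAMPLE = """\
[12:23:46.966]     INFO: ---> dac: VthrComp name: calSMap ntrig: 5 dacrange: 0 .. 170 hits flags = 2 (plus default)
[12:24:03.492]  WARNING: USBInterface: Read(): data not ready (got 0b of 1b) after 15000ms yet! Will wait for up to 150000ms
[12:24:17.704] CRITICAL: <api.cc/getDecoderErrorCount:L2115> A total of 8 pixels could not be decoded in this DAQ readout.
[12:24:17.704]     INFO: Test took 30721ms.
"""

for stamp, level, message in problems(SAMPLE):
    print(stamp, level, message)
```

On the sample this keeps the USB read warning and the decoding error and drops the INFO lines.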

ursl commented 9 years ago

May I ask what you expect here? The log shows that you had a serious USB problem. It is anybody's guess as to what data you received. I think with the information you provided it is impossible to tell what went wrong. The test tells you that it had errors. Do you want it to do something else?

Cheers, --U.

simonspa commented 9 years ago

Dear Urs,

the log above shows no "serious USB problem". It just tells you that the readout is a bit slower than we expect (exceeding the normal USB timeout set in libusb). The reason for this has been examined, understood and solved (usage of libFTDI instead of the recommended libFTD2XX).

However, it does not have anything to do with the test "result", since all data has been transferred and the number of events is also correct (the test returned properly). I can reproduce the above behaviour just by running a BBTest on a single ROC (bare module, probe card). The test finishes without producing any plot.

Examining this behaviour and also reproducible crashes of PixTestScurve, I dug a bit in PixTest.cc and found some interesting features:

https://github.com/psi46/pxar/blob/master/tests/PixTest.cc#L192 the variable used here (fNDaqErrors) is not set anywhere in the scurveMap function, only in other functions. Since it is a member variable and not a local one, the result of this if statement now depends on what you did before running the test. Just returning the empty resultMaps then leads to a crash in the scurve test due to a NULL pointer or so. When removing this line and also the next if statement, the results look fine and are reproducible.

Not sure if this is some leftover from cleanup work.
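To make the pattern described here concrete, below is a small sketch with hypothetical names. It is not a copy of PixTest.cc, and it does not settle whether the counter is in fact updated on that code path. It only shows why checking a member counter that the current function never refreshes makes the outcome depend on earlier calls.

```python
class StaleCounterTest:
    """Hypothetical stand-in for a test class; not the pxar code."""

    def __init__(self):
        self.n_daq_errors = 0  # plays the role of fNDaqErrors

    def other_scan(self, errors_seen):
        # Some other function records the errors of its own readout.
        self.n_daq_errors = errors_seen

    def scurve_map(self, data):
        # Pattern under discussion: the counter is checked but not updated
        # here, so the outcome depends on whatever ran before.
        if self.n_daq_errors > 0:
            return []
        return list(data)

    def scurve_map_fixed(self, data, errors_seen):
        # Variant: refresh the counter from this readout before checking it.
        self.n_daq_errors = errors_seen
        if self.n_daq_errors > 0:
            return []
        return list(data)


test = StaleCounterTest()
test.other_scan(errors_seen=8)      # an earlier readout had 8 errors
print(test.scurve_map([1, 2, 3]))   # empty, although this readout was clean
print(test.scurve_map_fixed([1, 2, 3], errors_seen=0))
```

Whichever variant is used, a caller that receives an empty result should check for it before dereferencing anything, which is the second half of the complaint above.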

Cheers, Simon

simonspa commented 9 years ago

One more question: there are no actual SCurves of single pixels plotted? That would be useful.

ursl commented 9 years ago

Immediately before fNDaqErrors is checked, dacScan is called, and that is where it is filled. Not sure where your problem is.

And no, we no longer plot a zillion s-curves.

simonspa commented 9 years ago

Hi Urs,

my problem is a crashing test. Just because 3 pixels out of 5*170*4610 could not be decoded (e.g. noise pickup with the digital probe card), we shouldn't drop a test completely. And even if we do, pxar should still not crash. That's my problem.
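To put a number on that, here is the arithmetic with the three factors quoted above, followed by a purely hypothetical acceptance rule. The name `acceptable` and the `max_fraction` default are assumptions made up for this sketch; pxar does not necessarily offer anything like it.

```python
n_trig = 5         # first factor quoted above
n_dac_steps = 170  # second factor quoted above
n_pixels = 4610    # third factor quoted above
n_errors = 3

n_readouts = n_trig * n_dac_steps * n_pixels
fraction = n_errors / n_readouts
print(n_readouts, fraction)


def acceptable(n_errors, n_readouts, max_fraction=1e-5):
    """Hypothetical policy: tolerate a small fraction of undecodable pixels."""
    return n_errors / n_readouts <= max_fraction


print(acceptable(n_errors, n_readouts))
```

With these factors there are 3,918,500 pixel readouts, so three undecodable ones are well below one in a million.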

And concerning the s-curves: we don't need all, but maybe one just to see it. I find it useful, especially when dealing w/ noise issues.

simonspa commented 9 years ago

P.S. Without those lines the BBTest also works like a charm.

ursl commented 9 years ago

Hi Simon,

which test crashed? Please post an issue. In the original report, the test did not crash.

The bump bonding test runs without problems if you have no readout problems.

Since checking and retrying for bad readouts is not handled by core (except for throwing, but this is not handling, let alone a solution), there are many instances of work-arounds in user/testcode. Obviously, more such work-arounds are needed. They will come eventually.

Cheers, --U.

simonspa commented 9 years ago

Dear Urs,

We had this discussion already a couple of times now - and I'm still convinced that it would be wrong to move retry loops into the pxarCore library and thus away from any influence of the user. The reason for this is that most of the problems cannot be solved by just re-trying but by correcting the problems in the test setup (see below). The detector should work stably before attempting to do calibration. Rerunning tests with the same configuration just because it fails is a very poor attempt to get results.

Quoting from the mail that just went to HN:

There are several types of errors which can occur, the most prominent being "Missing Events" and "Decoding Errors", which usually have different sources as explained below. In almost all cases where a test fails with missing events or decoding errors, the problem can be found somewhere in the test setup, be it wrong DAC settings, missing or wrong cabling, noise pick-up from whatever sources are around, or something else.

In many cases the type of error that is thrown already pinpoints a part of the readout chain that might be problematic. However, the reasoning is slightly different for single ROC operation and for module operation with a TBM:

Missing Events:

This error means that not all of the triggers sent to the detector resulted in an event being read out and stored in the DTB RAM, i.e. the deserializer in use failed to detect the beginning and/or end of an event in the data stream.

Decoding Errors:

Here the situation is slightly different. pxar not complaining about "Missing Events" means that we recorded exactly as many events as we sent triggers (and possibly tokens) to the device. Decoding errors appear if the data (i.e. pixel hit information) within the well-separated events is malformed.
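The distinction between the two error types can be written down as a simple ordering of checks. This is a sketch only: `classify_readout` and its arguments are hypothetical names and do not correspond to a pxar function.

```python
def classify_readout(n_triggers, n_events, n_undecodable_pixels):
    """Name the error type from the counts of one readout (hypothetical helper)."""
    # Fewer events than triggers: event boundaries were lost in the data stream.
    if n_events < n_triggers:
        return "Missing Events"
    # All events are present, but some pixel hit data inside them is malformed.
    if n_undecodable_pixels > 0:
        return "Decoding Errors"
    return "OK"


print(classify_readout(n_triggers=100, n_events=97, n_undecodable_pixels=0))
print(classify_readout(n_triggers=100, n_events=100, n_undecodable_pixels=8))
print(classify_readout(n_triggers=100, n_events=100, n_undecodable_pixels=0))
```

The event count is checked first because a decoding error count only says something about the pixel data once the events themselves are known to be complete.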

ursl commented 9 years ago

Hi Simon,

yes, but no. What you write is very good in principle, but we have to work with the presently existing setups (and that includes DTB/fw/core).

I have never proposed to rerun tests. What is working is to repeat api calls which had resulted in errors. To repeat a failed test is most likely useless. I think that a module fulltest with the current fw/api/readout is completely unrealistic without counteracting spurious r/o problems. Without a major effort to increase the readout stability (this effort might possibly boil down to improving PixTestSetup, but is probably not limited to that) these repeated api calls are a necessity to run a fulltest.
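As an illustration of repeating a single API call rather than a whole test, here is a minimal sketch. `ReadoutError`, `call_with_retries` and `flaky_readout` are hypothetical names chosen for the example and are not part of the pxar API.

```python
import time


class ReadoutError(Exception):
    """Hypothetical error type standing in for a failed readout."""


def call_with_retries(api_call, max_attempts=3, pause_s=0.0):
    """Repeat one API call (not the whole test) until it succeeds."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return api_call()
        except ReadoutError as error:
            last_error = error
            time.sleep(pause_s)
    # Every attempt failed: hand the last error back to the caller.
    raise last_error


attempts = {"count": 0}


def flaky_readout():
    # Fails twice, then succeeds, to mimic a spurious readout problem.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ReadoutError("decoding errors in readout")
    return "data"


print(call_with_retries(flaky_readout))
```

The retry limit keeps a persistent setup problem visible: after `max_attempts` failures the error still reaches the caller instead of being silently absorbed.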

I agree that the experimental setups should be set up properly. Of course. However, it is my experience that I run into spurious r/o problems even if I did all I know to set up everything properly (and that is far away from Peltiers, probe stations and whatnot). Maybe/likely we have to improve the "setup" test (and in my experience, I also need to manually tune the clk plus associated parameters for modules), BUT for that we need user feedback. Instead of posting that the BB test did not succeed after a USB problem, it would be much more useful to post what had to be changed to make the USB problems go away.

And finally, I wrote

... more such work-arounds are needed. They will come eventually.

I was not arguing that they should move into pxarcore.

Cheers, --U.

On Fri, Oct 17, 2014 at 5:21 PM, simonspa notifications@github.com wrote:

There are several types of errors which can occur, the most prominent being "Missing Events" and "Decoding Errors", which usually have different sources as explained below. In almost all cases where a test fails with missing events or decoding errors, the problem can be found somewhere in the test setup, be it wrong DAC settings, missing or wrong cabling, noise pick-up from whatever sources are around, or something else.

In many cases the type of error that is thrown already pinpoints a part of the readout chain that might be problematic. However, the reasoning is slightly different for single ROC operation and for module operation with a TBM:

Missing Events:

This error means that not all of the triggers sent to the detector resulted in an event being read out and stored in the DTB RAM, i.e. the deserializer in use failed to detect the beginning and/or end of an event in the data stream.

- Single ROCs: Most likely the DTB delays are not correctly configured and thus the output signal is distorted and/or shifted in time. Use the PixTestSetup test in pxarGUI to scan for correct delay settings and retry. Another possibility is that the readout contains many noisy pixels and is thus chopped in the middle by the next token already being sent out. Unlike the TBM, the DTB firmware currently has no "safety check" that waits for the token out signal from the ROC before allowing the next token to be sent. So check that noisy pixels are masked and/or the threshold is well above noise level. It has also been reported that external noise pick-up can affect data transmission. If none of the above solves your problems, make sure your setup environment is correctly grounded, e.g. try switching off Peltiers etc. to find the source of the problems.

- Module: Most likely the TBM has an issue. The events here should be clearly separated by the TBM header and trailer, and the DESER400 should usually align automatically with the incoming data using the idle pattern (technical side note: we force the DESER400 to re-sync every time before a test is started). However, missing events usually mean that the deserializer is not able to lock onto the data stream. First check whether the incoming signal looks reasonable using a scope on the A1 output of the DTB, and switch on the SDATA1 signal (the other one is the second 400MHz stream for the TBM09) by selecting "sdata1" from the drop-down menu (GUI) or by running "SignalProbe a1 sdata1" or similar on the command line. We have also observed that too high signal levels for the outgoing DTB signals (clk, ctr, sda) seem to affect some TBMs and produce invalid output data. Check if lowering the signal levels helps to improve data quality (global DTB setting for all signals: "level [0-15]", to be put into "tbParameters.dat"; the default is 15, some TBMs needed levels as low as 4). And finally, if nothing helped, check if raising the digital supply voltage Vd to 2.7 V improves anything. The setting can be found in configParameters.dat and is expressed in millivolts there (if you are on the command line: pxarCore uses SI units, so put full Volts).
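Two of the settings mentioned above are easy to get wrong by a unit or a range, so here is a small sanity-check sketch. The helper names are hypothetical. The only facts taken from the mail are that Vd is given in millivolts in configParameters.dat but in volts for pxarCore, and that the global DTB signal level ranges from 0 to 15.

```python
def vd_millivolts_to_volts(vd_mv):
    """Convert a configParameters.dat style value (mV) to volts for pxarCore."""
    return vd_mv / 1000.0


def check_level(level):
    """Validate the global DTB signal level, which the mail gives as 0-15."""
    if not 0 <= level <= 15:
        raise ValueError(f"level must be within 0..15, got {level}")
    return level


print(vd_millivolts_to_volts(2700))  # the 2.7 V suggested above, given in mV
print(check_level(4))                # a low level that some TBMs needed
```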

Decoding Errors:

Here the situation is slightly different. pxar not complaining about "Missing Events" means that we recorded exactly as many events as we sent triggers (and possibly tokens) to the device. Decoding errors appear if the data (i.e. pixel hit information) within the well-separated events is malformed.

- Single ROC: Corrupt output data can be produced by very noisy pixels that might flood the ROC buffers. Check that your threshold is high enough and that you are not in the noise region for any pixel. Especially on probe stations, the readout using the probe card and needles is reported to be extremely sensitive to noise pick-up, e.g. from a badly grounded or ungrounded vacuum chuck. There have been extensive reports about this, e.g. by Beat Meier at Tracker Week meetings this year.

- Module: In modules the ROC data is transmitted to the TBM via the HDI. Because of small timing jitter in the transmission, the TBM has dedicated registers to set delays on incoming ROC data for every port (group of four ROCs). This register can be programmed for every TBM core separately using the register name "delays". Either put it into your config file (tbmParameters_Ca/b.dat) or change it interactively using "setTbmReg('delays',value)".


ursl commented 9 years ago

BTW, now "problematic" or "all" scurves can be dumped (after enabling the new checkboxes).