orgua / OneWireHub

OneWire slave device emulator
GNU General Public License v3.0
347 stars 89 forks source link

No presence after an aborted Search ROM #63

Open kutukvpavel opened 6 years ago

kutukvpavel commented 6 years ago

Hi, I am trying to develop a DS1990A emulator that is supposed to work with some generic access control systems and I've stumbled upon two issues which original 1990-s do not exhibit.

The first one is, the library does not generate a presence pulse when the slave device is first connected to the bus (i mean, physically). Obviously, master (that does not have to continuously poll the line) does not know that a device has been connected and therefore does not start communication. Of course this issue can be solved pretty easily by monitoring the bus "manually" prior to calling hub.poll(), but as an essential part of any ds2401-like slave's functionality it really should be a part of the lib. Also very long low pulses should be probably considered a disconnect from the bus... And a large pulldown resistor (I am using 100k) has to be present to avoid floating. Not sure how to implement it though, use RESET_TOO_LONG maybe.

As for the second one... I've come across a master device that issues only two timeslots after sending 0xF0 (Search ROM) command and then aborts the transaction issuing a reset (see attachment). Of course, the master expects a presence pulse to follow that reset, but OneWireHub does not generate it (at all, with out-of-the-box timings). Consequently, master restarts communication process again and again thinking that the device is disconnected every time. I've created a clone of that master using Arduino Nano and I've been able to reproduce the issue right away (and perform some tests). While I was searching for a solution I found out that I was using an old version of the library (2.0.0) and I upgraded it to the latest, however it didn't help. Using v 2.0.0 I've been able to make presence pulse appear occasionally (almost 50/50 probability ratio) by playing with timings (i reckon I mostly reduced timeslot length and increased write zero time, tweaked reset/presence length just a little bit), but when I updated the lib config was overwritten and I never managed to get presence again. In 2.0.0 the problem was somehow dependent on previous transactions. My particular master sends three commands in the following order: 0x33, 0x03, 0xF0. The problem disappeared when I changed the order to 0x03, 0x33, 0xF0. Yes, 0x03 is a non-standard one and my slave is not supposed to answer it at all, its purpose is to identify rewritable (and non-finalized) devices such as RW1990 (for them 0x03 and 0x33 are identical). Once again, for latest version order makes no difference - the issue just persists, no matter what.

I've looked through the library code and found no obvious flaws: the workflow seems completely legit. SearchIDTree exits with RESET_IN_PROGRESS error when comes across a reset pulse instead of a timeslot, then poll starts again from the beginning, i.e. from reset detection. But I can't understand why reset detection fails. Is time consumed somewhere and the pulse becomes too short to be identified as reset? My code basically calls only hub.poll() inside loop() and performs a single direct pin reading, so delays between polls are almost as short as possible. Only one DS2401 is attached to the hub. No serial or debug whatsoever. Calibration seems to be spot-on, ONEWIRE_TIME_WRITE_ZERO remains within 1uS tolerance according to logic analyzer. I'm running Arduino Uno @ 16MHz, arduino IDE v1.8.5. Here is some logic analyzer data related to the second issue. Green is genuine, blue is emulated. Look at the last pulse (and its absence). image Overall communication structure remains the same. image

kutukvpavel commented 6 years ago

Commenting out this line (OneWireHub.cpp, 283): if (!DIRECT_READ(pin_baseReg, pin_bitMask)) return true; // just leave if pin is Low, don't bother to wait, TODO: really needed? seems to solve the issue. Even with RESET_MIN set as per datasheet and with a master that generates just 10uS longer pulse (attachment). Further testing needed though. Well, this makes sense. After an aborted transaction checkReset() starts when bus is already low. That line explicitly makes the hub ignore this fact and wait for next reset. Why that line even exists? image

Edit: Added some code between polls and noticed some random presence failures again. Then I noticed the following... How is this skip supposed to work when error is always cleared? It seems it is responsible for handling a reset which is already in progress. image

I added conditional operator to the first line and failures disappeared. if (_error != Error::RESET_IN_PROGRESS) _error = Error::NO_ERROR;

Edit2: Overlooked while loop somehow. Now I get how it is supposed to work. Removed the condition and it still works as expected. Probably just some randomness is going on.

Edit3: Added this image to check whether this block is ever executed. And no, it is not. The only reason why it is not executed is this if-block is never entered. image Well, makes sense, if recvBit() is entered when previous timeslot (the one before reset) is still active. Then PRESENCE_IN_PROGRESS is to be caught inside next sendBit() called by searchIDTree() during next interation, but since this error is never detected inside checkReset() I may have got lucky: searchIDTree seems to return after execution of only one iteration due to reaching this: image Hub configuration file contains READ_MAX constant. Probably we have to compare pulse length to MAX here: image

kutukvpavel commented 6 years ago

Well, attempt to modify recvBit() failed. OW_T_READ_MIN_TO_MAX is constexpr-ed READ_MAX - READ_MIN. // wait a specific time to do a read (data is valid by then)

retries = ONEWIRE_TIME_READ_MIN[od_mode];
while ((DIRECT_READ(pin_baseReg, pin_bitMask) == 0) && (--retries != 0));
if (retries > 0) return true; //Exit if 1 was received
//Otherwise check pulse length to detect possible reset
retries = OW_T_READ_MIN_TO_MAX[od_mode];
while ((DIRECT_READ(pin_baseReg, pin_bitMask) == 0) && (--retries != 0));
if (retries > 0) return false; //Return if OK
//Return error if the pulse is too long
_error = Error::RESET_IN_PROGRESS;
return true; 

With this code hub no longer responds to any commands but reset, though it adds about 15uS of execution time only. Obviously, the quick and dirty solution now (with line 283 commented out) is to decrease RESET_MIN value, but I'll try to find better one. P.S. WTF happens to the code block here?

Edit: For now I've made the following modifications to checkReset(). image They are designed to ignore semi-missed reset pulses during first iteration of the while loop inside hub.poll() and to process them during all the other iterations. is_following_poll is a global bool that is set inside poll(): image I have a constant feeling that checkReset() can be nicely refactored, but I'm afraid to touch overdrive sections.

joysfera commented 5 years ago

@kutukvpavel have you got it working reliably at last? I'm facing serious troubles when the OneWireHub is to share the bus with real DS18B20. The timing in OneWireHub_config.h seems completely off and it kinda blocks the real DS18B20 due to too long presence pulse (or what).

Even the author's comments in the following code seems like they have no idea what's going on:

constexpr timeOW_t ONEWIRE_TIME_PRESENCE_MIN[2]      = {   160_us,  8_us }; // was 125
constexpr timeOW_t ONEWIRE_TIME_PRESENCE_MAX[2]      = {   480_us, 32_us }; // should be 280, was 480
mpeg71 commented 5 years ago

Is there anyone who has successfully emulated a ds2505?

I have a code but it just works with the first 8 byte....

Von meinem iPhone gesendet

Am 24.01.2019 um 00:07 schrieb Petr Stehlík notifications@github.com:

@kutukvpavel have you got it working reliably at last? I'm facing serious troubles when the OneWireHub is to share the bus with real DS18B20. The timing in OneWireHub_config.h seems completely off and it kinda blocks the real DS18B20 due to too long reset pulse (or what).

Even the author's comments in the following code seems like they have no idea what's going on:

constexpr timeOW_t ONEWIRE_TIME_PRESENCE_MIN[2] = { 160_us, 8_us }; // was 125 constexpr timeOW_t ONEWIRE_TIME_PRESENCE_MAX[2] = { 480_us, 32_us }; // should be 280, was 480 — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

kutukvpavel commented 5 years ago

@joysfera Indeed I have. Though I did not experience any major issues with timings and common 1-wire communication. My problems were related to not-so-common usage scenarios (tricky master device and having to handle connection/disconnection from the bus). As for the timings, they worked out-of-the-box, despite being a bit off (and I don't understand the comments too). Overall I don't find this lib itself buggy of unreliable, it's actually quite a good one for common applications such as ds18. Therefore I suspect that your issues have simpler causes than mine. Let's try to figure them out, starting from some obvious stuff.

  1. What device are you using? Is it clocked right?
  2. Are you using example ds18b20 thermometer code or some custom stuff?
  3. When exactly the problem arises (when emulated DS is connected together with a genuine one, or even when emulated DS is alone on the bus etc)?
  4. What have you tried so far? Do you have a logic analyzer?
joysfera commented 5 years ago

@kutukvpavel thanks for your interest. For emulating DS18B20 slave I use "Pro Mini", an arduino clone with ATMEGA328P @ 16.000000 MHz.

The problem arises when emulated DS18B20 is connected together with recent buggy/fake/counterfeit Chinese DS18B20s that don't cope with the unprecise timing. Specifically it seems (judging from my logical analyzer sleepless nights) that the too long PRESENCE_MIN time causes the Chinese fake DS18B20 to reset or something. Then they start communicating in the middle of OneWireHub talking and the whole bus goes to hell.

Unfortunately noone but me yet proved that the Chinese DS18B20 are buggy counterfeits so whenever you connect these together with the emulated ones the suspicion falls onto the emulated one.

Anyway, the OneWireHub is also to blame, because even though the PRESENCE_MIN is set to 160 us (and the comment right next to it says "was 125 us") it actually generates pulse more than 272 us long (which is out of spec: 60 to 240 us). So there's something fishy in the library as well. I have yet to figure it out where the additional 112 us is spent...

I tried shortening the PRESENCE_MIN and _MAX times (because the _MAX time is also weird - 480 us with comment "was 280 us"), got it working more reliably at first but it seems that in some extended periods of time (say a week or two weeks) something bad happens and the library stops communicating altogether (not happened to me yet, but others using the emulated device reported it to me).

So all in all, the library seems OKish at first but together with other real devices on the same bus, especially with Chinese DS18B20, does not work at all or causes weird issues.

kutukvpavel commented 5 years ago

@joysfera

it actually generates pulse more than 272 us long

Ok, then I assume the problem is due to emulator. No 1-wire device can be expected to work flawlessly side-by-side with this, chinese clones are not to blame. (actually I've never seen a genuine device operating at ~200uS presence, 130-150 is more common IIRC) Is that value at least consistent? What about other timings, are they off too? The lib measures time in while-loop-cycles. Since I know that my copy produces values inside a margin of error (let's say 5%), the cause must be related to external interference. Two things are straightforward to blame: calibration parameters and use of interrupts (showPresence() does not disable them). I suppose you didn't change VALUE_IPL for AVR, but it relies on microsecondsToClockCycles(), which relies on F_CPU. Is your build environment standard arduino IDE? Since you have a logic analyzer, have you tried calibration sketch from examples with GPIO-debug enabled? Is 1mS really 1mS?

Btw, just looked through config and noticed: // should be --> datasheet // was --> shagrat-legacy "was" relates to values that existed long time ago before forking ancient source from another person's repo. God knows why his timings were so strange back then, but nowadays it has nothing to do with our problems.

joysfera commented 5 years ago

@kutukvpavel it seems consistent: 160 us delay takes 278 us while 60 us delay takes about 104 us. There seems to be a hidden 1.75 multiplier that increases the presence pulse length. Other timings seem to be OK. I didn't change VALUE_IPL and I use standard Arduino IDE 1.8.5. BTW, the calibration sketch does not work for me - it just prints its header and that's it. Probably gets stuck in a loop somewhere.

EDIT: The waitLoops1ms() takes 892 us (the time the debug is in HIGH) and 675 us (time in LOW). FYI, I also measured Arduino's delay() and delayMicroseconds() - both were precise.

EDIT2: After fixing the calibration sketch it prints 13 instructions per loop.

EDIT3: the normally disabled part of the calibration sketch, when enabled, shows that the "try to wait 1 ms while LOW" takes 930 us and the "try to wait 1 ms while HIGH" part takes 852 us. The second block takes 853 and 847 us. I have no idea if these numbers are "correct" for the library. What's yours numbers?

joysfera commented 5 years ago

It seems like the wait() function is not working correctly. Both wait(160) and wait(160_us) take at least 230 us.

And the wait() is used at single place only: when the PRESENCE pulse is generated. So it seems 100% consistent with what I measured: library seems to work OK but the presence pulse is about 75 % longer than configured.

So it seems safe to me to lower the ONEWIRE_PRESENCE_TIME_MIN from 160_us to 60_us that makes it about 104 us in real time, very much like original DS18B20.

I created #72 for this. Thank you @kutukvpavel for kinda encouraging me to spend yet another Sunday on this issue.