qzind / tray

Browser plugin for sending documents and raw commands to a printer or attached device.
https://qz.io
Other
837 stars 272 forks source link

Cannot read property 'version' of null after upgrading to 2.1.0 RC7 #524

Closed lorro closed 4 years ago

lorro commented 4 years ago

Hi,

A while ago we developed a POS software which has been working fine so far. Printing of receipts and opening of the cash drawer runs through the QZ tray software. The first cash registers that have been registered in the POS are using our own compiled version of QZ Tray 2.1.0-RC5, without issues.

However, the last couple of cash registers that we registered in the POS software are using a newer compiled version of QZ Tray 2.1.0-RC7. Now regulary some issues pop up when an employee tries to print/open the drawer. The following javascript error is returned regularly.

TypeError: Cannot read property 'split' of undefined (Most recent call first)
File webpack:///./own_modules/qz-tray/qz-tray.js line 627 col 1 in compatible
   var semver = _qz.websocket.connection.version.split(/[.-]/g);
File webpack:///./own_modules/qz-tray/qz-tray.js line 1169 col 1 in print
   _qz.tools.compatible(data[d]);
TypeError: Cannot read property 'version' of null (Most recent call first)
File webpack:///./own_modules/qz-tray/qz-tray.js line 627 col 1 in compatible
   var semver = _qz.websocket.connection.version.split(/[.-]/g);
File webpack:///./own_modules/qz-tray/qz-tray.js line 1169 col 1 in print
   _qz.tools.compatible(data[d]);

After doing a manual browser refresh printing works again. That's however not really a solution as the POS is a single page application.

Anyone has an idea what direction we should look into to fix this issues? Has any remarkable changes been made that could cause this behavior? The connection with QZ Tray is setup on first page load. Is the connection with QZ tray lost for some reason perhaps?

Thanks

tresf commented 4 years ago

If _qz.websocket.connection.version is null or undefined, it's a sign that the websocket has closed, nulling it out.

This usually occurs if something closed out the connection (observable by the QZ Logs). Often the connection is not obvious, but can be caused by page navigation (back/POST/DOM reload/variable redefine).

lorro commented 4 years ago

Hi @tresf, I understand this indicates the websocket has been closed, but it's kinda odd that all the cashregisters that are running 2.1.0-RC5 have never had this and still don't have. The few cash registers that are using 2.1.0-RC7 have several notifications of this on a daily basis.

Until now I haven't been able to reproduce the issue on my own physical device either :/ I ran some tests with 2.1.0-RC7 too but everything worked fine. So I had some suspicions the connection got timed out f.e.

I'll try if I can see if there's any useful info in these logs. The connection is actually initiated by me when the root node of the application (App.vue) gets mounted to the DOM.

tresf commented 4 years ago

@lorro have you collected the logs from the workstations? There may be something indicative of a crash.

Pinpointing the code changes between RCs in an attempt to identify if a code change has caused this is a bit difficult due to the "unreleased" state of the 2.1 branch, but we can infer...

This introduces 11 months of commits -- including anything synced over from the 2.0 branch -- changes are pretty massive, so catching a crash in the logs that's taking down the websocket might be our best path...

https://github.com/qzind/tray/compare/ef3d2aa4e3a22f456177c08cc6ce9756f99aabda...2.1

lorro commented 4 years ago

@tresf I managed to get an export of the debug.log of two different cash registers, one with RC5 and the other RC7. In both files I found a few lines that indicate a closing of the connection. However in RC7 it's often because of a "WebSocket Read EOF". In RC5 I only see it because of "Disconnected". I reckon this is by shutting down the browser or computer. But never because of "Websocket Read EOF".

RC7:

2019-11-25 13:58:13,887 [DEBUG] Searching for PrintService matching CITIZEN CT-S310II
2019-11-25 13:58:13,888 [DEBUG] Found 6 printers
2019-11-25 13:58:13,888 [DEBUG] Found match: CITIZEN CT-S310II
2019-11-25 13:58:13,888 [DEBUG] Found Resolution: 20300x20300 dphi
2019-11-25 13:58:13,909 [DEBUG] Parsed 1 html records
2019-11-25 13:58:14,065 [DEBUG] Paper area: 7,0:212,9286
2019-11-25 13:58:14,065 [DEBUG] Document will be printed across 1 pages
2019-11-25 13:58:14,920 [INFO] Printing complete
2019-11-25 14:18:42,461 [INFO] Connection closed: 1006 - WebSocket Read EOF
2019-11-25 14:18:42,462 [INFO] Closing all communication channels for <xxx>
2019-11-25 15:17:51,766 [INFO] Connection opened from /0:0:0:0:0:0:0:1:61856 on socket port 8181

RC5:

2019-11-20 14:01:02,551 [DEBUG] Searching for PrintService matching CITIZEN CT-S310II
2019-11-20 14:01:02,552 [DEBUG] Found 4 printers
2019-11-20 14:01:02,552 [DEBUG] Found match: CITIZEN CT-S310II
2019-11-20 14:01:02,552 [DEBUG] Found Resolution: 20300x20300 dphi
2019-11-20 14:01:02,568 [DEBUG] Parsed 1 html records
2019-11-20 14:01:03,086 [DEBUG] Paper area: 7,0:212,9286
2019-11-20 14:01:03,097 [DEBUG] Document will be printed across 1 pages
2019-11-20 14:01:03,586 [INFO] Printing complete
2019-11-20 16:41:27,950 [INFO] Connection closed: 1006 - Disconnected
2019-11-20 16:41:29,463 [INFO] Closing all communication channels for <xxx>
2019-11-20 16:41:48,096 [INFO] Connection opened from /0:0:0:0:0:0:0:1:64850 on socket port 8181

Does this mean anything to you? :) Thanks

tresf commented 4 years ago

Does this mean anything to you? :)

Yes, sort of... We started catching this error in our RC9 version due to -- what we believed to be -- some issues with the newer Jetty libraries.

What's strange is on RC9 it just caused a popup tray notification, it wasn't causing the qz websocket object to die. The logs don't inditicate it would cause it to die either (nor do yours).

Would you be willing to share the full logs with us? support@qz.io?

Then we can see if there's something unobvious happening.

lorro commented 4 years ago

Would you be willing to share the full logs with us? support@qz.io?

I just sent you an email with the full log.

tresf commented 4 years ago

Thanks. I notice that the logs show both {type: "pixel" ... }and {type: "raw" ... }. Does the problem appear to occur more with one over the other?

I can't see anything obviously wrong with them either. I'm not sure why the websocket is dying on you.

Connection closed: 1006 - WebSocket Read EOF

What confuses me is that both logs indicate Closing all communication channels, so if this is timing with a page navigation, it should be consistent between versions. Furthemore, we've never had this specific problem reported to us where it was NOT a JavaScript coding issue.

Can you make an example that causes this issue in a stand-alone unit test? If so we can run the software in debug and see what's going on.

lorro commented 4 years ago

@tresf: I managed to replicate this error on my own physical device now. I noticed this should have something to do with the version of Java that is installed on the machine. (And not necessarily with the version of QZ Tray that is installed).

I just updated my Java to the latest version (version 8 update 231) and the first time I tried printing something after starting the QZ Tray application I got the "cannot read property version of null" notification.

After refreshing the browser, printing does work.

tresf commented 4 years ago

@lorro thanks. Once you have some steps/code for us, we'll happily take a look.

lorro commented 4 years ago

@tresf: Sorry, I think I took my conclusions a bit too early. Seems I just didn't reload our application after restarting the QZ Tray application (after the update to the newest Java version). Hence the QZ Tray connection wasn't re-established so I got that error.

It could be related to the Java version but actually we're not sure yet. We also tried downgrading to RC5 on one of the affected cash registers that were installed with RC7 and after that the error persists so it's probably not directly related to the build version. So the only difference between cash registers that have issues and those that have not is probably the Java version. But it's hard to replicate as it's an issue that only pops up once in a while.

P.S. For now I attached a callback to the setClosedCallbacks function to re-establish the connection whenever it's lost.

qz.websocket.setClosedCallbacks(() => {
  setTimeout(() => {
    this.connectToPrinter()
   }, 100)
})
tresf commented 4 years ago

P.S. For now I attached a callback to the setClosedCallbacks function to re-establish the connection whenever it's lost.

Great, thanks.

One more thing I was thinking about last night about this problem is that certain WebSocket implementations failed sporadically for us in our unit tests. Is there a chance you're using our API overrides to use a 3rd-party (non-browser) WebSocket implementation? If so, that could be the culprit. I was writing some unit tests in Node.js years ago and remember encountering this. It's a shot in the dark, but I wanted to mention it.

lorro commented 4 years ago

Deleted my previous comment. Didn't understand the question at first. This is our API implementation for the QZ Tray connection.

initPrinterConnection () {
  qz.api.setPromiseType(function promise (resolver) {
    return new RSVP.Promise(resolver)
  })

  qz.api.setSha256Type(function (data) {
    return createHash('sha256').update(data).digest('hex')
  })

  qz.security.setCertificatePromise(function (resolve) {
    return axios.get(`${domain}/cert.pem`).then(data => resolve(data.data))
   })

  qz.security.setSignaturePromise(function (toSign) {
    return function (resolve) {
      return axios.get(`${domain}/sign`, { params: { request: toSign } }).then(data => resolve(data.data))
    }
  })

  return qz.websocket.connect({ retries: 5 })
    .then(() => Promise.resolve())
    .catch((data) => Promise.reject(data))
  }
lorro commented 4 years ago

Unfortunately, my mentioned "solution" from above apparently didn't fix this issue completely. From the two different error notifications I saw coming in before there's still one that pops up now and then...

That's this one:

TypeError: Cannot read property 'split' of undefined (Most recent call first)
File webpack:///./own_modules/qz-tray/qz-tray.js line 627 col 1 in compatible
   var semver = _qz.websocket.connection.version.split(/[.-]/g);
File webpack:///./own_modules/qz-tray/qz-tray.js line 1169 col 1 in print
   _qz.tools.compatible(data[d]);

The other one however seems to be fixed, as we don't see that anymore in the logs.

TypeError: Cannot read property 'version' of null (Most recent call first)
File webpack:///./own_modules/qz-tray/qz-tray.js line 627 col 1 in compatible
   var semver = _qz.websocket.connection.version.split(/[.-]/g);
File webpack:///./own_modules/qz-tray/qz-tray.js line 1169 col 1 in print
   _qz.tools.compatible(data[d]);

So apparently the _qz.websocket.connection always has a correct value, but the version property sometimes becomes undefined.

tresf commented 4 years ago
 qz.api.setPromiseType(function promise (resolver) {
    return new RSVP.Promise(resolver)
  })

  qz.api.setSha256Type(function (data) {
    return createHash('sha256').update(data).digest('hex')
  })

Thanks. Some frameworks require qz.api.setWebSocketType(...), which is where I saw a similar problem in the past. If you're not using this, it rules that out.

So apparently the _qz.websocket.connection always has a correct value, but the version property sometimes becomes undefined.

You also have a 100ms + natural reconnect time for it to recover, so any calls between those events will fail. There's no away to avoid this except to identify what's dropping the connection in the first place.

One thought... I see you use axios.get(...) for AJAX. Can you add an error handler to that call to see if the error coincides with a 404 style problem?

Furthemore, at this point, it would help if you could isolate the problem to a small reproducible example so that we could reproduce on our end.

lorro commented 4 years ago

Furthemore, at this point, it would help if you could isolate the problem to a small reproducible example so that we could reproduce on our end.

I surely understand, problem is I can't really reproduce the issue myself. It's only certain employees that work at a remote store that have the issue sometimes (but not every one). I haven't been able to reproduce it on my own machine, so don't have quite an idea what the culprit is. So it's kinda hard.

For now I commented out some lines in qz-tray.js to resolve this case for our very own usecase.

//clean up data formatting
for (var d = 0; d < data.length; d++) {
  _qz.tools.relative(data[d]);
  // _qz.tools.compatible(data[d]);
}

Noticed the only thing that commented line does is converting a message to version 2.0 format. And splitting the textual numeric version number into an array in doing so. That was where the error popped up and blocked execution of the rest of the code. We're certain we're always using version 2.1. At least they can open the drawer now :)

tresf commented 4 years ago

Just to rehash this problem...

POS are using our own compiled version of QZ Tray 2.1.0-RC5, without issues.

However, the last couple of cash registers that we registered in the POS software are using a newer compiled version of QZ Tray 2.1.0-RC7. Now regulary some issues pop up when an employee tries to print/open the drawer. The following javascript error is returned regularly.

My instinct is that this is identical to this issue, but we never backported the fix to RC7. https://github.com/qzind/tray/pull/504/commits/fa36a272fce52d2f46b67e461e8164205b567754

RC9 is still considered experimental, but it's passing most tests identically to RC7 if you're interested: https://github.com/tresf/tray/releases/tag/v2.10-RC9

Furthemore, we can back-port the patch to RC7 if needed, but it would be a one-off build, all future updates would land on RC9 or 2.1.0 proper.

tresf commented 4 years ago

504 has been merged, which contains the patch mentioned above. Closing. Feel free to request a re-open if this problem resurfaces with RC9 or higher.

billdami commented 4 years ago

Hi @tresf , I'm developing against the latest v2.1 QZTray (running on macOS catalina), and am routinely seeing similar issues with the javascript client encountering fatal Cannot read property 'split' of undefined errors even when the websocket connection is successfully established. Usually, just exiting and restarting the QZtray instance will fix the problem temporarily.

TypeError: Cannot read property 'split' of undefined
    at Object.isVersion (qz-tray.js:638)
    at Object.algorithm (qz-tray.js:716)
    at Object.setSignatureAlgorithm (qz-tray.js:2080)

There seems to be a correlation with the order of websocket events that are received. If the first websocket message received by the client does not contain the version number, the error will happen, even though the subsequent message does contain it. In this screenshot, I get the fatal JS error: Screen Shot 2020-06-04 at 9 29 11 AM (notice the first message is just an object with only a "uid" field)

When working properly, with no issues, the first message received appears to always be the version number, like this: Screen Shot 2020-06-04 at 9 30 06 AM

Could this possibly be caused by my implementation on the JS/client side, i.e. some sort of race condition or something?

tresf commented 4 years ago

It could be a race condition, yes. Can you share some of your startup code so we can look for race conditions?

For example, normally, you'd call

If not using a proper promise chain, you can run into race considtions.

The promise chain is historically done using Promise.then(Promise.then(...)) however the await keyword will work as well if placed inside an async function.

billdami commented 4 years ago

It could be a race condition, yes. Can you share some of your startup code so we can look for race conditions?

For example, normally, you'd call

  • qz.websocket.connect() -- then --
  • qz.printers.find(...) -- then --
  • qz.print(...)

If not using a proper promise chain, you can run into race considtions.

The promise chain is historically done using Promise.then(Promise.then(...)) however the await keyword will work as well if placed inside an async function.

@tresf thanks for the reply. Yeah, I think I may have figured it out. I was doing those actions in the correct order, and properly waiting for the promises to resolve before executing the next one, however I was also re-executing all the startup/config methods (e.g. qz.security.setSignatureAlgorithm, setCertificatePromise, setSignaturePromise, etc) before every new print action/command i sent to qz, as part of my logic to make sure it was connected before running the command.

I modified it to make sure those config methods are only set once on startup, and the issue hasn't come back since.

tresf commented 4 years ago

re-executing all the startup/config (e.g. qz.security.setSignatureAlgorithm, setCertificatePromise, setSignaturePromise, etc) before every new print action/command

Examining the code, I don't see any which alters the connection object in either function however, we do use compatible in setSignaturePromise to warn about hash conflicts. That might be what's happening. Pinging @bberenz to see if we have an unchecked call.

tresf commented 4 years ago

I can see where this would occur. We check for active connection, but don't guard against a connection which hasn't replied yet, so there's a chance connection is active, but version is null. This will happen if the signature API is configured mid-connection. This isn't really supported (API calls should happen first), but I think we can guard against it.

tresf commented 4 years ago

PR opened in #656 which offers a sanitization around the problematic algorithm call as well as a better error message.

tresf commented 4 years ago

As it turns out there may still be a bug with the version logic. Tracking here: #538