raiden-network / light-client

The Raiden Light Client
https://raiden.network/
MIT License
33 stars 31 forks source link

BF4 gets 500 error while calling payments API #2314

Closed agatsoh closed 3 years ago

agatsoh commented 3 years ago

Description


{
  "task": "<TransferTask: {'from': 0, 'to': 3, 'amount': 1000000000000000, 'lock_timeout': 30, 'identifier': 159274100111000101}>",
  "event": "Task errored",
  "logger": "scenario_player.tasks.base",
  "level": "error",
  "timestamp": "2020-11-04 01:57:11.038213",
  "exception": "Traceback (most recent call last):\n  File \"/usr/local/lib/python3.7/site-packages/scenario_player/tasks/base.py\", line 140, in __call__\n    return_val = self._run(*args, **kwargs)\n  File \"/usr/local/lib/python3.7/site-packages/scenario_player/tasks/api_base.py\", line 58, in _run\n    f'HTTP status code \"{resp.status_code}\" while fetching {url}. '\nscenario_player.exceptions.RESTAPIStatusMismatchError: HTTP status code \"500\" while fetching http://127.0.0.1:47881/api/v1/payments/0x59105441977ecD9d805A4f5b060E34676F50F806/0x5409D2B144a450C459A86c6352898Ef759EdE198. Expected 2..: <!DOCTYPE html>\n<html lang=\"en\">\n<head>\n<meta charset=\"utf-8\">\n<title>Error</title>\n</head>\n<body>\n<pre>InternalServerError: Transfer expired.<br> &nbsp; &nbsp;at Object.internalErrorHandler (/lc/index.js:222762:31)<br> &nbsp; &nbsp;at Layer.handle_error (/lc/index.js:47797:5)<br> &nbsp; &nbsp;at trim_prefix (/lc/index.js:47371:13)<br> &nbsp; &nbsp;at /lc/index.js:47340:7<br> &nbsp; &nbsp;at Function.process_params (/lc/index.js:47391:12)<br> &nbsp; &nbsp;at next (/lc/index.js:47331:10)<br> &nbsp; &nbsp;at Layer.handle_error (/lc/index.js:47793:12)<br> &nbsp; &nbsp;at trim_prefix (/lc/index.js:47371:13)<br> &nbsp; &nbsp;at /lc/index.js:47340:7<br> &nbsp; &nbsp;at Function.process_params (/lc/index.js:47391:12)</pre>\n</body>\n</html>\n"
}

stack trace:

"Traceback (most recent call last):\n  
File \"/usr/local/lib/python3.7/site-packages/scenario_player/tasks/base.py\", line 140, in __call__\n    
return_val = self._run(*args, **kwargs)\n  
File \"/usr/local/lib/python3.7/site-packages/scenario_player/tasks/api_base.py\", line 58, in _run\n    
f'HTTP status code \"{resp.status_code}\" while fetching {url}. '\n

scenario_player.exceptions.RESTAPIStatusMismatchError: HTTP status code \"500\" while fetching http://127.0.0.1:47881/api/v1/payments/0x59105441977ecD9d805A4f5b060E34676F50F806/0x5409D2B144a450C459A86c6352898Ef759EdE198

scenario_player.exceptions.RESTAPIStatusMismatchError: HTTP status code \"500\" while fetching

http://127.0.0.1:47881/api/v1/payments/0x59105441977ecD9d805A4f5b060E34676F50F806/0x5409D2B144a450C459A86c6352898Ef759EdE198

from: 0, to: 3, amount: 1_000_000_000_000_000, lock_timeout: 30

- parallel:
          name: "Make 100 transfers from 0 to 2, 3 and 5 in parallel"
          tasks:
            - serial:
                name: "Make 100 transfers from 0 to 2"
                repeat: 100
                tasks:
                  - transfer: {from: 0, to: 2, amount: 1_000_000_000_000_000, lock_timeout: 30}
            - serial:
                name: "Make 100 transfers from 0 to 3"
                repeat: 100
                tasks:
                  - transfer: {from: 0, to: 3, amount: 1_000_000_000_000_000, lock_timeout: 30} **fails here**
            - serial:
                name: "Make 100 transfers from 0 to 5"
                repeat: 100
                tasks:
                  - transfer: {from: 0, to: 5, amount: 1_000_000_000_000_000, lock_timeout: 30}

scenario-player-run_bf4_multi_payments_same_node_2020-11-04T01:39:53.log

node0 and node3 logs: The node logs are empty and dont really reveal much about the internal server error of the the raiden node. bf4_node0_node3_04thNov2020.tar.gz

Acceptance criteria

Tasks

palango commented 3 years ago

I found this error, I'm just surprised that an expired transfer should crash the node:

127.0.0.1 - - "POST /api/v1/payments/0x59105441977ecD9d805A4f5b060E34676F50F806/0x5409D2B144a450C459A86c6352898Ef759EdE198 HTTP/1.1" 500 742 "-" "python-requests/2.24.0"
InternalServerError: Transfer expired.
    at Object.internalErrorHandler (/lc/index.js:222762:31)
    at Layer.handle_error (/lc/index.js:47797:5)
    at trim_prefix (/lc/index.js:47371:13)
    at /lc/index.js:47340:7
    at Function.process_params (/lc/index.js:47391:12)
    at next (/lc/index.js:47331:10)
    at Layer.handle_error (/lc/index.js:47793:12)
    at trim_prefix (/lc/index.js:47371:13)
    at /lc/index.js:47340:7
    at Function.process_params (/lc/index.js:47391:12)

There's also this higher up, which I cannot decypher:


2020-11-04T01:48:21.228Z [error] Error sending message to global room { roomName: 'raiden_goerli_path_finding' } {
  type: 'PFSCapacityUpdate',
  canonical_identifier: {
    chain_identifier: BigNumber { _hex: '0x05' },
    token_network_address: '0x3EA2a1fED7FdEf300DA19E97092Ce8FdF8bf66A3',
    channel_identifier: BigNumber { _hex: '0x25ce' }
  },
  updating_participant: '0x421231BADBc8aE3aC3bA340E8d00a5ca129c6257',
  other_participant: '0x0b49e040F03a329C8a2e406190a9A9647166c2E0',
  updating_nonce: BigNumber { _hex: '0x03' },
  other_nonce: BigNumber { _hex: '0x00' },
  updating_capacity: BigNumber { _hex: '0x0dd99bb65dd70000' },
  other_capacity: BigNumber { _hex: '0x038d7ea4c68000' },
  reveal_timeout: BigNumber { _hex: '0x14' },
  signature: '0x14994a5654c77acaa899167facfd56b5f25bdd13fc27b887c015ddfd3002b255661b43cf80342406757800b717682d1ff26b8d76c529451072dd9dd0e531b0131c'
} <ref *1> M_UNKNOWN: Internal server error
    at parseErrorResponse (/lc/index.js:110946:13)
    at /lc/index.js:110901:17
    at Request._callback (/lc/index.js:110835:9)
    at Request.self.callback (/lc/index.js:166091:22)
    at Request.emit (events.js:315:20)
    at Request.<anonymous> (/lc/index.js:167060:10)
    at Request.emit (events.js:315:20)
    at IncomingMessage.<anonymous> (/lc/index.js:166982:12)
    at Object.onceWrapper (events.js:421:28)
    at IncomingMessage.emit (events.js:327:22) {
  errcode: 'M_UNKNOWN',
  data: { errcode: 'M_UNKNOWN', error: 'Internal server error' },
  httpStatus: 500,
  event: MatrixEvent {
    event: {
      type: 'm.room.message',
      content: {
        body: '{"type":"PFSCapacityUpdate","canonical_identifier":{"chain_identifier":"5","token_network_address":"0x3EA2a1fED7FdEf300DA19E97092Ce8FdF8bf66A3","channel_identifier":"9678"},"updating_participant":"0x421231BADBc8aE3aC3bA340E8d00a5ca129c6257","other_participant":"0x0b49e040F03a329C8a2e406190a9A9647166c2E0","updating_nonce":"3","other_nonce":"0","updating_capacity":"998000000000000000","other_capacity":"1000000000000000","reveal_timeout":"20","signature":"0x14994a5654c77acaa899167facfd56b5f25bdd13fc27b887c015ddfd3002b255661b43cf80342406757800b717682d1ff26b8d76c529451072dd9dd0e531b0131c"}',
        msgtype: 'm.text'
      },
      event_id: '~!itpaijJrCycYDeCyPj:transport.transport01.raiden.network:m1604454460913.12',
      user_id: '@0x421231badbc8ae3ac3ba340e8d00a5ca129c6257:transport.transport01.raiden.network',
      sender: '@0x421231badbc8ae3ac3ba340e8d00a5ca129c6257:transport.transport01.raiden.network',
      room_id: '!itpaijJrCycYDeCyPj:transport.transport01.raiden.network',
      origin_server_ts: 1604454460913
    },
    sender: RoomMember {
      roomId: '!itpaijJrCycYDeCyPj:transport.transport01.raiden.network',
      userId: '@0x421231badbc8ae3ac3ba340e8d00a5ca129c6257:transport.transport01.raiden.network',
      typing: false,
      name: '@0x421231badbc8ae3ac3ba340e8d00a5ca129c6257:transport.transport01.raiden.network',
      rawDisplayName: '@0x421231badbc8ae3ac3ba340e8d00a5ca129c6257:transport.transport01.raiden.network',
      powerLevel: 0,
      powerLevelNorm: 0,
      user: null,
      membership: null,
      events: { member: null },
      _isOutOfBand: false,
      _modified: 1604454221442
    },
    target: null,
    status: 'not_sent',
    error: [Circular *1],
    forwardLooking: true,
    _pushActions: null,
    _replacingEvent: null,
    _localRedactionEvent: null,
    _isCancelled: false,
    _clearEvent: {},
    _senderCurve25519Key: null,
    _claimedEd25519Key: null,
    _forwardingCurve25519KeyChain: [],
    _untrusted: null,
    _decryptionPromise: null,
    _retryDecryption: false,
    verificationRequest: null,
    _txnId: 'm1604454460913.12'
  }
}
andrevmatos commented 3 years ago

Ok.. this is a warning, not a fatal error, it's catched and doesn't shutdown the SDK.. BUT it does indicate that there was an issue on transport server, which didn't get this message through, and this is probably why something else failed (e.g. another transfer didn't go through because the PFS's view of the capacities were outdated). We need both to check on retrying this request (probably with exponentialBackoff), since we global room message sending isn't expected to fail forever, and also to check what else failed because of it and was not properly handled to end up in a 500 error, instead of a 402 or 409 one.

andrevmatos commented 3 years ago

Reading the source, I can see the 500 was caused by Transfer expiration, as mentioned. Transfer expiration needs to be handled as a failure and raise a proper error code. Besides this, it did fail because on the recipient's side, once again, the matrix server misbehaved when we tried to fetch presence for the initiator, in order to request the secret:

2020-11-04T01:47:49.905Z [info] %c action     color: #03A9F4; font-weight: bold {
  type: 'matrix/presence/failure',
  payload: M_UNKNOWN: Internal server error
      at parseErrorResponse (/lc/index.js:110946:13)
      at /lc/index.js:110901:17
      at Request._callback (/lc/index.js:110835:9)
      at Request.self.callback (/lc/index.js:166091:22)
      at Request.emit (events.js:315:20)
      at Request.<anonymous> (/lc/index.js:167060:10)
      at Request.emit (events.js:315:20)
      at IncomingMessage.<anonymous> (/lc/index.js:166982:12)
      at Object.onceWrapper (events.js:421:28)
      at IncomingMessage.emit (events.js:327:22) {
    errcode: 'M_UNKNOWN',
    data: { errcode: 'M_UNKNOWN', error: 'Internal server error' },
    httpStatus: 500
  },
  meta: { address: '0x421231BADBc8aE3aC3bA340E8d00a5ca129c6257' },
  error: true
}
/sync error M_UNKNOWN: Internal server error
    at parseErrorResponse (/lc/index.js:110946:13)
    at /lc/index.js:110901:17
    at Request._callback (/lc/index.js:110835:9)
    at Request.self.callback (/lc/index.js:166091:22)
    at Request.emit (events.js:315:20)
    at Request.<anonymous> (/lc/index.js:167060:10)
    at Request.emit (events.js:315:20)
    at IncomingMessage.<anonymous> (/lc/index.js:166982:12)
    at Object.onceWrapper (events.js:421:28)
    at IncomingMessage.emit (events.js:327:22) {
  errcode: 'M_UNKNOWN',
  data: [Object],
  httpStatus: 500
}
M_UNKNOWN: Internal server error
    at parseErrorResponse (/lc/index.js:110946:13)
    at /lc/index.js:110901:17
    at Request._callback (/lc/index.js:110835:9)
    at Request.self.callback (/lc/index.js:166091:22)
    at Request.emit (events.js:315:20)
    at Request.<anonymous> (/lc/index.js:167060:10)
    at Request.emit (events.js:315:20)
    at IncomingMessage.<anonymous> (/lc/index.js:166982:12)
    at Object.onceWrapper (events.js:421:28)
    at IncomingMessage.emit (events.js:327:22) {
  errcode: 'M_UNKNOWN',
  data: { errcode: 'M_UNKNOWN', error: 'Internal server error' },
  httpStatus: 500
}

Again, as it should, it was not fatal and just emitted the correct matrix/presence/failure action. Without presence, we can't communicate with them, we can't request for the secret, nor continue transfer's protocol, therefore it rightfully expired on initiator after enough time. We may check on retrying 500 matrix errors on presence fetch as well, or maybe on every matrix API entrypoints.