icgc-argo / workflow-roadmap

Roadmap and management for genomic data processing
GNU Affero General Public License v3.0

🐛 PublishAnalysis in intermediate-song hangs and results in a 504 Gateway Time-out error #101

Closed hknahal closed 3 years ago

hknahal commented 3 years ago

Describe the bug

The PublishAnalysis endpoint on intermediate-song (https://intermediate-song.rdpc.cancercollaboratory.org/swagger-ui.html#/Analysis/publishAnalysisUsingPUT) hangs and then fails with a 504 Gateway Time-out error. The SONG payload does appear to get published (confirmed with the ReadAnalysis endpoint), but the PublishAnalysis call itself returns no confirmation of this.

This issue is related to https://github.com/icgc-argo/workflow-roadmap/issues/97 which outlines the steps involved in un-suppressing the OCCAMS-GB payloads Jon fixed earlier. The last step involves publishing the payload and this is where the error occurs. I ran the same steps on intermediate-song.rdpc-qa and was able to publish there successfully, so this error is restricted to intermediate-song.rdpc only.

Steps To Reproduce

You can try these steps out to see the error (if you need more analysis IDs to test, let me know)

  1. Check status of Analysis ID 52f6c8b8-970f-462a-b6c8-b8970fc62a18: curl -X GET "https://intermediate-song.rdpc.cancercollaboratory.org/studies/OCCAMS-GB/analysis/52f6c8b8-970f-462a-b6c8-b8970fc62a18" -H "accept: */*"

It is currently SUPPRESSED

{
  "analysisId": "52f6c8b8-970f-462a-b6c8-b8970fc62a18",
  "studyId": "OCCAMS-GB",
  "analysisState": "SUPPRESSED",
  "createdAt": "2020-12-11T19:35:42.567585",
  "updatedAt": "2021-02-19T15:39:14.206356",
  "firstPublishedAt": null,
  "publishedAt": null,
  "analysisStateHistory": [],
  "samples": [
    {
      "sampleId": "SA597172",
      "specimenId": "SP203381",
      "submitterSampleId": "LP6008141-DNA_C01",
      "matchedNormalSubmitterSampleId": "LP6008138-DNA_C01",
      "sampleType": "Total DNA",
      "specimen": {
        "specimenId": "SP203381",
        "donorId": "DO234207",
        "submitterSpecimenId": "edf7f7a63d4095d6a06cef23060204bba332c6e478a9f348389469206ca8f0bc",
        "tumourNormalDesignation": "Tumour",
        "specimenTissueSource": "Solid tissue",
        "specimenType": "Primary tumour"
      },
      "donor": {
        "donorId": "DO234207",
        "studyId": "OCCAMS-GB",
        "gender": "Female",
        "submitterDonorId": "e28ac921f54f5818462d387b9c61bba5f72d25451cbc98f5e1804f741c1f1fe7"
      }
    }
  ],
  "files": [
    {
      "info": {
        "data_category": "Sequencing Reads",
        "legacyAnalysisId": "EGAR00001370192"
      },
      "objectId": "8723eff4-777b-5256-96fb-1dded2a9b619",
      "studyId": "OCCAMS-GB",
      "analysisId": "52f6c8b8-970f-462a-b6c8-b8970fc62a18",
      "fileName": "b37916947041eb164349904689cfe75c.LP6008141-DNA_C01.bam",
      "fileSize": 93928362007,
      "fileType": "BAM",
      "fileMd5sum": "b37916947041eb164349904689cfe75c",
      "fileAccess": "controlled",
      "dataType": "Submitted Reads"
    }
  ],
  "analysisType": {
    "name": "sequencing_experiment",
    "version": 6
  },
  "experiment": {
    "platform": "ILLUMINA",
    "platform_model": null,
    "sequencing_date": null,
    "sequencing_center": null,
    "experimental_strategy": "WGS",
    "submitter_sequencing_experiment_id": "EXP-433"
  },
  "read_groups": [
    {
      "file_r1": "b37916947041eb164349904689cfe75c.LP6008141-DNA_C01.bam",
      "file_r2": "b37916947041eb164349904689cfe75c.LP6008141-DNA_C01.bam",
      "insert_size": null,
      "library_name": "LP6008141-DNA_C01",
      "is_paired_end": true,
      "platform_unit": "LP6008141-DNA_C01",
      "read_length_r1": null,
      "read_length_r2": null,
      "sample_barcode": null,
      "read_group_id_in_bam": null,
      "submitter_read_group_id": "LP6008141-DNA_C01"
    }
  ],
  "read_group_count": 1
}
  2. Publish the Analysis using this endpoint: curl -X PUT "https://intermediate-song.rdpc.cancercollaboratory.org/studies/OCCAMS-GB/analysis/publish/52f6c8b8-970f-462a-b6c8-b8970fc62a18?ignoreUndefinedMd5=false" -H "accept: */*" -H "Authorization: Bearer <your_token>"

It hangs for about 2 minutes before erroring out with this message:

<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>openresty/1.15.8.2</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
  3. Check the status of Analysis ID 52f6c8b8-970f-462a-b6c8-b8970fc62a18 again: curl -X GET "https://intermediate-song.rdpc.cancercollaboratory.org/studies/OCCAMS-GB/analysis/52f6c8b8-970f-462a-b6c8-b8970fc62a18" -H "accept: */*"

It will now say "analysisState": "PUBLISHED", even though the PublishAnalysis endpoint returned a 504 Gateway Time-out error. Normally the PublishAnalysis endpoint returns the published payload, which shows the status as PUBLISHED.
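
Until the endpoint is fixed, the three steps above can be scripted so that a 504 is not mistaken for a failed publish. The following is a minimal sketch, not part of SONG or its client: the helper names (`publish`, `read_state`, `publish_and_confirm`), the retry counts and the assumption that a successful publish returns a 2xx status are all illustrative. Only the two URLs and the `analysisState` field come from the steps above.

```python
import json
import time
import urllib.error
import urllib.request

# Base URL taken from the curl commands in this issue.
SONG_URL = "https://intermediate-song.rdpc.cancercollaboratory.org"


def publish(study_id, analysis_id, token, timeout=180):
    """Send the publish PUT; return the HTTP status, or None if no response arrived."""
    url = (f"{SONG_URL}/studies/{study_id}/analysis/publish/"
           f"{analysis_id}?ignoreUndefinedMd5=false")
    req = urllib.request.Request(
        url, method="PUT",
        headers={"accept": "*/*", "Authorization": f"Bearer {token}"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code          # e.g. 504 from the gateway
    except OSError:
        return None              # client-side timeout or connection failure


def read_state(study_id, analysis_id):
    """Return the analysisState reported by the ReadAnalysis endpoint."""
    url = f"{SONG_URL}/studies/{study_id}/analysis/{analysis_id}"
    req = urllib.request.Request(url, headers={"accept": "*/*"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["analysisState"]


def publish_and_confirm(study_id, analysis_id, token,
                        publish_fn=publish, read_fn=read_state,
                        retries=5, delay=10.0):
    """Publish, and on a gateway timeout fall back to polling the analysis state."""
    status = publish_fn(study_id, analysis_id, token)
    if status is not None and 200 <= status < 300:
        return True              # assumption: 2xx means the publish was confirmed
    if status not in (504, None):
        return False             # a real error (auth, validation, ...): do not poll
    for _ in range(retries):
        if read_fn(study_id, analysis_id) == "PUBLISHED":
            return True
        time.sleep(delay)
    return False
```

For example, `publish_and_confirm("OCCAMS-GB", "52f6c8b8-970f-462a-b6c8-b8970fc62a18", "<your_token>")` would return `True` in the situation described here, where the PUT times out at the gateway but the state still changes to PUBLISHED. The `publish_fn` and `read_fn` parameters exist only so the logic can be exercised without network access.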

rosibaj commented 3 years ago

@hknahal can you try this again with a new payload? Dusan made some adjustments in the deployment, and I would like to confirm whether this is still happening so we can get this ticket moving!

hknahal commented 3 years ago

@rosibaj I tried to publish a payload but it still gives the same 504 error. It hangs for a few minutes before erroring out.

rosibaj commented 3 years ago

@hknahal are there any logs in your Song client that are relevant to this issue?

andricDu commented 3 years ago

Kafka connection seems to be broken. Will investigate.

2021-03-09 14:21:08,390 [http-nio-8080-exec-10] INFO b.o.s.s.v.SchemaValidator -
2021-03-09 14:21:08,477 [http-nio-8080-exec-8] ERROR o.s.k.s.LoggingProducerListener - Exception thrown when sending a message with key='null' and payload='{"analysisId":"52f6c8b8-970f-462a-b6c8-b8970fc62a18","studyId":"OCCAMS-GB","state":"PUBLISHED","acti...' to topic song_analysis:
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
2021-03-09 14:22:08,468 [http-nio-8080-exec-10] ERROR o.s.k.s.LoggingProducerListener - Exception thrown when sending a message with key='null' and payload='{"analysisId":"52f6c8b8-970f-462a-b6c8-b8970fc62a18","studyId":"OCCAMS-GB","state":"PUBLISHED","acti...' to topic song_analysis:
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.

rosibaj commented 3 years ago

@hknahal we asked you in standup to try publishing an analysis again - can you post the results here?

hknahal commented 3 years ago

@rosibaj I tried publishing the analysis ID cf370c5f-ea9c-47ec-b70c-5fea9c47ecdf for the OCCAMS-GB program, but it still resulted in the same 504 error (after hanging for a few minutes). I tried publishing using the SONG API endpoint (https://intermediate-song.rdpc.cancercollaboratory.org/swagger-ui.html#/Analysis/publishAnalysisUsingPUT).

rosibaj commented 3 years ago

@andricDu any more logs from Hardeep's latest publish?

andricDu commented 3 years ago

@hknahal I've narrowed it down to a missing network policy and it's been fixed. Could you please try again?
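
A missing network policy typically shows up as a blocked TCP path between the SONG pod and the Kafka broker, which is consistent with the producer's "Failed to update metadata" timeouts in the logs earlier in this thread. A quick way to test that path from inside the affected pod is a plain TCP connect. The sketch below is only an illustration: `can_connect` is a hypothetical helper, and the broker host and port in the usage note are placeholders, since the real address is not given in this issue.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False
```

Running something like `can_connect("kafka.example.internal", 9092)` (placeholder address) from the SONG pod would be expected to return `False` while the policy is missing and `True` once it is in place. A successful connect only shows that the network path is open; it does not by itself prove that the Kafka producer can fetch metadata.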

hknahal commented 3 years ago

@andricDu @rosibaj It worked this time! I was able to successfully publish an analysis without it hanging and returning a 504 error.

Result: AnalysisId 5855f1bc-3ea0-4fc7-95f1-bc3ea0ffc7bf successfully published

@rosibaj I'll go ahead and do the same for the remaining analyses.

hknahal commented 3 years ago

@rosibaj I published all the previously suppressed OCCAMS-GB analyses in intermediate-song: https://github.com/icgc-argo/workflow-roadmap/issues/97

rosibaj commented 3 years ago

@hknahal I will close this issue now that the network policy has been fixed.