icgc-argo / workflow-roadmap

Roadmap and management for genomic data processing
GNU Affero General Public License v3.0

🐛Score-client testing : view showing header only #229

Open edsu7 opened 2 years ago

edsu7 commented 2 years ago

Summary

- download works with both an object ID and a manifest
- view works, but only the BAM header comes back
- mount ran into a timeout

SCORE-CLIENT commands + logs (when command did not execute)

Initialize

argo_accessToken=

docker run \
--name test-env-argo \
-d \
-u $(id -u):$(id -g) \
-it \
-e "METADATA_URL=https://api.platform.icgc-argo.org/storage-api" \
-e "STORAGE_URL=https://api.platform.icgc-argo.org/storage-api" \
-e ACCESSTOKEN=${argo_accessToken} \
--mount \
type=bind,source="$(pwd)",target=/output overture/score

Download object_id/manifest

docker exec \
test-env-argo \
sh -c \
"bin/score-client \
--profile collab \
download \
--object-id \
d090d394-1878-5db6-a82f-f7d8e2c57455 \
--output-dir /output"

docker exec test-env-argo sh -c "bin/score-client \
--profile collab \
download \
--manifest /output/score-manifest.20220217181236.tsv \
--output-dir /output"

View

docker exec test-env-argo sh -c "bin/score-client view --object-id d264263e-c663-596d-bed1-49b3448d8b7b --query 1:1-10000"
Log
Works but only outputs header?

Mount

docker exec test-env-argo sh -c "bin/score-client mount --mount-point /output/icgc --cache-metadata --manifest /output/score-manifest-bam.tsv"
Log
ERROR: Command error: bio.overture.score.client.metadata.EntityNotFoundException: I/O error on GET request for "https://api.platform.icgc-argo.org/storage-api/entities/": Read timed out; nested exception is java.net.SocketTimeoutException: Read timed out

Please check the log for detailed error messages
edsu7 commented 1 year ago

Adding some additional context on view

The problem: score-client view outputs only the header for ARGO files but works fine for 25k files.

Context: score-client view is for users who want to subset the sequencing data or see a partial view of it.

Generally speaking, sequencing data takes the form of SAM (Sequence Alignment/Map), BAM (Binary Alignment/Map), or CRAM (a compressed, reference-based format).

All three carry the same content and structure but differ in encoding: SAM is text, BAM is binary, and CRAM is binary but, instead of storing full sequence strings, stores differences against a reference sequence.

Structure-wise, SAM files have a header (table of contents) and a body (the actual sequences).

Example:

Configuration : score-client info

  Active Configuration: 
    Profile:          default
    Storage URL: https://storage.cancercollaboratory.org
    Metadata URL:https://song.cancercollaboratory.org

Querying for header : score-client view --object-id ace274bb-059e-55f2-875d-56c18705fe41 --header-only | wc -l

106

Querying for body and header : score-client view --object-id ace274bb-059e-55f2-875d-56c18705fe41 --query 1:1-100000 | wc -l

31056

Configuration : score-client info

  Active Configuration: 
    Profile:          default
    Storage URL: https://api.platform.icgc-argo.org/storage-api
    Metadata URL:https://api.platform.icgc-argo.org/storage-api

Querying for header : score-client view --object-id 3f242e1b-7c11-5802-a5dd-d8cca922efca --header-only | wc -l

3372

Querying for body and header : score-client view --object-id ace274bb-059e-55f2-875d-56c18705fe41 --query chr1:1-100000 | wc -l

3372

score-client view --object-id ace274bb-059e-55f2-875d-56c18705fe41 --query 1:1-100000 | wc -l

3372

Note that line count doesn't change.
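The comparison above relies on total line counts. A minimal sketch that separates header lines from body lines, assuming the output is SAM text (header lines start with `@`, which a read name cannot start with). `count_sam_lines` is a hypothetical helper, not part of score-client:

```shell
# count_sam_lines: read SAM text on stdin and print "header=<n> body=<m>".
# Header lines start with '@'; every other non-empty line is counted as body.
count_sam_lines() {
  awk '/^@/ { h++ } !/^@/ && NF { b++ } END { printf "header=%d body=%d\n", h, b }'
}

# Possible usage with the commands from this thread:
#   score-client view --object-id <object id> --query chr1:1-100000 | count_sam_lines
```

A result with `body=0` would confirm that only the header is being returned.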

edsu7 commented 1 year ago

Adding some additional context on mount

Problem: score-client mount errors out when mounting ARGO files but okay for 25k.

Example:

Configuration : score-client info

  Active Configuration: 
    Profile:          default
    Storage URL: https://storage.cancercollaboratory.org
    Metadata URL:https://song.cancercollaboratory.org

manifest contents:

repo_code   file_id object_id   file_format file_name   file_size   md5_sum index_object_id donor_id/donor_count    project_id/project_count    study
collaboratory   FI9994  ace274bb-059e-55f2-875d-56c18705fe41    BAM 41495b5561fb524ca929cdffb5d77d95.bam    107272565465    41495b5561fb524ca929cdffb5d77d95    f0f9b033-7b72-5e49-8fdd-459cd54a212a    DO217962    BRCA-EU PCAWG

Mounting : score-client mount --mount-point output_dir --cache-metadata --manifest 25k_manifest.tsv

[4] Applying manifest view:                                                                                                                                            
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
<object id>: <gnos id>/<file name> @ <file size>
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 - ace274bb-059e-55f2-875d-56c18705fe41: 3b197303-1892-423d-815e-e19b241e80dc/41495b5561fb524ca929cdffb5d77d95.bam @ 99.9 G
 - f0f9b033-7b72-5e49-8fdd-459cd54a212a: 3b197303-1892-423d-815e-e19b241e80dc/41495b5561fb524ca929cdffb5d77d95.bam.bai @ 14.1 M
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Total count: 2, Total size: 99.9 G

Successfully mounted file system at /home/ubuntu/downloads/score-client-5.8.1/hello and is now ready for use.                                                          
Shut down mount after 5.902 s with a total of 0 connects and 0 B bytes read.     

Configuration : score-client info

  Active Configuration: 
    Profile:          default
    Storage URL: https://api.platform.icgc-argo.org/storage-api
    Metadata URL:https://api.platform.icgc-argo.org/storage-api

manifest contents:

repository_code analysis_id object_id   file_type   file_name   file_size   md5sum  index_object_id donor_id    sample_id(s)    program_id
song.collab ff3d425f-44ea-4765-bd42-5f44ea0765e5    3f242e1b-7c11-5802-a5dd-d8cca922efca    CRAM    OCCAMS-GB.DO234195.SA597244.wgs.20210408.aln.cram   81352139083 c4b2998b15406f66d3e9711e482dd566    fedebb35-752c-5c14-8ed7-bb2ee8398e0f    DO234195    SA597244    OCCAMS-GB

Mounting : score-client mount --mount-point output_dir --cache-metadata --manifest argo_manifest.tsv

ERROR: Command error: bio.overture.score.client.metadata.EntityNotFoundException: I/O error on GET request for "https://api.platform.icgc-argo.org/storage-api/entities/": Read timed out; nested exception is java.net.SocketTimeoutException: Read timed out

Please check the log for detailed error messages
edsu7 commented 1 year ago

Moving mount: https://github.com/icgc-argo/workflow-roadmap/issues/229#issuecomment-1424556203 into separate ticket: https://github.com/icgc-argo/workflow-roadmap/issues/326

Buwujiu commented 1 year ago

Possible cause and context: score has code specific to 25K; we need to figure out whether that code is meant to run here and whether it is running correctly.

Score is accessed through the gateway, which might be mishandling the requests. We need to check and understand where and how the requests are made.

edsu7 commented 1 year ago

Containers with appropriate permissions and tokens:

docker run -d -it \
--name dcc-score-client \
-e ACCESSTOKEN=${DCC_ACCESS_TOKEN} \
-e STORAGE_URL=https://storage.cancercollaboratory.org \
-e METADATA_URL=https://song.cancercollaboratory.org \
ghcr.io/overture-stack/score
docker run -d -it \
--name argo-score-client \
-e ACCESSTOKEN=${ARGO_ACCESS_TOKEN} \
-e STORAGE_URL=https://api.platform.icgc-argo.org/storage-api \
-e METADATA_URL=https://api.platform.icgc-argo.org/storage-api \
ghcr.io/overture-stack/score

View files body + header:

docker exec dcc-score-client sh -c "score-client view --object-id ace274bb-059e-55f2-875d-56c18705fe41 --query 1:1-10000"
docker exec argo-score-client sh -c "score-client view --object-id de36d744-e0c6-5f96-a80b-9005e4f69d53 --query chr1:1-10000"

View files header only:

docker exec dcc-score-client sh -c "score-client view --object-id ace274bb-059e-55f2-875d-56c18705fe41 --header-only"
docker exec argo-score-client sh -c "score-client view --object-id de36d744-e0c6-5f96-a80b-9005e4f69d53 --header-only"

Read score-client logs:

docker exec argo-score-client sh -c "cat logs/client.log"
docker exec dcc-score-client sh -c "cat logs/client.log"
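The two view commands above query `1:1-10000` on the DCC file and `chr1:1-10000` on the ARGO file, presumably because the files name their contigs differently. A small sketch for listing the contig names a file actually uses, assuming the `--header-only` output is SAM text with `@SQ` lines. `list_contigs` is a hypothetical helper:

```shell
# list_contigs: read a SAM header on stdin and print one contig name per line,
# taken from the SN: field of each @SQ line.
list_contigs() {
  awk -F'\t' '/^@SQ/ { for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4) }'
}

# Possible usage:
#   docker exec argo-score-client sh -c "score-client view --object-id <object id> --header-only" | list_contigs
```

Checking the names first avoids a query that silently matches nothing because of a `1` versus `chr1` mismatch.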
edsu7 commented 1 year ago

This might be a good place to start: https://github.com/overture-stack/score/blob/75253a671778df882fd2508ff35768b15224df39/score-client/src/main/java/bio/overture/score/client/slicing/SamFileBuilder.java#L239

As observed earlier, running a query:

docker exec argo-score-client sh -c "score-client view --object-id de36d744-e0c6-5f96-a80b-9005e4f69d53 --query chr1:1-10000"

yields the same output as --header-only. Log:

2-23 21:22:01,003 [main] INFO  b.o.s.c.ClientMain - Started ClientMain in 1.957 seconds (JVM running for 2.866)
2023-02-23 21:22:01,016 [main] INFO  session - ***** Beginning view session
2023-02-23 21:22:04,446 [main] INFO  b.o.s.c.c.ViewCommand - Constructed SamFileBuilder: SamFileBuilder [containedOnly=false, useOriginalHeader=false, outputFormat=SAM, query=[chr1:1-10000], outputDir=null, outputIndex=false, bedFile=null, session=Logger[session], entity=null, samInputResource=data=SEEKABLE_STREAM:bio.overture.score.client.transport.NullSourceSeekableHTTPStream@5b56b654;index=null, queryCompiledFlag=false]
2023-02-23 21:22:04,446 [main] WARN  b.o.s.c.c.ViewCommand - Supplied query or bedfile will not be used since no index is available
2023-02-23 21:22:04,451 [main] INFO  session - Adding APGI-AU.DO34584.SA410795.wgs.20210616.aln.header.sam
2023-02-23 21:22:04,451 [main] INFO  session - Preparing to write header only to stdout
2023-02-23 21:22:04,604 [main] INFO  session - Done
lindaxiang commented 1 year ago

@edsu7 I think this issue has something to do with the absence of reference files.

dahiyaAD commented 1 year ago

This occurs specifically for CRAM files for two reasons: 1) the index file URL was missing; 2) a reference file (stored on the client side) needs to be supplied via the command-line arguments.
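A rough sketch of the guard this implies: refuse a region query on a CRAM file when no reference is given. `view_cmd` is a hypothetical wrapper, not part of score-client; it only prints the command it would run, using the `--query` and `--reference-file` options that appear elsewhere in this thread.

```shell
# view_cmd: print the score-client view command for the given inputs, or fail
# when a CRAM region query is requested without a reference file.
view_cmd() {
  object_id=$1        # object ID to view
  file_type=$2        # file_type column from the manifest, e.g. BAM or CRAM
  query=${3:-}        # region such as chr1:10000-20000, or empty for no query
  reference=${4:-}    # path to a reference FASTA, or empty

  if [ "$file_type" = "CRAM" ] && [ -n "$query" ] && [ -z "$reference" ]; then
    echo "error: CRAM query requires --reference-file" >&2
    return 1
  fi

  cmd="score-client view --object-id $object_id"
  if [ -n "$query" ]; then cmd="$cmd --query $query"; fi
  if [ -n "$reference" ]; then cmd="$cmd --reference-file $reference"; fi
  printf '%s\n' "$cmd"
}
```

Failing early like this gives the user a clearer message than a header-only result.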

Buwujiu commented 1 year ago

@lindaxiang @edsu7 Could you please test in QA? Thank you!

edsu7 commented 1 year ago

@Buwujiu and @dahiyaAD

Tested on QA and Prod data; both gave the same findings.

Needs the following fixes:

docker container on prod data:

docker run -d -it \
--name prod-score-client \
-e ACCESSTOKEN=${token} \
-e STORAGE_URL=https://api.platform.icgc-argo.org/storage-api \
-e METADATA_URL=https://api.platform.icgc-argo.org/storage-api \
--mount type=bind,source="$(pwd)",target=/output \
overture/score:5.8.4

Test 1 - View only

Expected behaviour, aside from a weird OpenJDK warning.

cmd:

docker exec prod-score-client sh -c "score-client view --object-id 1bd03148-7c5e-5a03-8271-a931ed2ab5ea"

log:

OpenJDK 64-Bit Server VM warning: Ignoring option --illegal-access=deny; support was removed in 17.0
Running...Viewing...                                                            
Validating repository connection...
3372 

Test 2 - View w/ Header

Same output as Test 1.

cmd:

docker exec prod-score-client sh -c "score-client view --object-id 1bd03148-7c5e-5a03-8271-a931ed2ab5ea --header-only"

log:

OpenJDK 64-Bit Server VM warning: Ignoring option --illegal-access=deny; support was removed in 17.0
Running...Viewing...                                                            
Validating repository connection...
3372 

Test 3 - View w/ Header + Query

Expected an error to be thrown when a score-client query is invoked on a CRAM file without a reference file. There are also a lot of DEBUG messages; was a setting left on?

cmd:

docker exec prod-score-client sh -c "score-client view --object-id 1bd03148-7c5e-5a03-8271-a931ed2ab5ea --query chr1:10000-20000 | wc -l"

log:

DEBUG   2023-04-20 15:54:04 CompressionHeader   FOUND ENCODING: RI_RefId, HUFFMAN, [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG   2023-04-20 15:54:04 CompressionHeader   FOUND ENCODING: SC_SoftClip, BYTE_ARRAY_STOP, [0, 14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG   2023-04-20 15:54:18 ContainerIO READ CONTAINER: seqID=0, start=9999, span=179550, records=10000, slices=1, blocks=30.
DEBUG   2023-04-20 15:54:19 ContainerParser Adding external data: 30
DEBUG   2023-04-20 15:54:19 ContainerParser Adding external data: 31
DEBUG   2023-04-20 15:54:19 ContainerParser Adding external data: 32
DEBUG   2023-04-20 15:54:20 ContainerParser Slice records read time: 420
ERROR: Command error: Contig chr1 not found in the reference file.              

Please check the log for detailed error messages
0

Test 4 - View w/ Header + Query + Reference

The fix works and the CRAM body is output; however, the result is still messy due to the DEBUG messages.

cmd:

docker exec prod-score-client sh -c "score-client view --object-id 1bd03148-7c5e-5a03-8271-a931ed2ab5ea --query chr1:10000-20000 --reference-file /output/references/Homo_sapiens/GATK/GRCh38/Sequence//WholeGenomeFasta/Homo_sapiens_assembly38.fasta | wc -l"

log:

].
DEBUG   2023-04-20 15:59:28 CompressionHeader   FOUND ENCODING: SC_SoftClip, BYTE_ARRAY_STOP, [0, 14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG   2023-04-20 15:59:43 ContainerIO READ CONTAINER: seqID=0, start=189449, span=592649, records=10000, slices=1, blocks=29.
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 4281155
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 5783898
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 5456218
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 11
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 12
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 13
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 14
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 15
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 16
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 17
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 19
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 20
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 21
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 22
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 7364966
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 23
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 5063514
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 24
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 26
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 27
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 5788483
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 28
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 29
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 30
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 31
DEBUG   2023-04-20 15:59:43 ContainerParser Adding external data: 32
DEBUG   2023-04-20 15:59:43 ContainerParser Slice records read time: 216
7622
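In Test 3, `wc -l` reported 0 while the DEBUG lines were still shown, which suggests they are written to stderr rather than stdout. For cases where the two streams end up merged (for example with `2>&1`), a minimal filter sketch, assuming DEBUG lines always follow the `DEBUG <date> <time> ...` pattern seen in the logs above. `strip_debug` is a hypothetical helper:

```shell
# strip_debug: drop lines that look like "DEBUG   2023-04-20 15:59:43 ..."
# and pass every other line through unchanged.
strip_debug() {
  sed -E '/^DEBUG[[:space:]]+[0-9]{4}-[0-9]{2}-[0-9]{2} /d'
}

# Possible usage:
#   docker exec prod-score-client sh -c "score-client view ... 2>&1" | strip_debug | wc -l
```

This only hides the noise on the client side; the logging level itself would still need to be fixed in score-client.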

Test 5 - View Help

No mention of CRAM-specific options in the help.

cmd:

docker exec prod-score-client sh -c "score-client view"

log:

ERROR: Bad parameter(s): One of --object-id, --input-file or --manifest must be specified
lindaxiang commented 1 year ago

Tested with the same dataset as Edmund

Please check the log for detailed error messages


I've downloaded the CRAM file locally, and with the same compressed reference files `samtools view` works properly.

2) If the reference file is provided in uncompressed form (`*.fa`), the CRAM body is printed.
cmd:
`
score-client view --object-id 1bd03148-7c5e-5a03-8271-a931ed2ab5ea --query chr1:1000-20000 --reference-file /reference/GRCh38_hla_decoy_ebv/GRCh38_hla_decoy_ebv.fa
`

Ideally, score-client view should support both compressed and uncompressed reference files. In most cases, reference files will be compressed.
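Until that is supported, one possible workaround is to decompress the reference before passing it to `--reference-file`. A sketch, assuming the compressed reference is gzip-compatible and named with a `.gz` suffix (an assumption; the thread does not state the compression format). `prepare_reference` is a hypothetical helper:

```shell
# prepare_reference: print the path of an uncompressed reference.
# For a *.gz input, decompress next to it (once) and print the new path;
# any other input is printed unchanged.
prepare_reference() {
  ref=$1
  case "$ref" in
    *.gz)
      out=${ref%.gz}
      if [ ! -f "$out" ]; then
        # Write to a temporary name first so a failed run leaves no partial file.
        gzip -dc "$ref" > "$out.tmp" && mv "$out.tmp" "$out" || return 1
      fi
      printf '%s\n' "$out"
      ;;
    *)
      printf '%s\n' "$ref"
      ;;
  esac
}

# Possible usage:
#   score-client view --object-id <object id> --query chr1:1000-20000 \
#     --reference-file "$(prepare_reference /reference/<reference>.fa.gz)"
```

This costs extra disk space for the uncompressed copy, so native support in score-client would still be preferable.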
Buwujiu commented 1 year ago

Testing complete; feedback still needs to be addressed: https://github.com/icgc-argo/workflow-roadmap/issues/346. Moving to awaiting release.