Open edsu7 opened 2 years ago
Adding some additional context on `view`.
The problem: `score-client view` outputs only the header for ARGO files but works fine for 25k files.
Context: `score-client view` is for users who want to subset, or get a partial view of, the sequencing data.
Generally speaking, the sequencing data takes the form of a SAM (Sequence Alignment/Map) file, a BAM (Binary Alignment/Map) file, or a CRAM file (not sure what this one stands for).
All three carry the same contents and structure but differ in encoding: SAM is a text format, BAM is binary, and CRAM is binary but, instead of carrying the full sequence strings, stores them relative to a reference sequence.
Structure-wise, SAM files have a header (table of contents) and a body (the actual sequences).
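Since header lines in SAM text begin with `@` and every other non-empty line is an alignment record, the header/body split can be checked directly on the text that `score-client view` prints. A minimal sketch; the helper name `count_sam_sections` is made up for illustration and is not part of score-client:

```shell
# Count SAM header lines and alignment records read from stdin.
count_sam_sections() {
  awk '
    /^@/ { header++; next }   # header lines begin with "@"
    NF   { body++ }           # any other non-empty line is an alignment record
    END  { printf "header=%d body=%d\n", header, body }
  '
}
```

Piping a query such as `score-client view --object-id ace274bb-059e-55f2-875d-56c18705fe41 --query 1:1-100000` into `count_sam_sections` should report a non-zero `body` count when alignment records are returned, and `body=0` when only the header comes back.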
Example:
Configuration : score-client info
Active Configuration:
Profile: default
Storage URL: https://storage.cancercollaboratory.org
Metadata URL:https://song.cancercollaboratory.org
Querying for header : score-client view --object-id ace274bb-059e-55f2-875d-56c18705fe41 --header-only | wc -l
106
Querying for body and header : score-client view --object-id ace274bb-059e-55f2-875d-56c18705fe41 --query 1:1-100000 | wc -l
31056
Configuration : score-client info
Active Configuration:
Profile: default
Storage URL: https://api.platform.icgc-argo.org/storage-api
Metadata URL:https://api.platform.icgc-argo.org/storage-api
Querying for header : score-client view --object-id 3f242e1b-7c11-5802-a5dd-d8cca922efca --header-only | wc -l
3372
Querying for body and header : score-client view --object-id ace274bb-059e-55f2-875d-56c18705fe41 --query chr1:1-100000 | wc -l
3372
score-client view --object-id ace274bb-059e-55f2-875d-56c18705fe41 --query 1:1-100000 | wc -l
3372
Note that the line count doesn't change: each query returns the same 3372 lines as the header-only run.
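The queries above try both `1` and `chr1` as the contig name. Which spelling a given file uses can be read from the `SN:` field of the `@SQ` header lines. A small sketch, with `list_sam_contigs` as a hypothetical helper name rather than anything score-client provides:

```shell
# Print the contig names declared in the @SQ lines of a SAM header on stdin.
list_sam_contigs() {
  awk -F '\t' '
    /^@SQ/ {
      # Fields after the record type are TAG:VALUE pairs; SN holds the name.
      for (i = 2; i <= NF; i++)
        if ($i ~ /^SN:/) print substr($i, 4)
    }
  '
}
```

For example, the output of `score-client view --object-id 3f242e1b-7c11-5802-a5dd-d8cca922efca --header-only` could be piped into `list_sam_contigs` to see whether the file declares `1` or `chr1` before choosing a `--query` region.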
Adding some additional context on `mount`.
Problem: `score-client mount` errors out when mounting ARGO files but works for 25k files.
Example:
Configuration : score-client info
Active Configuration:
Profile: default
Storage URL: https://storage.cancercollaboratory.org
Metadata URL:https://song.cancercollaboratory.org
manifest contents:
repo_code file_id object_id file_format file_name file_size md5_sum index_object_id donor_id/donor_count project_id/project_count study
collaboratory FI9994 ace274bb-059e-55f2-875d-56c18705fe41 BAM 41495b5561fb524ca929cdffb5d77d95.bam 107272565465 41495b5561fb524ca929cdffb5d77d95 f0f9b033-7b72-5e49-8fdd-459cd54a212a DO217962 BRCA-EU PCAWG
Mounting : score-client mount --mount-point output_dir --cache-metadata --manifest 25k_manifest.tsv
[4] Applying manifest view:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
<object id>: <gnos id>/<file name> @ <file size>
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
- ace274bb-059e-55f2-875d-56c18705fe41: 3b197303-1892-423d-815e-e19b241e80dc/41495b5561fb524ca929cdffb5d77d95.bam @ 99.9 G
- f0f9b033-7b72-5e49-8fdd-459cd54a212a: 3b197303-1892-423d-815e-e19b241e80dc/41495b5561fb524ca929cdffb5d77d95.bam.bai @ 14.1 M
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total count: 2, Total size: 99.9 G
Successfully mounted file system at /home/ubuntu/downloads/score-client-5.8.1/hello and is now ready for use.
Shut down mount after 5.902 s with a total of 0 connects and 0 B bytes read.
Configuration : score-client info
Active Configuration:
Profile: default
Storage URL: https://api.platform.icgc-argo.org/storage-api
Metadata URL:https://api.platform.icgc-argo.org/storage-api
manifest contents:
repository_code analysis_id object_id file_type file_name file_size md5sum index_object_id donor_id sample_id(s) program_id
song.collab ff3d425f-44ea-4765-bd42-5f44ea0765e5 3f242e1b-7c11-5802-a5dd-d8cca922efca CRAM OCCAMS-GB.DO234195.SA597244.wgs.20210408.aln.cram 81352139083 c4b2998b15406f66d3e9711e482dd566 fedebb35-752c-5c14-8ed7-bb2ee8398e0f DO234195 SA597244 OCCAMS-GB
Mounting : score-client mount --mount-point output_dir --cache-metadata --manifest argo_manifest.tsv
ERROR: Command error: bio.overture.score.client.metadata.EntityNotFoundException: I/O error on GET request for "https://api.platform.icgc-argo.org/storage-api/entities/": Read timed out; nested exception is java.net.SocketTimeoutException: Read timed out
Please check the log for detailed error messages
Moving the `mount` problem (https://github.com/icgc-argo/workflow-roadmap/issues/229#issuecomment-1424556203) into a separate ticket: https://github.com/icgc-argo/workflow-roadmap/issues/326
Possible cause and context: score has code paths specific to 25K; we need to figure out whether they are meant to run here and whether they run correctly.
ARGO's score is accessed through a gateway, so the gateway might be mishandling the requests. We need to check and understand where and how the requests are made.
Containers with appropriate permissions and tokens:
docker run -d -it \
--name dcc-score-client \
-e ACCESSTOKEN=${DCC_ACCESS_TOKEN} \
-e STORAGE_URL=https://storage.cancercollaboratory.org \
-e METADATA_URL=https://song.cancercollaboratory.org \
ghcr.io/overture-stack/score
docker run -d -it \
--name argo-score-client \
-e ACCESSTOKEN=${ARGO_ACCESS_TOKEN} \
-e STORAGE_URL=https://api.platform.icgc-argo.org/storage-api \
-e METADATA_URL=https://api.platform.icgc-argo.org/storage-api \
ghcr.io/overture-stack/score
View files body + header:
docker exec dcc-score-client sh -c "score-client view --object-id ace274bb-059e-55f2-875d-56c18705fe41 --query 1:1-10000"
docker exec argo-score-client sh -c "score-client view --object-id de36d744-e0c6-5f96-a80b-9005e4f69d53 --query chr1:1-10000"
View files header only:
docker exec dcc-score-client sh -c "score-client view --object-id ace274bb-059e-55f2-875d-56c18705fe41 --header-only"
docker exec argo-score-client sh -c "score-client view --object-id de36d744-e0c6-5f96-a80b-9005e4f69d53 --header-only"
Read score-client logs:
docker exec argo-score-client sh -c "cat logs/client.log"
docker exec dcc-score-client sh -c "cat logs/client.log"
This might be a good place to start: https://github.com/overture-stack/score/blob/75253a671778df882fd2508ff35768b15224df39/score-client/src/main/java/bio/overture/score/client/slicing/SamFileBuilder.java#L239
As observed earlier, running:
docker exec argo-score-client sh -c "score-client view --object-id de36d744-e0c6-5f96-a80b-9005e4f69d53 --header-only"
yields the same log output as a header-only run:
2-23 21:22:01,003 [main] INFO b.o.s.c.ClientMain - Started ClientMain in 1.957 seconds (JVM running for 2.866)
2023-02-23 21:22:01,016 [main] INFO session - ***** Beginning view session
2023-02-23 21:22:04,446 [main] INFO b.o.s.c.c.ViewCommand - Constructed SamFileBuilder: SamFileBuilder [containedOnly=false, useOriginalHeader=false, outputFormat=SAM, query=[chr1:1-10000], outputDir=null, outputIndex=false, bedFile=null, session=Logger[session], entity=null, samInputResource=data=SEEKABLE_STREAM:bio.overture.score.client.transport.NullSourceSeekableHTTPStream@5b56b654;index=null, queryCompiledFlag=false]
2023-02-23 21:22:04,446 [main] WARN b.o.s.c.c.ViewCommand - Supplied query or bedfile will not be used since no index is available
2023-02-23 21:22:04,451 [main] INFO session - Adding APGI-AU.DO34584.SA410795.wgs.20210616.aln.header.sam
2023-02-23 21:22:04,451 [main] INFO session - Preparing to write header only to stdout
2023-02-23 21:22:04,604 [main] INFO session - Done
@edsu7 I think this issue has something to do with the absence of reference files.
This occurs specifically for CRAM files for two reasons: 1) the index file URL was missing, and 2) a reference file (stored on the client side) needs to be supplied via the command-line arguments.
@lindaxiang @edsu7 Could you please test in QA? Thank you!
@Buwujiu and @dahiyaAD
Tested on QA and Prod data; both had the same findings.
Needs the following fixes:
`--reference-file` is not enforced when CRAM is detected (test 3)
Docker container on prod data:
docker run -d -it \
--name prod-score-client \
-e ACCESSTOKEN=${token} \
-e STORAGE_URL=https://api.platform.icgc-argo.org/storage-api \
-e METADATA_URL=https://api.platform.icgc-argo.org/storage-api \
--mount type=bind,source="$(pwd)",target=/output \
overture/score:5.8.4
Test 1: Expected behaviour, aside from a weird OpenJDK warning.
cmd:
docker exec prod-score-client sh -c "score-client view --object-id 1bd03148-7c5e-5a03-8271-a931ed2ab5ea"
log:
OpenJDK 64-Bit Server VM warning: Ignoring option --illegal-access=deny; support was removed in 17.0
Running...Viewing...
Validating repository connection...
3372
Test 2: Same output as Test 1.
cmd:
docker exec prod-score-client sh -c "score-client view --object-id 1bd03148-7c5e-5a03-8271-a931ed2ab5ea --header-only"
log:
OpenJDK 64-Bit Server VM warning: Ignoring option --illegal-access=deny; support was removed in 17.0
Running...Viewing...
Validating repository connection...
3372
Test 3: Expected an error to be thrown when `score-client view --query` is invoked without `--reference-file` and CRAM is detected. Also, there are a lot of DEBUG messages; was a setting left on?
cmd:
docker exec prod-score-client sh -c "score-client view --object-id 1bd03148-7c5e-5a03-8271-a931ed2ab5ea --query chr1:10000-20000 | wc -l"
log:
DEBUG 2023-04-20 15:54:04 CompressionHeader FOUND ENCODING: RI_RefId, HUFFMAN, [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2023-04-20 15:54:04 CompressionHeader FOUND ENCODING: SC_SoftClip, BYTE_ARRAY_STOP, [0, 14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2023-04-20 15:54:18 ContainerIO READ CONTAINER: seqID=0, start=9999, span=179550, records=10000, slices=1, blocks=30.
DEBUG 2023-04-20 15:54:19 ContainerParser Adding external data: 30
DEBUG 2023-04-20 15:54:19 ContainerParser Adding external data: 31
DEBUG 2023-04-20 15:54:19 ContainerParser Adding external data: 32
DEBUG 2023-04-20 15:54:20 ContainerParser Slice records read time: 420
ERROR: Command error: Contig chr1 not found in the reference file.
Please check the log for detailed error messages
0
Test 4: The fix works and the CRAM body is output; however, the output is still messy due to the DEBUG messages.
cmd:
docker exec prod-score-client sh -c "score-client view --object-id 1bd03148-7c5e-5a03-8271-a931ed2ab5ea --query chr1:10000-20000 --reference-file /output/references/Homo_sapiens/GATK/GRCh38/Sequence//WholeGenomeFasta/Homo_sapiens_assembly38.fasta | wc -l"
log:
].
DEBUG 2023-04-20 15:59:28 CompressionHeader FOUND ENCODING: SC_SoftClip, BYTE_ARRAY_STOP, [0, 14, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
DEBUG 2023-04-20 15:59:43 ContainerIO READ CONTAINER: seqID=0, start=189449, span=592649, records=10000, slices=1, blocks=29.
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 4281155
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 5783898
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 5456218
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 11
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 12
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 13
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 14
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 15
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 16
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 17
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 19
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 20
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 21
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 22
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 7364966
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 23
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 5063514
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 24
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 26
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 27
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 5788483
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 28
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 29
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 30
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 31
DEBUG 2023-04-20 15:59:43 ContainerParser Adding external data: 32
DEBUG 2023-04-20 15:59:43 ContainerParser Slice records read time: 216
7622
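As a stopgap while the DEBUG output is still enabled, the extra lines can be filtered out of a captured log. This sketch assumes every unwanted line starts with the literal prefix `DEBUG ` as in the logs above; `strip_debug` is a hypothetical helper name, not a score-client option:

```shell
# Drop lines that start with "DEBUG " from the text on stdin.
strip_debug() {
  # awk (unlike grep -v) still exits 0 when every line is filtered out.
  awk '!/^DEBUG /'
}
```

For example, `strip_debug < saved.log` prints a saved log without the DEBUG entries. Continuation lines of a wrapped DEBUG entry (such as a lone `].`) do not carry the prefix and would still pass through.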
Test 5: No mention of CRAM-specific commands in help.
cmd:
docker exec prod-score-client sh -c "score-client view"
log:
ERROR: Bad parameter(s): One of --object-id, --input-file or --manifest must be specified
Tested with the same dataset as Edmund.
1) With a compressed reference file (`*.fa.gz`), I am getting errors.
cmd:
score-client view --object-id 1bd03148-7c5e-5a03-8271-a931ed2ab5ea --query chr1:10000-20000 --reference-file /reference/GRCh38_hla_decoy_ebv/GRCh38_hla_decoy_ebv.fa.gz
output:
DEBUG 2023-04-20 18:37:11 ContainerParser Slice records read time: 164
ERROR 2023-04-20 18:37:12 Slice Reference MD5 mismatch for slice 0:9999-189548, ??9\????...91K_?V^?#[
ERROR 2023-04-20 18:37:12 CRAMIterator Reference sequence MD5 mismatch for slice: seq id 0, start 9999, span 179550, expected MD5 2dc79d6f86544ffd6f8ca69ca6349132
ERROR: Command error: Not a valid Unicode code point: 0xFFFFFFBB
Please check the log for detailed error messages
I downloaded the CRAM file locally, and with the same compressed reference file `samtools view` works properly.
2) If the reference file is provided uncompressed (`*.fa`), the CRAM body can be printed.
cmd:
score-client view --object-id 1bd03148-7c5e-5a03-8271-a931ed2ab5ea --query chr1:1000-20000 --reference-file /reference/GRCh38_hla_decoy_ebv/GRCh38_hla_decoy_ebv.fa
Ideally, `score-client view` should support both compressed and uncompressed reference files. In most cases, reference files will be compressed.
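Until compressed references are handled, one possible workaround is to decompress the reference once and pass the plain `.fa` path to `--reference-file`. The sketch below assumes the `.fa.gz` file can be read by `gzip` (bgzip output normally can) and that there is disk space for the uncompressed copy; `ensure_plain_fasta` is a hypothetical helper, not a score-client feature:

```shell
# Print the path of an uncompressed copy of a FASTA reference.
# A *.gz input is decompressed next to the original, once; other paths pass through.
ensure_plain_fasta() {
  ref=$1
  case "$ref" in
    *.gz)
      plain=${ref%.gz}
      if [ ! -f "$plain" ]; then
        # Write to a temporary name first so a failed run leaves no partial file.
        if gzip -dc "$ref" > "$plain.tmp"; then
          mv "$plain.tmp" "$plain"
        else
          rm -f "$plain.tmp"
          return 1
        fi
      fi
      printf '%s\n' "$plain"
      ;;
    *)
      printf '%s\n' "$ref"
      ;;
  esac
}

# Usage (paths taken from the commands above):
#   ref=$(ensure_plain_fasta /reference/GRCh38_hla_decoy_ebv/GRCh38_hla_decoy_ebv.fa.gz)
#   score-client view --object-id 1bd03148-7c5e-5a03-8271-a931ed2ab5ea \
#     --query chr1:10000-20000 --reference-file "$ref"
```

This only sidesteps the problem on the client side; it does not address why the compressed reference triggers the MD5 mismatch.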
Testing complete; the feedback that needs to be addressed is tracked in https://github.com/icgc-argo/workflow-roadmap/issues/346. Moving to awaiting release.
Summary
SCORE-CLIENT commands + logs (when command did not execute)
Initialize
Download object_id/manifest
View
Log
Mount
Log