Closed by skanwal 1 year ago
@swetansuILMN from the sapac link above, it appears there isn't a non-graph based version of the reference for v3.9?
Moving forward with DRAGEN 4.0 ("Illumina DRAGEN Reference Genome hg38 (alt-masked+cnv+hla+rnav2)"), the link is a 6 GB "bin" file: a bash script header with a compressed tar file embedded below it.
bash hg38-alt-masked-cnv-hla-rna-8-r2.0-1.run --help
Makeself version 2.4.3
1) Getting help or info about hg38-alt-masked-cnv-hla-rna-8-r2.0-1.run :
hg38-alt-masked-cnv-hla-rna-8-r2.0-1.run --help Print this message
hg38-alt-masked-cnv-hla-rna-8-r2.0-1.run --info Print embedded info : title, default target directory, embedded script ...
hg38-alt-masked-cnv-hla-rna-8-r2.0-1.run --lsm Print embedded lsm entry (or no LSM)
hg38-alt-masked-cnv-hla-rna-8-r2.0-1.run --list Print the list of files in the archive
hg38-alt-masked-cnv-hla-rna-8-r2.0-1.run --check Checks integrity of the archive
2) Running hg38-alt-masked-cnv-hla-rna-8-r2.0-1.run :
hg38-alt-masked-cnv-hla-rna-8-r2.0-1.run [options] [--] [additional arguments to embedded script]
with following options (in that order)
--confirm Ask before running embedded script
--quiet Do not print anything except error messages
--accept Accept the license
--noexec Do not run embedded script (implies --noexec-cleanup)
--noexec-cleanup Do not run embedded cleanup script
--keep Do not erase target directory after running
the embedded script
--noprogress Do not show the progress during the decompression
--nox11 Do not spawn an xterm
--nochown Do not give the target folder to the current user
--chown Give the target folder to the current user recursively
--nodiskspace Do not check for available disk space
--target dir Extract directly to a target directory (absolute or relative)
This directory may undergo recursive chown (see --nochown).
--tar arg1 [arg2 ...] Access the contents of the archive through the tar command
--ssl-pass-src src Use the given src as the source of password to decrypt the data
using OpenSSL. See "PASS PHRASE ARGUMENTS" in man openssl.
Default is to prompt the user to enter decryption password
on the current terminal.
--cleanup-args args Arguments to the cleanup script. Wrap in quotes to provide
multiple arguments.
-- Following arguments will be passed to the embedded script
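Going by the help output above, extraction into a named directory could be sketched roughly as below. This only uses the `--noexec` and `--target` options listed in the help text; whether the embedded script can safely be skipped for this package is an assumption, and `build_extract_command` / `extract_reference` are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def build_extract_command(run_file: Path, target_dir: Path) -> list[str]:
    """Build the Makeself invocation that unpacks into target_dir.

    Assumption: skipping the embedded script (--noexec) is fine for this package.
    Options are given in the order the help text lists them.
    """
    return ["bash", str(run_file), "--noexec", "--target", str(target_dir)]


def extract_reference(run_file: Path, target_dir: Path) -> None:
    """Create target_dir and unpack the .run archive into it."""
    target_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the extraction fails
    subprocess.run(build_extract_command(run_file, target_dir), check=True)
```

Extracting with an explicit `--target` keeps the files together in one directory instead of spilling them into the current working directory.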
Having a look at the contents of the tarball, it's essentially a tar-bomb.
Target directory: .
-rw-r--r-- ihurst/forest_users 1394472 2022-06-08 10:06 ./anchored_hla/hash_table.cfg
-rw-r--r-- ihurst/forest_users 293209 2022-06-08 10:06 ./anchored_hla/hash_table.cfg.bin
-rw-r--r-- ihurst/forest_users 66390032 2022-06-08 10:06 ./anchored_hla/hash_table.cmp
-rw-r--r-- ihurst/forest_users 18432 2022-06-08 10:06 ./anchored_hla/hash_table_stats.txt
-rw-r--r-- ihurst/forest_users 494336 2022-06-08 10:06 ./anchored_hla/ref_index.bin
-rw-r--r-- ihurst/forest_users 15818752 2022-06-08 10:06 ./anchored_hla/reference.bin
-rw-r--r-- ihurst/forest_users 3954688 2022-06-08 10:06 ./anchored_hla/repeat_mask.bin
-rw-r--r-- ihurst/forest_users 561952 2022-06-08 10:06 ./anchored_hla/str_table.bin
-rw-r--r-- ihurst/forest_users 623563 2022-06-08 10:06 ./anchored_rna/hash_table.cfg
-rw-r--r-- ihurst/forest_users 149057 2022-06-08 10:06 ./anchored_rna/hash_table.cfg.bin
-rw-r--r-- ihurst/forest_users 1581923491 2022-06-08 10:06 ./anchored_rna/hash_table.cmp
-rw-r--r-- ihurst/forest_users 7501 2022-06-08 10:06 ./anchored_rna/hash_table_stats.txt
-rw-r--r-- ihurst/forest_users 92074 2022-06-08 10:06 ./anchored_rna/mask.bed
-rw-r--r-- ihurst/forest_users 49238272 2022-06-08 10:06 ./anchored_rna/ref_index.bin
-rw-r--r-- ihurst/forest_users 1575622656 2022-06-08 10:06 ./anchored_rna/reference.bin
-rw-r--r-- ihurst/forest_users 393905664 2022-06-08 10:06 ./anchored_rna/repeat_mask.bin
-rw-r--r-- ihurst/forest_users 111013696 2022-06-08 10:06 ./anchored_rna/str_table.bin
-rw-r--r-- ihurst/forest_users 623522 2022-06-08 10:06 ./hash_table.cfg
-rw-r--r-- ihurst/forest_users 149005 2022-06-08 10:06 ./hash_table.cfg.bin
-rw-r--r-- ihurst/forest_users 4417420773 2022-06-08 10:06 ./hash_table.cmp
-rw-r--r-- ihurst/forest_users 14441 2022-06-08 10:06 ./hash_table_stats.txt
-rw-rw-r-- ihurst/forest_users 883 2022-06-15 04:51 ./info.json
-rw-r--r-- ihurst/forest_users 402276092 2022-06-08 10:06 ./kmer_cnv.bin
-rw-r--r-- ihurst/forest_users 92074 2022-06-08 10:06 ./mask.bed
-rw-rw-r-- ihurst/forest_users 1597 2022-06-15 04:51 ./md5sum
-rw-r--r-- ihurst/forest_users 49238272 2022-06-08 10:06 ./ref_index.bin
-rw-r--r-- ihurst/forest_users 1575622656 2022-06-08 10:06 ./reference.bin
-rw-r--r-- ihurst/forest_users 393905664 2022-06-08 10:06 ./repeat_mask.bin
-rw-r--r-- ihurst/forest_users 111013696 2022-06-08 10:06 ./str_table.bin
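A quick way to check for this programmatically is to count the distinct top-level entries in the archive. This is a minimal sketch using Python's `tarfile`; `top_level_entries`, `is_tar_bomb` and `make_demo_archive` are illustrative names, and the demo archive only mimics a few of the paths from the listing above with empty files.

```python
import io
import tarfile
from pathlib import PurePosixPath


def top_level_entries(tar: tarfile.TarFile) -> set[str]:
    """Return the distinct first path components of all archive members."""
    tops = set()
    for name in tar.getnames():
        parts = PurePosixPath(name).parts  # a leading "./" is dropped by pathlib
        if parts:
            tops.add(parts[0])
    return tops


def is_tar_bomb(tar: tarfile.TarFile) -> bool:
    """True if the archive unpacks more than one entry into the target directory."""
    return len(top_level_entries(tar)) > 1


def make_demo_archive(names: list[str]) -> tarfile.TarFile:
    """Build an in-memory tar of zero-byte files, for demonstration only."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            tar.addfile(tarfile.TarInfo(name))
    buf.seek(0)
    return tarfile.open(fileobj=buf, mode="r")


flat = make_demo_archive(["./anchored_hla/hash_table.cfg", "./hash_table.cfg", "./info.json"])
print(is_tar_bomb(flat))
```

An archive whose members all sit under a single directory (for example `hg38/...`) would not be flagged.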
In the past we've placed these under a directory that has the 'hg38' keyword in its name, so that if the VCF is uploaded to VI, VI will know which reference to use. We also rely on the compressed tarball having a stem identical to the directory it contains.
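To make the convention concrete, repacking the flat extraction could look something like the sketch below. It assumes gzip compression and a `.tar.gz` suffix; `pack_with_matching_stem` and its `keyword` parameter are hypothetical, not something from the existing pipeline.

```python
import tarfile
from pathlib import Path


def pack_with_matching_stem(flat_dir: Path, name: str, out_dir: Path,
                            keyword: str = "hg38") -> Path:
    """Write out_dir/<name>.tar.gz whose single top-level directory is <name>/.

    The keyword check mirrors the convention that the directory name must
    contain 'hg38' so the reference can be recognised downstream.
    """
    if keyword not in name:
        raise ValueError(f"reference name {name!r} does not contain {keyword!r}")
    out_path = out_dir / f"{name}.tar.gz"
    with tarfile.open(out_path, "w:gz") as tar:
        # arcname renames the top-level directory inside the archive,
        # so the tarball stem and the directory it unpacks to agree
        tar.add(flat_dir, arcname=name)
    return out_path
```

Unpacking the result always yields one directory named after the tarball stem, which avoids the tar-bomb layout of the original package.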
There should be a GH issue around for this somewhere; for now the reference is under "Adjust DRAGEN's reference name" in Trello. According to the UMCCR-ILLUMINA minutes, this was discussed between Feb and May of 2020.
The info.json file reads as below:
{
"hashtable_version": "8",
"components": {
"alt_aware": false,
"alt_masked": true,
"cnv": true,
"graph": false,
"hla": true,
"methylation": false,
"rna": true
},
"digests": {
"digest_type": "1",
"digest": "0xD8BA5C99",
"ref_digest": "0x20E4586C",
"ref_index_digest": "0x1EC9AAE2",
"hash_digest": "0x612FDED3",
"liftover_digest": "0x00000000",
"extend_table_digest": "0x37DC7A78",
"masked_ref_digest": "0xB902C17B",
"mask_bed_digest": "0xEDB1FFA7"
},
"package_info": {
"filename": "hg38+alt_masked+cnv+hla+rna-8-r2.0-1.run",
"name": "hg38+alt_masked+cnv+hla+rna-8-r2.0",
"stem": "hg38",
"components": "+alt_masked+cnv+hla+rna",
"version": "r2.0",
"format": "1",
"fasta": "hg38.fa",
"description": "hg38+alt_masked+cnv+hla+rna-8-r2.0",
"build_timestamp": 1655232481
}
}
Which suggests we ought to name the tarball hg38+alt_masked+cnv+hla+rna-8-r2.0. Having + is a bit icky but I don't think it's going to cause any issues.
Hence we can create a cwltool that:
@alexiswl is it reasonable to assume that the current hg38 reference (hg38-v8-altaware-cnv-anchored for DRAGEN v3.9.3) used by the UMCCR somatic T/N pipeline is a graph genome?
As discussed in the meeting today: no, we're using the non-graph version of hg38-v8-altaware-cnv-anchored from https://s3.amazonaws.com/use1-prd-seq-hub-appdata/Edico_v8/hg38_altaware-cnv-anchored.v8.tar
Resolved by https://github.com/umccr/cwl-ica/pull/160
DRAGEN has released new reference genome files and recommends using these for DRAGEN v3.9 and onwards: https://sapac.illumina.com/science/genomics-research/articles/dragen-demystifying-reference-genomes.html (courtesy of Swetansu).
These can be downloaded from https://sapac.support.illumina.com/downloads/dragen-reference-genomes-hg38.html. This will need to go through evaluation and comparison.