umccr / cwl-ica

A collection of cwl-ica workflows along with a user guide for the commands to use and contributions guide
MIT License

New Dragen reference #147

Closed skanwal closed 1 year ago

skanwal commented 2 years ago

Dragen has released new reference genome files and recommends using them for DRAGEN v3.9 and onwards: https://sapac.illumina.com/science/genomics-research/articles/dragen-demystifying-reference-genomes.html (courtesy of Swetansu).

These can be downloaded from https://sapac.support.illumina.com/downloads/dragen-reference-genomes-hg38.html. They will need to go through evaluation and comparison before we adopt them.

alexiswl commented 2 years ago

@swetansuILMN from the sapac link above, it appears there isn't a non-graph based version of the reference for v3.9?

Moving forward with DRAGEN 4.0 (Illumina DRAGEN Reference Genome hg38, alt-masked+cnv+hla+rnav2), the link is a 6 GB "bin" file: a bash script header with a compressed tar file embedded below it.

 bash hg38-alt-masked-cnv-hla-rna-8-r2.0-1.run --help
Makeself version 2.4.3
 1) Getting help or info about hg38-alt-masked-cnv-hla-rna-8-r2.0-1.run :
  hg38-alt-masked-cnv-hla-rna-8-r2.0-1.run --help   Print this message
  hg38-alt-masked-cnv-hla-rna-8-r2.0-1.run --info   Print embedded info : title, default target directory, embedded script ...
  hg38-alt-masked-cnv-hla-rna-8-r2.0-1.run --lsm    Print embedded lsm entry (or no LSM)
  hg38-alt-masked-cnv-hla-rna-8-r2.0-1.run --list   Print the list of files in the archive
  hg38-alt-masked-cnv-hla-rna-8-r2.0-1.run --check  Checks integrity of the archive

 2) Running hg38-alt-masked-cnv-hla-rna-8-r2.0-1.run :
  hg38-alt-masked-cnv-hla-rna-8-r2.0-1.run [options] [--] [additional arguments to embedded script]
  with following options (in that order)
  --confirm             Ask before running embedded script
  --quiet               Do not print anything except error messages
  --accept              Accept the license
  --noexec              Do not run embedded script (implies --noexec-cleanup)
  --noexec-cleanup      Do not run embedded cleanup script
  --keep                Do not erase target directory after running
                        the embedded script
  --noprogress          Do not show the progress during the decompression
  --nox11               Do not spawn an xterm
  --nochown             Do not give the target folder to the current user
  --chown               Give the target folder to the current user recursively
  --nodiskspace         Do not check for available disk space
  --target dir          Extract directly to a target directory (absolute or relative)
                        This directory may undergo recursive chown (see --nochown).
  --tar arg1 [arg2 ...] Access the contents of the archive through the tar command
  --ssl-pass-src src    Use the given src as the source of password to decrypt the data
                        using OpenSSL. See "PASS PHRASE ARGUMENTS" in man openssl.
                        Default is to prompt the user to enter decryption password
                        on the current terminal.
  --cleanup-args args   Arguments to the cleanup script. Wrap in quotes to provide
                        multiple arguments.
  --                    Following arguments will be passed to the embedded script

Having a look at the contents of the tarball, it's essentially a tar-bomb: everything sits at the top level rather than under a single directory.

Target directory: .
-rw-r--r-- ihurst/forest_users 1394472 2022-06-08 10:06 ./anchored_hla/hash_table.cfg
-rw-r--r-- ihurst/forest_users  293209 2022-06-08 10:06 ./anchored_hla/hash_table.cfg.bin
-rw-r--r-- ihurst/forest_users 66390032 2022-06-08 10:06 ./anchored_hla/hash_table.cmp
-rw-r--r-- ihurst/forest_users    18432 2022-06-08 10:06 ./anchored_hla/hash_table_stats.txt
-rw-r--r-- ihurst/forest_users   494336 2022-06-08 10:06 ./anchored_hla/ref_index.bin
-rw-r--r-- ihurst/forest_users 15818752 2022-06-08 10:06 ./anchored_hla/reference.bin
-rw-r--r-- ihurst/forest_users  3954688 2022-06-08 10:06 ./anchored_hla/repeat_mask.bin
-rw-r--r-- ihurst/forest_users   561952 2022-06-08 10:06 ./anchored_hla/str_table.bin
-rw-r--r-- ihurst/forest_users   623563 2022-06-08 10:06 ./anchored_rna/hash_table.cfg
-rw-r--r-- ihurst/forest_users   149057 2022-06-08 10:06 ./anchored_rna/hash_table.cfg.bin
-rw-r--r-- ihurst/forest_users 1581923491 2022-06-08 10:06 ./anchored_rna/hash_table.cmp
-rw-r--r-- ihurst/forest_users       7501 2022-06-08 10:06 ./anchored_rna/hash_table_stats.txt
-rw-r--r-- ihurst/forest_users      92074 2022-06-08 10:06 ./anchored_rna/mask.bed
-rw-r--r-- ihurst/forest_users   49238272 2022-06-08 10:06 ./anchored_rna/ref_index.bin
-rw-r--r-- ihurst/forest_users 1575622656 2022-06-08 10:06 ./anchored_rna/reference.bin
-rw-r--r-- ihurst/forest_users  393905664 2022-06-08 10:06 ./anchored_rna/repeat_mask.bin
-rw-r--r-- ihurst/forest_users  111013696 2022-06-08 10:06 ./anchored_rna/str_table.bin
-rw-r--r-- ihurst/forest_users     623522 2022-06-08 10:06 ./hash_table.cfg
-rw-r--r-- ihurst/forest_users     149005 2022-06-08 10:06 ./hash_table.cfg.bin
-rw-r--r-- ihurst/forest_users 4417420773 2022-06-08 10:06 ./hash_table.cmp
-rw-r--r-- ihurst/forest_users      14441 2022-06-08 10:06 ./hash_table_stats.txt
-rw-rw-r-- ihurst/forest_users        883 2022-06-15 04:51 ./info.json
-rw-r--r-- ihurst/forest_users  402276092 2022-06-08 10:06 ./kmer_cnv.bin
-rw-r--r-- ihurst/forest_users      92074 2022-06-08 10:06 ./mask.bed
-rw-rw-r-- ihurst/forest_users       1597 2022-06-15 04:51 ./md5sum
-rw-r--r-- ihurst/forest_users   49238272 2022-06-08 10:06 ./ref_index.bin
-rw-r--r-- ihurst/forest_users 1575622656 2022-06-08 10:06 ./reference.bin
-rw-r--r-- ihurst/forest_users  393905664 2022-06-08 10:06 ./repeat_mask.bin
-rw-r--r-- ihurst/forest_users  111013696 2022-06-08 10:06 ./str_table.bin
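
As an illustration of the tar-bomb check above, here is a minimal Python sketch that counts the distinct top-level entries of an archive. The function names `top_level_entries` and `is_tar_bomb` are hypothetical and not part of cwl-ica; the sketch assumes a plain tar stream such as the one embedded in the `.run` file.

```python
import tarfile
from pathlib import PurePosixPath


def top_level_entries(tar: tarfile.TarFile) -> set[str]:
    """Return the distinct first path components of all archive members."""
    entries = set()
    for member in tar.getmembers():
        # PurePosixPath drops a leading "./", so "./info.json" -> ("info.json",)
        parts = PurePosixPath(member.name).parts
        if parts:
            entries.add(parts[0])
    return entries


def is_tar_bomb(tar: tarfile.TarFile) -> bool:
    """True when extraction would scatter more than one entry into the target."""
    return len(top_level_entries(tar)) > 1
```

For the listing above, the top-level entries would include `anchored_hla`, `anchored_rna`, `hash_table.cfg`, `info.json` and so on, so the check reports a tar-bomb; an archive whose members all sit under one directory would not.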

In the past we've placed these under a directory that has the 'hg38' keyword so that if the vcf is uploaded to VI, VI will know which reference to use. We also rely on the compressed tarball having a stem identical to the directory it compacts.

There should be a GH issue for this somewhere; for now the reference is under "Adjust DRAGEN's reference name" in Trello. According to the UMCCR-ILLUMINA minutes, this was discussed between Feb and May of 2020.

The info.json file reads as below:

{
  "hashtable_version": "8",
  "components": {
    "alt_aware": false,
    "alt_masked": true,
    "cnv": true,
    "graph": false,
    "hla": true,
    "methylation": false,
    "rna": true
  },
  "digests": {
    "digest_type": "1",
    "digest": "0xD8BA5C99",
    "ref_digest": "0x20E4586C",
    "ref_index_digest": "0x1EC9AAE2",
    "hash_digest": "0x612FDED3",
    "liftover_digest": "0x00000000",
    "extend_table_digest": "0x37DC7A78",
    "masked_ref_digest": "0xB902C17B",
    "mask_bed_digest": "0xEDB1FFA7"
  },
  "package_info": {
    "filename": "hg38+alt_masked+cnv+hla+rna-8-r2.0-1.run",
    "name": "hg38+alt_masked+cnv+hla+rna-8-r2.0",
    "stem": "hg38",
    "components": "+alt_masked+cnv+hla+rna",
    "version": "r2.0",
    "format": "1",
    "fasta": "hg38.fa",
    "description": "hg38+alt_masked+cnv+hla+rna-8-r2.0",
    "build_timestamp": 1655232481
  }
}

Which suggests we ought to name the tarball hg38+alt_masked+cnv+hla+rna-8-r2.0. Having + is a bit icky but I don't think it's going to cause any issues.
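
To make the naming idea concrete, here is a small Python sketch that reads the directory and tarball names from `info.json` instead of hard-coding them. The helpers `enabled_components` and `archive_names` are hypothetical, and the `hg38` keyword check simply mirrors the convention described earlier in this comment; it is an assumption about how one might enforce it, not existing behaviour.

```python
import json


def enabled_components(info: dict) -> list[str]:
    """List the components flagged true in the "components" block."""
    return sorted(name for name, on in info["components"].items() if on)


def archive_names(info: dict, keyword: str = "hg38") -> tuple[str, str]:
    """Return (directory name, tarball name) derived from package_info.name."""
    name = info["package_info"]["name"]
    if keyword not in name:
        # The directory name is expected to carry the reference keyword
        raise ValueError(f"{name!r} does not contain {keyword!r}")
    return name, f"{name}.tar.gz"


# Usage sketch: info = json.load(open("info.json")); archive_names(info)
```

With the `info.json` shown above, this would give the directory `hg38+alt_masked+cnv+hla+rna-8-r2.0` and the tarball `hg38+alt_masked+cnv+hla+rna-8-r2.0.tar.gz`, so the tarball stem matches the directory it contains.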

Hence we can create a CWL tool that:

  1. Takes the bin file as input
  2. Takes the directory name (file stem) as input
  3. Executes the bin file with the --target parameter set as the directory name input
  4. Tars up the output directory with standard gzip compression
  5. And globs .tar.gz as the File output.
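
The steps above can be sketched in Python as follows. This only illustrates the logic and is not the CWL tool itself: the function names are hypothetical, `--target` is the only installer flag used (taken from the `--help` output above), and it is an assumption that no further flags are needed for a non-interactive run.

```python
import subprocess
import tarfile
from pathlib import Path


def extract_command(run_file: Path, target_dir: str) -> list[str]:
    """Build the command that unpacks the .run file into target_dir."""
    return ["bash", str(run_file), "--target", target_dir]


def pack_directory(directory: Path) -> Path:
    """Gzip-tar a directory so the tarball stem equals the directory name."""
    # Build the name by hand: Path.with_suffix would mangle names like "r2.0"
    tarball = directory.parent / f"{directory.name}.tar.gz"
    with tarfile.open(tarball, "w:gz") as tar:
        tar.add(directory, arcname=directory.name)
    return tarball


def build_reference_tarball(run_file: Path, dir_name: str, workdir: Path) -> Path:
    """Steps 1-5: unpack into workdir/dir_name, then return the .tar.gz path."""
    target = workdir / dir_name
    subprocess.run(extract_command(run_file, str(target)), check=True)
    return pack_directory(target)
```

Because the archive members are stored under `dir_name/`, the result is no longer a tar-bomb, and the returned path plays the role of the globbed File output.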

swetansuILMN commented 2 years ago

@alexiswl is it reasonable to assume that the current hg38 reference (hg38-v8-altaware-cnv-anchored for DRAGEN v3.9.3) used by the UMCCR somatic T/N pipeline is a graph genome?

alexiswl commented 2 years ago

> @alexiswl is it reasonable to assume that the current hg38 reference (hg38-v8-altaware-cnv-anchored for DRAGEN v3.9.3) used by the UMCCR somatic T/N pipeline is a graph genome?

As discussed in the meeting today: no, we're using the non-graph version of hg38-v8-altaware-cnv-anchored from https://s3.amazonaws.com/use1-prd-seq-hub-appdata/Edico_v8/hg38_altaware-cnv-anchored.v8.tar

alexiswl commented 1 year ago

Resolved by https://github.com/umccr/cwl-ica/pull/160