~Missing hisat2 index file reports an error~ Improve UX for genome concatenation

weir12 commented 1 year ago

Hi, when I run rnaflow, the task monitor reports the following error:

Mar-12 14:23:07.942 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'preprocess_illumina:hisat2index'

Caused by:
  Missing output file(s) `reference*.ht2` expected by process `preprocess_illumina:hisat2index`

Command executed:

  hisat2-build -p 1 reference.fa reference

Command exit status:
  0

Command output:
    bucket 6: 10%
    bucket 6: 20%
    bucket 6: 30%
    bucket 6: 40%
    bucket 6: 50%
    bucket 6: 60%
    bucket 6: 70%
    bucket 6: 80%
    bucket 6: 90%
    bucket 6: 100%
    Sorting block of length 627514293 for bucket 6
    (Using difference cover)
    Sorting block time: 00:14:00
  Returning block of 627514294 for bucket 6
  Getting block 7 of 8
    Reserving size (1104693400) for bucket 7
    Calculating Z arrays for bucket 7
    Entering block accumulator loop for bucket 7:
    bucket 7: 10%
    bucket 7: 20%
    bucket 7: 30%
    bucket 7: 40%
    bucket 7: 50%
    bucket 7: 60%
    bucket 7: 70%
    bucket 7: 80%
    bucket 7: 90%
    bucket 7: 100%
    Sorting block of length 759758705 for bucket 7
    (Using difference cover)
    Sorting block time: 00:17:06
  Returning block of 759758706 for bucket 7
  Getting block 8 of 8
    Reserving size (1104693400) for bucket 8
    Calculating Z arrays for bucket 8
    Entering block accumulator loop for bucket 8:
    bucket 8: 10%
    bucket 8: 20%
    bucket 8: 30%
    bucket 8: 40%
    bucket 8: 50%
    bucket 8: 60%
    bucket 8: 70%
    bucket 8: 80%
    bucket 8: 90%
    bucket 8: 100%
    Sorting block of length 836622383 for bucket 8
    (Using difference cover)
    Sorting block time: 00:19:00
  Returning block of 836622384 for bucket 8

Command error:
    Doing ahead-of-time memory usage test
    Passed!  Constructing with these parameters: --bmax 1104693400 --dcv 1024
  Constructing suffix-array element generator
  Converting suffix-array elements to index image
  Allocating ftab, absorbFtab
  Entering GFM loop
  Exited GFM loop
  fchr[A]: 0
  fchr[C]: 1739307686
  fchr[G]: 2940486528
  fchr[T]: 4146834748
  fchr[$]: 5891698134
  Exiting GFM::buildToDisk()
  Returning from initFromVector
  Wrote 1972375105 bytes to primary GFM file: reference.1.ht2l
  Wrote 2945849076 bytes to secondary GFM file: reference.2.ht2l
  Re-opening _in1 and _in2 as input streams
  Returning from GFM constructor
  Returning from initFromVector
  Wrote 2591968457 bytes to primary GFM file: reference.5.ht2l
  Wrote 1499887120 bytes to secondary GFM file: reference.6.ht2l
  Re-opening _in5 and _in5 as input streams
  Returning from HierEbwt constructor
  Headers:
      len: 5891698134
      gbwtLen: 5891698135
      nodes: 5891698135
      sz: 1472924534
      gbwtSz: 1472924534
      lineRate: 7
      offRate: 4
      offMask: 0xfffffffffffffff0
      ftabChars: 10
      eftabLen: 0
      eftabSz: 0
      ftabLen: 1048577
      ftabSz: 8388616
      offsLen: 368231134
      offsSz: 2945849072
      lineSz: 128
      sideSz: 128
      sideGbwtSz: 96
      sideGbwtLen: 384
      numSides: 15342964
      numLines: 15342964
      gbwtTotLen: 1963899392
      gbwtTotSz: 1963899392
      reverse: 0
      linearFM: Yes
  Total time for call to driver() for forward index: 03:15:53

Work dir:
  /cluster/home/ouliang/project/guo/26/914f5ad6e2d0ef2287d9987937a86a

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
Mar-12 14:23:07.952 [Task monitor] DEBUG nextflow.Session - Session aborted -- Cause: Missing output file(s) `reference*.ht2` expected by process `preprocess_illumina:hisat2index`

I checked the working directory of this step and reran the pipeline with the same parameters. The index files generated by hisat2-build end in .ht2l, not the .ht2 expected in the nf file. It turns out that hisat2 uses a different index suffix for large genomes, and my reference genome is Homo sapiens. "In the case of a large index these suffixes will have a ht2l termination." https://github.com/DaehwanKimLab/hisat2 Thanks!
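To illustrate the suffix difference described above, here is a minimal sketch (not rnaflow's actual code) that collects the index files for a base name whether hisat2-build wrote the small `.ht2` or the large `.ht2l` variant. The function name `find_hisat2_index` and the default base name are assumptions made for this example.

```python
from pathlib import Path


def find_hisat2_index(workdir, basename="reference"):
    """Return HISAT2 index files, accepting both .ht2 and .ht2l suffixes."""
    workdir = Path(workdir)
    # hisat2-build writes <basename>.N.ht2 for small indexes and
    # <basename>.N.ht2l for large ones, so look for both patterns.
    small = sorted(workdir.glob(f"{basename}.*.ht2"))
    large = sorted(workdir.glob(f"{basename}.*.ht2l"))
    return small + large
```

In the log above, `reference.1.ht2l` would be found by the second pattern but not by the first, which is the same reason a `reference*.ht2` output glob reports missing files.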

weir12 commented 1 year ago

The problem has been solved. I misunderstood the meaning of `--autodownload hsa`. This option automatically downloads the Homo sapiens reference genome and annotation file and concatenates them with the genome and annotation files supplied via `--gene fastas.csv --annotation gtfs.csv`. Because I also supplied the human genome there, the concatenated reference was twice the size (two identical copies of the human genome), which is why hisat2 built a large index.

Could you add a genome de-duplication step, or warn users when the workflow receives duplicate genome files, before the genomes are concatenated?
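A minimal sketch of what such a warning could look like, assuming the check runs on the list of FASTA paths before they are concatenated. This is an illustration only, not part of rnaflow; the function names `file_sha256` and `find_duplicate_genomes` are hypothetical.

```python
import hashlib
from collections import defaultdict


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large genomes are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_genomes(fasta_paths):
    """Group paths by content hash and return groups with more than one file."""
    by_hash = defaultdict(list)
    for path in fasta_paths:
        by_hash[file_sha256(path)].append(str(path))
    return [group for group in by_hash.values() if len(group) > 1]
```

Each returned group could be printed as a warning before concatenation. Note that this only catches byte-identical files: the same genome in a different release, with different headers, or compressed differently would not be detected.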

hoelzer commented 1 year ago

Hey @weir12, thanks for reporting! Yes, that is exactly how it works. We can mention that more prominently in the README and also in the --help message.

hoelzer-lab / rnaflow

~Missing hisat2 index file reports an error~ Improve UX for genome concatenation #214