Closed casslitch closed 3 months ago
nf-core lint
overall result: Passed :white_check_mark: :warning:Posted for pipeline commit d0a920f
+| ✅ 177 tests passed |+
#| ❔ 5 tests were ignored |#
!| ❗ 15 tests had warnings |!
Thanks for opening the PR! I don't think this affects users where the GRIDSS index was prepared with the builtin functionality but understand the current set up may have issues for externally created GRIDSS indexes that contain clashing file names with other staged files as you've pointed out.
I'm not sure what the best approach is here - only support precisely the expected GRIDSS index fileset or support all/common deviations from these expectations?
Leaning towards the former but will accommodate in the case that you feel other users may experience the same issue. If that's the case, I'd suggest one of two approaches:
find -L <dir> -regex '.*\.\(amb\|ann\|pac\|gridsscache\|sa\|bwa\|img\|alt\)'
ln -sf <args>
While the former requires strong alignment with expected files I'd probably prefer it since I find it safer than forceful replacement. And if making this change it would be good to apply to all instances where the GRIDSS index is used for consistency.
Hi Stephen,
Thanks for looking into this! I agree, this won't affect users who've generated the index from the built-in functionality. My thinking was that other users might be affected if they've pre-generated the gridss index themselves.
I like the first option of explicitly including all index files.
However, based on the current module, the genome_bwa_index folder isn't staged. Therefore only the files in the genome_gridss_index folder will be picked up (I think that's just the img and gridsscache files). Are the bwa index files required for this module? If so, I think 'path genome_bwa_index' needs to be added the inputs.
Are the bwa index files required for this module? If so, I think 'path genome_bwa_index' needs to be added the inputs.
Ah, the reference genome indexes have been rearranged recently and the new GRIDSS index directory contains the following files:
*.gridsscache
*.img
*.amb
, *.ann
, *.pac
, *.sa
, *.bwt
*.alt
Sorry for the confusion!
My thinking was that other users might be affected if they've pre-generated the gridss index themselves. I like the first option of explicitly including all index files.
Okay, let's make this change then. Can you adjust each of the symlink/find commands for the GRIDSS index directory on your PR branch according to the first option above and also change the merge PR base branch to dev
? Once merged, I'll cherry-pick your commits into features-block-one-preview
.
Thanks Stephen, sounds good! Thanks for describing the new structure. Just to clarify, should I still add genome_bwa_index to the inputs for this module? For example, if the user has pre-generated the bwa index files and they sit in a different folder to genome_gridss_index, then they won't be staged. I agree this won't affect users who are using the downloaded genome_gridss_index folder which contains all the necessary files.
No, the user will need to place all the BWA index files under the same GRIDSS index directory themselves. I think this is okay since (1) it is required to create the GRIDSS index files anyway, and (2) the BWA index files are not used anywhere else in the workflow.
Once the changes have been made, let's review and test!
Thanks Stephen for explaining, agree with that logic!
I'm going to rebase your commits on top of dev
and then force push - you may need to delete your local branch and fetch it again.
I noticed I made a typo in the find
command for 'bwa', that should be 'bwt' instead. If you can update that I'll test these changes tomorrow.
The 'Run pipeline stubs' check is showing that the new back-slashes need to be escaped:
find -L ${genome_gridss_index} -regex '.*\\.\\(amb\\|ann\\|pac\\|gridsscache\\|sa\\|bwt\\|img\\|alt\\)'
Okay, changes now look good.
I reproduced the issue with the dev
branch using simulated WGS t/n data then successfully tested these changes for the GRCh38_hmf reference genome with both the default GRIDSS index directory and a 'custom' one that contains .{fai,dict}
files.
genome_gridss_index folder may contain additional files that are staged by nextflow if they are part of the inputs (e.g. someone might have the .dict file in this folder if they created it via gridss.PrepareReference). This causes an error when we try to symlink the entire contents of the folder. Suggestion: symlink only specific files of interest.