rdmorin closed 11 months ago
@lkhilton noted that the error refers to a file that doesn't exist. The indexed file should exist, and if it doesn't, it should be created. Someone will need to determine whether the function is just looking for a file that now has a different name, in which case the fix may only involve a config update or a minor code change.
Thanks, I will look into this!
After looking into this, there are some discrepancies with the name of the file the function looks for (with mode = strelka2) and what's available under:
/projects/nhl_meta_analysis_scratch/gambl/results_local/icgc_dart/strelka-1.1_vcf2maf-1.2/level_3/
The function is looking for the following file:
final_merged_grch37.maf.bgz
and these are the closest relatives to that file:
final_merged_grch37.bed.gz
final_merged_grch37.bed.gz.tbi
final_merged_grch37.maf
Would we want to convert the available .gz files for strelka2 output to .bgz? If not, I could update this function to read from the available .gz file (if mode = "strelka2"), i.e.:
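# Build the path to the tabix-indexed file for the requested mode.
if(mode == "slims-3"){
  full_maf_path_comp = paste0(base_path, maf_path, ".bgz")
}else if(mode == "strelka2"){
  # Only the bed version is indexed for strelka2, so swap the trailing .maf for .bed.
  full_maf_path_comp = paste0(base_path, sub("\\.maf$", ".bed", maf_path), ".gz")
}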
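A note on swapping maf for bed in the path, as a small sketch: an unanchored substitution on the full path would also rewrite the vcf2maf-1.2 directory name, so anchoring the pattern to the file extension is safer. The way the path is split into base_path and maf_path here is an assumption for illustration; the real values come from the config.

```r
# Assumed split of the path from this thread into base_path and maf_path.
base_path <- "/projects/nhl_meta_analysis_scratch/gambl/results_local/"
maf_path <- "icgc_dart/strelka-1.1_vcf2maf-1.2/level_3/final_merged_grch37.maf"

# Unanchored: every "maf" is replaced, including the one in "vcf2maf-1.2".
naive <- gsub("maf", "bed", paste0(base_path, maf_path, ".gz"))

# Anchored: only the trailing ".maf" extension is replaced.
anchored <- paste0(base_path, sub("\\.maf$", ".bed", maf_path), ".gz")

print(naive)
print(anchored)
```

The first path points at a directory that does not exist, while the second keeps the directory intact and only changes the file name.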
It has to be a tabix-indexed and bgzipped file (but those can and usually do have .gz suffixes). You should switch to the .gz file if you can confirm it really is compressed with bgzip.
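One way to confirm that a .gz file really is bgzip-compressed is to look at its first bytes: a BGZF block starts with the gzip magic number, has the extra-field flag set, and (as written by bgzip) carries a "BC" subfield identifier right after the fixed header. The sketch below checks only the first block, and is_bgzf is a hypothetical helper name, not part of the existing code. The presence of a .tbi next to the file is also a good hint, since tabix only indexes bgzip-compressed input.

```r
# Return TRUE if the file starts with a BGZF block header.
# is_bgzf is a hypothetical helper name used for illustration.
is_bgzf <- function(path) {
  # raw = TRUE asks R to read the bytes as-is, without checking for compression.
  con <- file(path, open = "rb", raw = TRUE)
  on.exit(close(con))
  header <- readBin(con, what = "raw", n = 16L)
  length(header) >= 16L &&
    # gzip magic (1f 8b), deflate method (08), FEXTRA flag (04)
    identical(header[1:4], as.raw(c(0x1f, 0x8b, 0x08, 0x04))) &&
    # BGZF subfield identifier "BC" at byte offsets 12 and 13
    identical(header[13:14], charToRaw("BC"))
}
```

A plain gzip file fails the check because the extra-field flag is not set, so is_bgzf("final_merged_grch37.bed.gz") returning TRUE would support switching to that file.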
Thanks, I'll do some tests.
When running some tests, I updated the function to read from the .gz bed file and updated the tabix command to not select any specific columns, just to make sure I got everything back. The returned object only seems to have 4 columns (the filtering on region works). Although the column names are taken from the standard maf columns, the values appear to be chromosome, start, end, sample_id, so pretty much the obligatory BED columns.
Here's the head of the returned object:
> head(muts_region)
# A tibble: 6 × 4
  Hugo_Symbol Entrez_Gene_Id Center    NCBI_Build
  <chr>                <int> <chr>     <chr>
1 8                128723209 128723209 OCI-Ly3
2 8                128723217 128723217 BLGSP-71-30-00642-01A-01E
3 8                128723276 128723276 BLGSP-71-17-00357-01B-09E
4 8                128723405 128723405 09-12864T
5 8                128723409 128723409 BLGSP-71-30-00778-01A-01E
6 8                128723410 128723410 02-15630_tumorA
> region
[1] "8:128723128-128774067"
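If the function does end up reading the 4-column bed file, the mislabelled columns could be renamed after the tabix call. This is only a sketch: the toy data frame reuses the first two rows of the output above, and bed_cols is an assumed name for the vector of column labels.

```r
# Toy stand-in for the object returned above (first two rows only).
muts_region <- data.frame(
  Hugo_Symbol = c("8", "8"),
  Entrez_Gene_Id = c(128723209L, 128723217L),
  Center = c("128723209", "128723217"),
  NCBI_Build = c("OCI-Ly3", "BLGSP-71-30-00642-01A-01E"),
  stringsAsFactors = FALSE
)

# Assumed labels for the four bed columns, as described in this thread.
bed_cols <- c("chromosome", "start", "end", "sample_id")

# Guard against silently mislabelling a file with a different layout.
stopifnot(ncol(muts_region) == length(bed_cols))
names(muts_region) <- bed_cols

# The end coordinate was parsed as character under the maf schema; make it integer.
muts_region$end <- as.integer(muts_region$end)
```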
The uncompressed maf file looks fine at a glance, so here's what I am proposing: compress the maf file with bgzip and then index it with tabix. The function can then be updated to read from this file when the mode is set to strelka2. I think this makes sense because in the default mode (slims-3) it reads from a bgzipped and tabix-indexed maf file, not a bed file.
Thoughts?
It will have to be sorted before compressing it, otherwise tabix will refuse to index it. Whether this is worth doing really depends on whether we need more details for these Strelka variants. @lkhilton what do you think? The Strelka MAFs are way larger, so keeping a bgzip-compressed copy will take up some additional space in the GAMBL directories.
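The sorting requirement can be sketched in base R: rows need to be grouped by chromosome and ordered by start position within each chromosome before the file is written out and compressed. The toy table and the sort_maf helper are hypothetical; a real MAF has many more columns and would more likely be sorted on the command line given its size.

```r
# Toy MAF-like table with made-up values, only the columns needed for sorting.
maf <- data.frame(
  Hugo_Symbol = c("MYC", "BCL2", "MYC"),
  Chromosome = c("8", "18", "8"),
  Start_Position = c(128750000L, 60985000L, 128723209L),
  End_Position = c(128750000L, 60985000L, 128723209L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: group rows by chromosome, then order by start position.
sort_maf <- function(maf) {
  maf[order(maf$Chromosome, maf$Start_Position), , drop = FALSE]
}

sorted_maf <- sort_maf(maf)
print(sorted_maf)
```

After writing the sorted table to disk, it would be compressed with bgzip and indexed with tabix, pointing tabix at the chromosome, start, and end columns (its -s, -b, and -e options). The column positions should be checked against the file header rather than assumed.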
I'm pretty ambivalent. I think the current situation, where there's only a minimal tabix-indexed file, is fine, so maybe we just need the path in the config updated and then it's done.