morinlab / GAMBLR

Set of standardized functions to operate with genomic data
https://morinlab.github.io/GAMBLR/
MIT License

Strelka2 support in get_ssm_by_region appears to be broken #202

Closed rdmorin closed 11 months ago

rdmorin commented 1 year ago
> strelka_region_ssm = GAMBLR::get_ssm_by_region(region = "chr11:69455000-69459900", streamlined = FALSE, seq_type="genome", from_indexed_flatfile = T, mode = "strelka")
Error in GAMBLR::get_ssm_by_region(region = "chr11:69455000-69459900",  : 
  You requested results from indexed flatfile. The mode should be set to either slms-3 (default) or strelka2. Please specify one of these modes.
> strelka_region_ssm = GAMBLR::get_ssm_by_region(region = "chr11:69455000-69459900", streamlined = FALSE, seq_type="genome", from_indexed_flatfile = T, mode = "strelka2")
[1] "missing: /projects/nhl_meta_analysis_scratch/gambl/results_local/icgc_dart/strelka-1.1_vcf2maf-1.2/level_3/final_merged_grch37.maf.bgz"
Error in GAMBLR::get_ssm_by_region(region = "chr11:69455000-69459900",  : 
  failed to find the file needed for this
rdmorin commented 1 year ago

@lkhilton noted that the error refers to a file that doesn't exist. The indexed file should exist, and if it doesn't, it should be created. Someone will need to determine whether the function is just looking for a file that now has a different name, in which case the fix may only involve an update to the config or a minor code change.

mattssca commented 1 year ago

Thanks, I will look into this!

mattssca commented 1 year ago

I looked into this, and there are some discrepancies between the name of the file the function looks for (with mode = strelka2) and what's available under:

/projects/nhl_meta_analysis_scratch/gambl/results_local/icgc_dart/strelka-1.1_vcf2maf-1.2/level_3/

The function is looking for the following file:

final_merged_grch37.maf.bgz

and these are the closest relatives to that file:

final_merged_grch37.bed.gz
final_merged_grch37.bed.gz.tbi
final_merged_grch37.maf

Would we want to convert the available .gz files for strelka2 output to .bgz? If not, I could update this function to read from the available .gz file (if mode = "strelka2"), e.g.:

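  if(mode == "slms-3"){
    # default mode: bgzipped, tabix-indexed maf
    full_maf_path_comp = paste0(base_path, maf_path, ".bgz")
  }else if(mode == "strelka2"){
    # strelka2: only a bgzipped, tabix-indexed bed is available
    # (assumes maf_path ends in ".maf"; only the extension is swapped)
    bed_path = sub("\\.maf$", ".bed", maf_path)
    full_maf_path_comp = paste0(base_path, bed_path, ".gz")
  }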
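
One detail worth checking in the path substitution above: the directory name in this thread (strelka-1.1_vcf2maf-1.2) itself contains "maf", so a plain gsub("maf", "bed", ...) would rewrite the directory as well as the extension. A minimal sketch of the difference, assuming the relative path ends in ".maf" (the rel_path value is pieced together from the paths quoted in this thread and is illustrative only):

```r
# Relative path assembled from the directory and file names quoted above.
rel_path = "icgc_dart/strelka-1.1_vcf2maf-1.2/level_3/final_merged_grch37.maf"

# gsub() replaces every occurrence of "maf", including the one in "vcf2maf".
greedy = gsub("maf", "bed", rel_path)

# sub() with an anchored pattern only touches the trailing ".maf" extension.
anchored = sub("\\.maf$", ".bed", rel_path)

print(greedy)
print(anchored)
```

The anchored version leaves the directory untouched, which is what is needed for the file to be found.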
rdmorin commented 1 year ago

It has to be a tabix-indexed and bgzipped file (but those can and usually do have .gz suffixes). You should switch to the .gz file if you can confirm it really is compressed with bgzip.
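
One way to confirm that a .gz file really is bgzip-compressed is to look at its header bytes. A BGZF file starts with the gzip magic bytes 1f 8b 08 04 and carries a "BC" extra subfield. The sketch below is only a rough check: the helper name is_bgzf is hypothetical, and it assumes "BC" is the first extra subfield, which is how bgzip normally writes it.

```r
# Hypothetical helper: TRUE if the file header looks like BGZF (bgzip output).
is_bgzf = function(path){
  con = file(path, "rb")
  on.exit(close(con))
  # Read the first 14 bytes of the (still compressed) file.
  header = readBin(con, what = "raw", n = 14)
  if(length(header) < 14){
    return(FALSE)
  }
  # gzip magic (1f 8b), deflate (08), FEXTRA flag set (04).
  gzip_magic = as.raw(c(0x1f, 0x8b, 0x08, 0x04))
  # "BC" subfield identifier, assumed to be the first extra subfield.
  bc_tag = as.raw(c(0x42, 0x43))
  identical(header[1:4], gzip_magic) && identical(header[13:14], bc_tag)
}
```

A file written by plain gzip does not set the extra-field flag, so it returns FALSE here. Successfully querying the file with tabix is of course the more direct test.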

mattssca commented 1 year ago

Thanks, I'll do some tests.

mattssca commented 1 year ago

While testing, I updated the function to read from the .gz bed file and changed the tabix command so that it no longer selects specific columns, just to make sure I got everything back. Filtering on region works, but the returned object has only 4 columns. The column names are taken from the standard maf columns, so they are misleading: the actual contents appear to be chromosome, start, end and sample_id, so essentially the obligatory BED columns.

Here's the head of the returned object:

> head(muts_region)
# A tibble: 6 × 4
  Hugo_Symbol Entrez_Gene_Id Center    NCBI_Build               
  <chr>                <int> <chr>     <chr>                    
1 8                128723209 128723209 OCI-Ly3                  
2 8                128723217 128723217 BLGSP-71-30-00642-01A-01E
3 8                128723276 128723276 BLGSP-71-17-00357-01B-09E
4 8                128723405 128723405 09-12864T                
5 8                128723409 128723409 BLGSP-71-30-00778-01A-01E
6 8                128723410 128723410 02-15630_tumorA          
> region
[1] "8:128723128-128774067"
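
If the function ends up reading from the bed file, the four columns would need names that match their contents. A minimal sketch, using the first two rows of the output above as toy data; the names chromosome, start, end and sample_id are the ones guessed at in this comment, not names defined by GAMBLR:

```r
# Toy copy of the first two rows shown above, with the mislabeled maf headers.
muts_region = data.frame(
  Hugo_Symbol = c("8", "8"),
  Entrez_Gene_Id = c(128723209L, 128723217L),
  Center = c("128723209", "128723217"),
  NCBI_Build = c("OCI-Ly3", "BLGSP-71-30-00642-01A-01E"),
  stringsAsFactors = FALSE
)

# Assumed column names for the 4-column bed content.
bed_cols = c("chromosome", "start", "end", "sample_id")
names(muts_region) = bed_cols

# The end coordinate was read as character, so convert it back to integer.
muts_region$end = as.integer(muts_region$end)
```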

The uncompressed maf file looks fine at a glance, so here's what I am proposing: compress the maf file with bgzip and then index it with tabix. The function can then be updated to read from this file when the mode is set to strelka2. I think this makes sense, since with the default mode (slms-3) the function reads from a bgzipped and tabix-indexed maf file, not a bed file.

Thoughts?

rdmorin commented 1 year ago

It will have to be sorted before compressing it, otherwise tabix will refuse to index it. Whether this is worth doing really depends on whether we need more details for these Strelka variants. @lkhilton what do you think? The Strelka MAFs are way larger, so keeping a bgzip-compressed copy will take up some additional space in the GAMBL directories.
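
For reference, the sorting step could look roughly like the sketch below. It assumes the maf has been read into a data frame with the standard Chromosome and Start_Position columns; the helper name sort_maf_for_tabix and the toy rows are illustrative only. The sorted table would still have to be written out, compressed with bgzip and indexed with tabix afterwards.

```r
# Hypothetical helper: order maf rows so each chromosome forms one contiguous
# block, with positions ascending inside the block, as tabix expects.
sort_maf_for_tabix = function(maf){
  maf[order(maf$Chromosome, maf$Start_Position), , drop = FALSE]
}

# Toy, unsorted input (coordinates are made up for illustration).
toy_maf = data.frame(
  Chromosome = c("8", "11", "8"),
  Start_Position = c(128723217L, 69455100L, 128723209L),
  Tumor_Sample_Barcode = c("sample_B", "sample_C", "sample_A"),
  stringsAsFactors = FALSE
)

sorted_maf = sort_maf_for_tabix(toy_maf)
print(sorted_maf)
```

For a file as large as the Strelka MAF, sorting on the command line before compressing is probably more practical than loading everything into R.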

lkhilton commented 1 year ago

I'm pretty ambivalent. I think the current situation, where there's only a minimal tabix-indexed file, is fine, so maybe we just need the path in the config updated and then it's done.