morinlab / GAMBLR

Set of standardized functions to operate with genomic data
https://morinlab.github.io/GAMBLR/
MIT License
3 stars 2 forks source link

`get_ssm_by_regions` only returns 5 columns #189

Closed lkhilton closed 1 year ago

lkhilton commented 1 year ago

I'm a bit confused because I thought this was resolved back in #97, but it seems that get_ssm_by_regions still only returns 5 columns of maf data. The logic for handling the streamlined argument still only outputs either 3 columns if streamlined == TRUE or else 5 columns. Should this function output the full maf columns for the specified regions?

https://github.com/morinlab/GAMBLR/blob/eecaa8858921f8e6792949b708455b3727218ba9/R/database.R#L2169-L2178

lkhilton commented 1 year ago

I think this might just be a logical error. I was running it like this:

gambl_maf <- get_ssm_by_regions(
    projection = "hg38",
    regions_bed = regions, 
    streamlined = FALSE, 
    basic_columns = TRUE
) 

I was thinking that basic_columns = FALSE would ask the function to return every column in the maf file. Instead you have to use basic_columns = TRUE to get it to return 45 columns of maf data. I don't think there is a combination of arguments that returns the complete maf becuase there isn't an else to the if (basic_columns) part of get_ssm_by_region.

https://github.com/morinlab/GAMBLR/blob/eecaa8858921f8e6792949b708455b3727218ba9/R/database.R#L2237-L2243

(Also this might be coming in a future PR but the basic columns option isn't documented for this get_ssm_by_regions).

rdmorin commented 1 year ago

Can you clarify if you are interested in all MAF columns or just the first 45? If just the first 45 then I agree this should work.

mattssca commented 1 year ago

So, the proposed solution does not seem to work right off the bat. For some reason, when trying to get all columns returned with maf_columns = names(maf_header) it only returns 10 columns. If I change the selected columns in maf_columns, to say, 1:75 it returns the first 75 columns. But any additional columns do not work, i.e columns 1:76 return 10 columns. I am wondering if this is something that's related to vroom?

However, with some tweaking, I got the desired output. Here, we extract columns 1:75 and then in another step 76:116 and cbind to get the desired output. Might not be the most elegant solution, but seems to work from the initial testing I've done. Remote testing and compatibility with get_ssm_by_regions remain to be looked at. Running this function with basic_columns = TRUE, returns the first 45 columns. If set to FALSE, all 116 MAF columns will be kept.

Let me know if this solution is adequate or we can discuss another route to take.

rdmorin commented 1 year ago

Perhaps it relates to the issue I mentioned in lab meeting. There are inconsistent columns in our merged MAFs towards the end (last 12 or so columns)

rdmorin commented 1 year ago

I'm not sure why the MAF needs to be subset by columns anyway. This seems to be over-complicating it. Aren't we just trying to return the entire MAF (all columns)?

mattssca commented 1 year ago

After having another look at this, with fresh eyes recharged by egg hunts and other easter related tasks. I think I have successfully identified and solved the problem. Indeed, the problem with the existing vroom chunk is most likely related to inconsistencies with the columns in the merged MAF (some lines in the merged MAF only have 104 columns).

Added the delimiter parameter to the vroom chunk seems to have allowed us to get all columns back, as such;

muts_region = vroom::vroom(I(muts), col_types = paste(maf_column_types, collapse = ""), col_names = maf_columns, delim = "\t")

And this returns the number of columns based on the value of basic_columns. If this parameter is set to TRUE (default), the first 45 columns will be returned, and if set to FALSE, all 116 columns will be returned. So no need to go through all that hassle that I initially suggested here. My experience with running tabix is limited, but I think, that you have to specify what columns we want back, at least in the way it's currently set up.

tabix_command = paste(tabix_bin, full_maf_path_comp, region, "| cut -f", paste(maf_indexes, collapse = ","))

So the fix here is to add another statement if basic_columns = FALSE that extracts all MAF columns from maf_header and the related column types. Just like we do for basic_columns = TRUE, but with more columns added (if basic_columns = FALSE)

  if(streamlined){
    maf_columns = names(maf_header)[c(6, 16, 42)]
    maf_column_types = "ici"

  }else if(basic_columns){ #get first 45 columns of the MAF
    maf_columns = names(maf_header)[c(1:45)]
    maf_column_types =  "ciccciiccccccclcccclllllllllllllllccccciiiiii"
  }else{
    maf_columns = names(maf_header) #return all MAF columns (116)
    maf_column_types = "ciccciiccccccclcccclllllllllllllllccccciiiiiiccccccccccccinnccccccccccccccccccclcccccccccnclcncccclncccclllllllllicn"
  }

To back up the previous statement (with some MAFs missing the last 12 columns), if basic_column is set to FALSE, the return gets this warning message

Warning message:           
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)

And if I run problems(test_maf) we get this in return confirming the suspected issue;

# A tibble: 3,928 × 5
     row   col expected    actual      file                            
   <int> <int> <chr>       <chr>       <chr>                           
 1     2   104 116 columns 104 columns /tmp/RtmpTEHzah/file17a266d6bccd
 2    16   104 116 columns 104 columns /tmp/RtmpTEHzah/file17a266d6bccd
 3    18   104 116 columns 104 columns /tmp/RtmpTEHzah/file17a266d6bccd
 4    19   104 116 columns 104 columns /tmp/RtmpTEHzah/file17a266d6bccd
 5    21   104 116 columns 104 columns /tmp/RtmpTEHzah/file17a266d6bccd
 6    27   104 116 columns 104 columns /tmp/RtmpTEHzah/file17a266d6bccd
 7    28   104 116 columns 104 columns /tmp/RtmpTEHzah/file17a266d6bccd
 8    30   104 116 columns 104 columns /tmp/RtmpTEHzah/file17a266d6bccd
 9    32   104 116 columns 104 columns /tmp/RtmpTEHzah/file17a266d6bccd
10    33   104 116 columns 104 columns /tmp/RtmpTEHzah/file17a266d6bccd

I have reverted the code and made necessary (minimal) adjustments to ensure this function is working as intended. I also removed the parameters related to specifying what columns to be returned, since this task can easily be executed by the user on the return of get_ssm_by_region(basic_columns = FALSE).

There was only one other function that used the, now deprecated parameters, prettyRainfallPlot, and the deprecated parameters from get_ssm_by_region have been removed from that function.

> get_ssm_by_region(region = "chr8:128,723,128-128,774,067", basic_columns = TRUE) %>%
+   dim()
[1] 4879   45     

> get_ssm_by_region(region = "chr8:128,723,128-128,774,067", basic_columns = FALSE) %>%
+ dim()
[1] 4879  116                                                                                                                                                                                                                                              
rdmorin commented 1 year ago

Tabix spits out all columns. The cut command is what's pruning the output down. I didn't see the code you changed but if the tabix command is being run differently then maybe the easiest thing to do here is to simplify that command in this case (remove everything starting at | cut...)

mattssca commented 1 year ago

The tabix command has not changed. However, the vroom line has been updated, together with the first few lines that specify what columns the user wants back, depending on what's specified under basic_columns and/or streamlined.

mattssca commented 1 year ago

The tabix command still takes maf_indexes variable, the only difference is that what's in this variable depends on what's specified with basic_columns. If set to FALSE, all 116 MAF columns are returned. If set to TRUE, only the first 45 columns are returned.

mattssca commented 1 year ago

I can confirm that the proposed solution works just as well when running GAMBLR on a remote setup (not gphost).