Closed lkhilton closed 1 year ago
I think this might just be a logical error. I was running it like this:
gambl_maf <- get_ssm_by_regions(
projection = "hg38",
regions_bed = regions,
streamlined = FALSE,
basic_columns = TRUE
)
I was thinking that basic_columns = FALSE
would ask the function to return every column in the maf file. Instead you have to use basic_columns = TRUE
to get it to return 45 columns of maf data. I don't think there is a combination of arguments that returns the complete maf becuase there isn't an else
to the if (basic_columns)
part of get_ssm_by_region
.
(Also this might be coming in a future PR but the basic columns option isn't documented for this get_ssm_by_regions
).
Can you clarify if you are interested in all MAF columns or just the first 45? If just the first 45 then I agree this should work.
So, the proposed solution does not seem to work right off the bat. For some reason, when trying to get all columns returned with maf_columns = names(maf_header)
it only returns 10 columns. If I change the selected columns in maf_columns, to say, 1:75 it returns the first 75 columns. But any additional columns do not work, i.e columns 1:76 return 10 columns. I am wondering if this is something that's related to vroom?
However, with some tweaking, I got the desired output. Here, we extract columns 1:75 and then in another step 76:116 and cbind
to get the desired output. Might not be the most elegant solution, but seems to work from the initial testing I've done. Remote testing and compatibility with get_ssm_by_regions remain to be looked at.
Running this function with basic_columns = TRUE
, returns the first 45 columns. If set to FALSE
, all 116 MAF columns will be kept.
Let me know if this solution is adequate or we can discuss another route to take.
Perhaps it relates to the issue I mentioned in lab meeting. There are inconsistent columns in our merged MAFs towards the end (last 12 or so columns)
I'm not sure why the MAF needs to be subset by columns anyway. This seems to be over-complicating it. Aren't we just trying to return the entire MAF (all columns)?
After having another look at this, with fresh eyes recharged by egg hunts and other easter related tasks. I think I have successfully identified and solved the problem. Indeed, the problem with the existing vroom chunk is most likely related to inconsistencies with the columns in the merged MAF (some lines in the merged MAF only have 104 columns).
Added the delimiter parameter to the vroom
chunk seems to have allowed us to get all columns back, as such;
muts_region = vroom::vroom(I(muts), col_types = paste(maf_column_types, collapse = ""), col_names = maf_columns, delim = "\t")
And this returns the number of columns based on the value of basic_columns
. If this parameter is set to TRUE (default), the first 45 columns will be returned, and if set to FALSE, all 116 columns will be returned. So no need to go through all that hassle that I initially suggested here. My experience with running tabix is limited, but I think, that you have to specify what columns we want back, at least in the way it's currently set up.
tabix_command = paste(tabix_bin, full_maf_path_comp, region, "| cut -f", paste(maf_indexes, collapse = ","))
So the fix here is to add another statement if basic_columns = FALSE
that extracts all MAF columns from maf_header
and the related column types. Just like we do for basic_columns = TRUE
, but with more columns added (if basic_columns = FALSE
)
if(streamlined){
maf_columns = names(maf_header)[c(6, 16, 42)]
maf_column_types = "ici"
}else if(basic_columns){ #get first 45 columns of the MAF
maf_columns = names(maf_header)[c(1:45)]
maf_column_types = "ciccciiccccccclcccclllllllllllllllccccciiiiii"
}else{
maf_columns = names(maf_header) #return all MAF columns (116)
maf_column_types = "ciccciiccccccclcccclllllllllllllllccccciiiiiiccccccccccccinnccccccccccccccccccclcccccccccnclcncccclncccclllllllllicn"
}
To back up the previous statement (with some MAFs missing the last 12 columns), if basic_column
is set to FALSE, the return gets this warning message
Warning message:
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
dat <- vroom(...)
problems(dat)
And if I run problems(test_maf)
we get this in return confirming the suspected issue;
# A tibble: 3,928 × 5
row col expected actual file
<int> <int> <chr> <chr> <chr>
1 2 104 116 columns 104 columns /tmp/RtmpTEHzah/file17a266d6bccd
2 16 104 116 columns 104 columns /tmp/RtmpTEHzah/file17a266d6bccd
3 18 104 116 columns 104 columns /tmp/RtmpTEHzah/file17a266d6bccd
4 19 104 116 columns 104 columns /tmp/RtmpTEHzah/file17a266d6bccd
5 21 104 116 columns 104 columns /tmp/RtmpTEHzah/file17a266d6bccd
6 27 104 116 columns 104 columns /tmp/RtmpTEHzah/file17a266d6bccd
7 28 104 116 columns 104 columns /tmp/RtmpTEHzah/file17a266d6bccd
8 30 104 116 columns 104 columns /tmp/RtmpTEHzah/file17a266d6bccd
9 32 104 116 columns 104 columns /tmp/RtmpTEHzah/file17a266d6bccd
10 33 104 116 columns 104 columns /tmp/RtmpTEHzah/file17a266d6bccd
I have reverted the code and made necessary (minimal) adjustments to ensure this function is working as intended. I also removed the parameters related to specifying what columns to be returned, since this task can easily be executed by the user on the return of get_ssm_by_region(basic_columns = FALSE)
.
There was only one other function that used the, now deprecated parameters, prettyRainfallPlot
, and the deprecated parameters from get_ssm_by_region
have been removed from that function.
> get_ssm_by_region(region = "chr8:128,723,128-128,774,067", basic_columns = TRUE) %>%
+ dim()
[1] 4879 45
> get_ssm_by_region(region = "chr8:128,723,128-128,774,067", basic_columns = FALSE) %>%
+ dim()
[1] 4879 116
Tabix spits out all columns. The cut command is what's pruning the output down. I didn't see the code you changed but if the tabix command is being run differently then maybe the easiest thing to do here is to simplify that command in this case (remove everything starting at | cut...)
The tabix command has not changed. However, the vroom
line has been updated, together with the first few lines that specify what columns the user wants back, depending on what's specified under basic_columns
and/or streamlined
.
The tabix command still takes maf_indexes
variable, the only difference is that what's in this variable depends on what's specified with basic_columns
. If set to FALSE, all 116 MAF columns are returned. If set to TRUE, only the first 45 columns are returned.
I can confirm that the proposed solution works just as well when running GAMBLR on a remote setup (not gphost).
I'm a bit confused because I thought this was resolved back in #97, but it seems that
get_ssm_by_regions
still only returns 5 columns of maf data. The logic for handling thestreamlined
argument still only outputs either 3 columns ifstreamlined == TRUE
or else 5 columns. Should this function output the full maf columns for the specified regions?https://github.com/morinlab/GAMBLR/blob/eecaa8858921f8e6792949b708455b3727218ba9/R/database.R#L2169-L2178