morinlab / GAMBLR

Set of standardized functions to operate with genomic data
https://morinlab.github.io/GAMBLR/
MIT License
3 stars 2 forks source link

region_to_gene throws an error in some cases #246

Closed rdmorin closed 10 months ago

rdmorin commented 1 year ago

When I run region_to_gene I often get an error instead of a result. For example:

region_to_gene(region="chr14:105862865-105863058",genome_build = "hg38")
Error in if (!str_detect(genes$chromosome[1], "chr")) { : 
  missing value where TRUE/FALSE needed
mattssca commented 1 year ago

Thanks for reporting Ryan.

I just had a quick look, and it seems that there are no genes residing inside the specified region (in the example you provided). Could this be the reason that the function "often" fails for you? Regardless, it would be useful to have the function notify the user if there are no genes found within the specified region.

For example, if I expand the region in your example by 2000bp we get this back:

> region_to_gene(region="chr14:105861865-105864058",genome_build = "hg38")
3 gene(s) returned for 1 region(s)
  chromosome     start       end hugo_symbol region_start region_end
1      chr14 105863197 105863258       IGHJ6    105861865  105864058
2      chr14 105863415 105863465      IGHJ3P    105861865  105864058
3      chr14 105863813 105863862       IGHJ5    105861865  105864058

As a note, I have ran in to similar issues when I (mistakenly) provide a region to this function that is in respect to another projection, than what is specified within the call of the region_to_gene function.

rdmorin commented 1 year ago

Yes, that's likely the cause. Do you think it should throw an error (as it's currently doing) or elegantly return an empty data frame instead? The issue I'm having is that calling this function within another function currently requires the call to be wrapped in a tryCatch().

mattssca commented 1 year ago

If we have it return an empty data frame (but also a message so the user would be aware) I would assume that this wouldn't break in a nested environment, like you're describing. So I would vote for this.

rdmorin commented 1 year ago

The messages should only be used in verbose mode though, right? Currently it's already printing out a message for each region the function is given, which leads to a very verbose output if a function calls it repeatedly.

For example, here's the current output (with just the region being printed in the calling function for debugging purposes):

[1] "chr1:92914901-92915003"
1 gene(s) returned for 1 region(s)
[1] "chr5:10400752-10401894"
1 gene(s) returned for 1 region(s)
[1] "chr14:105773847-105859779"
7 gene(s) returned for 1 region(s)
[1] "chr20:17966979-17967114"
2 gene(s) returned for 1 region(s)
[1] "chr22:22322826-22355878"
7 gene(s) returned for 1 region(s)
[1] "chr6:29810026-29904500"
8 gene(s) returned for 1 region(s)
[1] "chr7:70496350-70496767"
1 gene(s) returned for 1 region(s)
[1] "chrX:1087815-1088665"
Here's the original error message:
missing value where TRUE/FALSE needed[1] "chr14:64458111-64461841"
2 gene(s) returned for 1 region(s)
[1] "chr5:79031008-79032539"
1 gene(s) returned for 1 region(s)
[1] "chr6:32654870-32754078"
7 gene(s) returned for 1 region(s)
[1] "chr17:13733571-13734989"
Here's the original error message:
missing value where TRUE/FALSE needed[1] "chr13:88396330-88398128"
Here's the original error message:
missing value where TRUE/FALSE needed[1] "chr15:40050289-40051291"
1 gene(s) returned for 1 region(s)
[1] "chr18:15656295-15702692"
Here's the original error message:
missing value where TRUE/FALSE needed[1] "chr22:31226197-31227014"
1 gene(s) returned for 1 region(s)
[1] "chr1:182114358-182114666"
1 gene(s) returned for 1 region(s)
[1] "chr14:105773233-105858897"
6 gene(s) returned for 1 region(s)
[1] "chr2:88860382-88860480"
1 gene(s) returned for 1 region(s)
[1] "chr5:50456310-50468640"
Here's the original error message:
missing value where TRUE/FALSE needed[1] "chr5:105836364-105839254"
Here's the original error message:
missing value where TRUE/FALSE needed[1] "chr14:105773387-105859809"
7 gene(s) returned for 1 region(s)
[1] "chr17:49796951-49803952"
1 gene(s) returned for 1 region(s)
[1] "chr1:124945634-124955006"
Here's the original error message:
missing value where TRUE/FALSE needed[1] "chr14:105859052-105859507"
1 gene(s) returned for 1 region(s)
[1] "chr12:11380951-11420570"
mattssca commented 1 year ago

Agree! I will provide an update for this function. Thanks for reporting.