Include name field in results table

jeremymsimon commented 1 year ago

Hi- This package is amazing to include in our workflows!

One question: on the web version, GREAT accepts a "name" field as part of the standard BED format: https://great-help.atlassian.net/wiki/spaces/GREAT/pages/655452/File+Formats#FileFormats-Whatshouldmytestregionsfilecontain%3F

If I include this in my BED-like data.frame or GenomicRanges object, it seems to get ignored and lost in the results table. Ideally I want getRegionGeneAssociations(job) to include chr, start, end, and name if it is present. This would be especially helpful if my input is in a particular order, since the results table is re-sorted I can't simply apply names(output) <- names(input)

Is it possible to add this functionality?

Example here:

gr <- randomRegions(genome="hg19",nr=10)
names(gr) <- paste0("region",1:length(gr))

gr
GRanges object with 8 ranges and 0 metadata columns:
          seqnames              ranges strand
             <Rle>           <IRanges>  <Rle>
  region1     chr1     3353130-3359069      *
  region2     chr2   51654683-51660300      *
  region3     chr3 178694412-178697830      *
  region4     chr4   27638693-27641182      *
  region5     chr5   44128965-44136774      *
  region6     chr6   73739346-73744995      *
  region7     chr7 156259319-156268477      *
  region8     chrX   18742270-18745477      *
  -------
  seqinfo: 26 sequences from an unspecified genome; no seqlengths

job <- submitGreatJob(gr,species="hg19",rule="twoClosest", gr_is_zero_based = T)
getRegionGeneAssociations(job)

GRanges object with 8 ranges and 2 metadata columns:
      seqnames              ranges strand | annotated_genes    dist_to_TSS
         <Rle>           <IRanges>  <Rle> | <CharacterList>  <IntegerList>
  [1]     chr1     3353130-3359069      * | ARHGEF16,PRDM16  -14890,370325
  [2]     chr2   51654683-51660300      * |           NRXN1        -397955
  [3]     chr3 178694412-178697830      * |    ZMAT3,KCNMB2   93946,419633
  [4]     chr4   27638693-27641182      * |           STIM2         777475
  [5]     chr5   44128965-44136774      * |       FGF10,NNT  256029,530045
  [6]     chr6   73739346-73744995      * |     KHDC1,KCNQ5  277767,410651
  [7]     chr7 156259319-156268477      * |     SHH,C7orf13 -658931,169450
  [8]     chrX   18742270-18745477      * |     PPEF1,PHKA2   34842,258842
  -------
  seqinfo: 8 sequences from an unspecified genome; no seqlengths

packageVersion("rGREAT")
[1] ‘2.1.3’

Now the same as BED format submitted to web tool:

# input.bed
chr1    3353130 3359069 region1 1   +
chr2    51654683    51660300    region2 1   +
chr3    178694412   178697830   region3 1   +
chr4    27638693    27641182    region4 1   +
chr5    44128965    44136774    region5 1   +
chr6    73739346    73744995    region6 1   +
chr7    156259319   156268477   region7 1   +
chrX    18742270    18745477    region8 1   +

# output_regions.txt
# GREAT version 4.0.4   Species assembly: hg19  Association rule: Two nearest genes: 1000000 bp max extension, curated regulatory domains included
region1 ARHGEF16 (-14890), PRDM16 (+370325)
region2 NRXN1 (-397955)
region3 ZMAT3 (+93946), KCNMB2 (+419633)
region4 STIM2 (+777475)
region5 FGF10 (+256029), NNT (+530045)
region6 KHDC1 (+277767), KCNQ5 (+410651)
region7 SHH (-658931), C7orf13 (+169450)
region8 PPEF1 (+34842), PHKA2 (+258842)

As an additional side note, why am I only getting 8 regions when nr=10?

Thanks!

jokergoo commented 1 year ago

Now the names are kept in the output:

> getRegionGeneAssociations(job)
GRanges object with 7 ranges and 2 metadata columns:
          seqnames              ranges strand | annotated_genes    dist_to_TSS
             <Rle>           <IRanges>  <Rle> | <CharacterList>  <IntegerList>
  region1     chr1 144369564-144373476      * |   NBPF9,PPIAL4B  -440228,-7274
  region3     chr3 146551327-146559750      * |     PLSCR5,ZIC4 -231536,566532
  region4     chr4     8691809-8698816      * |        CPZ,HMX1  100926,178230
  region5     chr5 149675764-149677003      * |     CAMK2A,ARSI     -7192,6132
  region6     chr6   78404287-78413825      * |            MEI4           8681
  region7     chr7 153478599-153486676      * |            DPP6        -267127
  region8     chrX   86523878-86533664      * |           KLHL4        -244046
  -------
  seqinfo: 7 sequences from an unspecified genome; no seqlengths

If the input is a GRanges object with names, the names are kept. If the input is a data frame, you need to specify use_name_column = TRUE in submitGreatJob().

For the second question, the final number of regions is not always the same as what you have set. The reason is I let the proportion of random regions to be the same as the chromosome lengths, which means chromosome 1 has more random genes than chromosome 19.

Please update the package from GitHub.

jeremymsimon commented 1 year ago

Thanks @jokergoo! I'll give this a try. I will close this for now and reopen if I have any issues

jokergoo / rGREAT

Include name field in results table #38