jokergoo / rGREAT

GREAT Analysis - Functional Enrichment on Genomic Regions
https://jokergoo.github.io/rGREAT
Other
81 stars 11 forks source link

How to infer the seqinfo message from the output of rGREAT::plotRegionGeneAssociationGraphs ? #25

Closed peranti closed 2 years ago

peranti commented 3 years ago

Hi @jokergoo

Thanks for the amazing package.

I have a very basic question, which may even be not related to this package functionality. However, I am sure you would be aware of the answer. I couldn't find the reason for the output message on the forums, maybe it is known to be basic information in the Bioconductor world?

The question is about the plotRegionGeneAssociationGraphs output and I have mentioned the parameters used for both the functions below.

#  Command for Job submission
job_chrLoop = submitGreatJob(inpData, species = "hg19", rule = "oneClosest")

# Inspecting the job
job_chrLoop

## Output
Submit time: 2020-10-21 14:36:22 
Version: 4.0.4 
Species: hg19 
Inputs: 256 regions
Background: wholeGenome 
Model: Single nearest gene 
  within 1000 kb
Include curated regulatory domains

Enrichment tables for following ontologies have been downloaded:
  None

Important info to be noticed: Inputs: 256 regions

# Command for gene-region mappings
res_chrLoop = plotRegionGeneAssociationGraphs(job_chrLoop, plot = FALSE)
res_chrLoop

## Output
GRanges object with 256 ranges and 2 metadata columns:
        seqnames              ranges strand |        gene   distTSS
           <Rle>           <IRanges>  <Rle> | <character> <numeric>
    [1]     chr1       842096-842097      * |      SAMD11    -19021
    [2]     chr1       846155-846156      * |      SAMD11    -14962
    [3]     chr1       970060-970061      * |        AGRN     14558
    [4]     chr1       989313-989314      * |      RNF223     20373
    [5]     chr1       993960-993961      * |      RNF223     15726
    ...      ...                 ...    ... .         ...       ...
  [252]     chr1 244240828-244240829      * |      ZBTB18     26244
  [253]     chr1 246669992-246669993      * |       SMYD3       526
  [254]     chr1 249106516-249106517      * |     SH3BP5L     14315
  [255]     chr1 249110601-249110602      * |     SH3BP5L     10230
  [256]     chr1 249119531-249119532      * |     SH3BP5L      1300
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Important info to be noticed: [256] chr1 249119531-249119532 * | SH3BP5L 1300

What does this message: seqinfo: 1 sequence from an unspecified genome; no seqlengths exactly mean? I think this is generated from the main Bioconductor packages and not rGREAT, if I am not wrong, this message is from GenomeInfoDb::Seqinfo.

How to identify this 1 sequence? As far as I checked, the number of input regions and the mappings are matching (i.e., 256). What is the reason behind this information in the output?

Edit:

Input data: It doesn't contain any NA values.

    chr       start       end
    <chr>     <int>     <dbl>
  1 chr1  244094861 244094862
  2 chr1   47070826  47070827
  3 chr1   43770784  43770785
  4 chr1   10730860  10730861
  5 chr1   51703271  51703272

I found the 1 sequence using the following command:

> seqinfo(res_chrLoop)

Seqinfo object with 1 sequence from an unspecified genome; no seqlengths:
  seqnames seqlengths isCircular genome
  chr1             NA         NA   <NA>

However, I am not aware of the reason for this output. Which input resulted in this message?

jokergoo commented 3 years ago

Hi, here in seqinfo, the "sequence" does not mean the regions you have, but the "chromosomes".

peranti commented 3 years ago

Thanks for the reply.

Isn't the input as specified in the package manual?

    chr       start       end
    <chr>     <int>     <dbl>
  1 chr1  244094861 244094862
  2 chr1   47070826  47070827
  3 chr1   43770784  43770785
  4 chr1   10730860  10730861
  5 chr1   51703271  51703272

How could we avoid this warning message?

jokergoo commented 3 years ago

This information is from the GenomicRanges package (or more internally, the GenomeInfoDb package). For the 256 regions you provided, the function only knows there is chr1, but no information of the genome and the full chromosome lengths. That is why it says the object only contains 1 sequence (chr1) and no seqlengths.

Basically, you can just ignore that, in the most analysis, it won't affect anything.

jokergoo commented 3 years ago

If you really want to assign a valid seqinfo for hg19, you can do as:

seqinfo(res_chrLoop) = Seqinfo(genome="hg19")

or

seqinfo(res_chrLoop) = Seqinfo(genome="hg19")[paste0("chr", c(1:22, "X", "Y"))]

Check the manual for Seqinfo() function.