robinweide / GENOVA

GENome Organisation Visual Analytics
GNU General Public License v3.0
68 stars 15 forks source link

how to write insulation score s3 format to bedgraph format #318

Open jiangshan529 opened 1 year ago

jiangshan529 commented 1 year ago

As shown in the title, I would like to get a bedgraph file of insulation score. Thanks!

teunbrand commented 1 year ago

Hi there,

You can export the insulation score as a bedgraph for viewing in IGV as follows. The {rtracklayer} package has several convenient export functions.

For reproducibility purposes, let's presume we have an insulation score object calculated by GENOVA.

library(GENOVA)
library(rtracklayer)

exp <- get_test_data("40k")
ins <- insulation_score(exp)

To export as a bedgraph, we need to convert the data.frame that is inside ins to a GenomicRanges object. Note that we should typically add +1 to starts, as GENOVA tends to use the bedfile convention for intervals (starts at 0) whereas GenomicRanges use 1-based starts.

# Extract data: rename and drop 'bins' column
data <- setNames(
  ins$insula_score[-4],
  c("chrom", "start", "end", "value")
)
data <- data[is.finite(data$value),]

# Convert to GRanges
gr <- with(data, GRanges(chrom, IRanges(start + 1, end)))
score(gr) <- data$value

# Export to bedgraph
export.bedGraph(gr, here::here("my_bedgraph.bedGraph"))

If we want to export to bigwig instead, we need to include the sequence lengths of every chromosome as well. The {rtracklayer} also has a convenient function for this.

# Export to bigwig
gr <- with(
  data, 
  GRanges(
    chrom, 
    IRanges(start + 1, end), 
    seqinfo = SeqinfoForUCSCGenome("hg19")
  ),
)
score(gr) <- data$value
export.bw(gr, here::here("my_bigwig.bw"))
jiangshan529 commented 1 year ago

Hi there,

You can export the insulation score as a bedgraph for viewing in IGV as follows. The {rtracklayer} package has several convenient export functions.

For reproducibility purposes, let's presume we have an insulation score object calculated by GENOVA.

library(GENOVA)
library(rtracklayer)

exp <- get_test_data("40k")
ins <- insulation_score(exp)

To export as a bedgraph, we need to convert the data.frame that is inside ins to a GenomicRanges object. Note that we should typically add +1 to starts, as GENOVA tends to use the bedfile convention for intervals (starts at 0) whereas GenomicRanges use 1-based starts.

# Extract data: rename and drop 'bins' column
data <- setNames(
  ins$insula_score[-4],
  c("chrom", "start", "end", "value")
)
data <- data[is.finite(data$value),]

# Convert to GRanges
gr <- with(data, GRanges(chrom, IRanges(start + 1, end)))
score(gr) <- data$value

# Export to bedgraph
export.bedGraph(gr, here::here("my_bedgraph.bedGraph"))

If we want to export to bigwig instead, we need to include the sequence lengths of every chromosome as well. The {rtracklayer} also has a convenient function for this.

# Export to bigwig
gr <- with(
  data, 
  GRanges(
    chrom, 
    IRanges(start + 1, end), 
    seqinfo = SeqinfoForUCSCGenome("hg19")
  ),
)
score(gr) <- data$value
export.bw(gr, here::here("my_bigwig.bw"))

Hi, this is so wonderful. There's a small issue, when I run

gr <- with(
  data, 
  GRanges(
    chrom, 
    IRanges(start + 1, end), 
    seqinfo = SeqinfoForUCSCGenome("hg38")
  ),
)

it said "Error in .normarg_seqnames2(seqnames, seqinfo) : 'seqnames' contains sequence names with no entries in 'seqinfo'". How to deal with this? Thanks!

teunbrand commented 1 year ago

It might be that your data has chromosome names that don't match the UCSC style for naming chromosomes. For example if your chromosome names are "1", "2", "3" and not "chr1", "chr2", "chr3".

Alternatively, it might be that one of the scaffold names mismatch. Those are the chromosomes with the weird names like "chr22_KN196486v1_alt".

In any case, you can always manually construct a sequence information object manually if you have the lengths of chromosomes and chromosome names. You can do this by using the GenomeInfoDb::Seqinfo() constructor function, and provide that result to the seqinfo argument in the GRanges() call.

It should be constructed such that all unique(data$chrom) are chromosome names in the sequence information object.

jiangshan529 commented 1 year ago

It might be that your data has chromosome names that don't match the UCSC style for naming chromosomes. For example if your chromosome names are "1", "2", "3" and not "chr1", "chr2", "chr3".

Alternatively, it might be that one of the scaffold names mismatch. Those are the chromosomes with the weird names like "chr22_KN196486v1_alt".

In any case, you can always manually construct a sequence information object manually if you have the lengths of chromosomes and chromosome names. You can do this by using the GenomeInfoDb::Seqinfo() constructor function, and provide that result to the seqinfo argument in the GRanges() call.

It should be constructed such that all unique(data$chrom) are chromosome names in the sequence information object.

Yes, thanks! That's right. My chrom is '1' instead of 'chr1'.