seandavi / TargetOsteoAnalysis

https://seandavi.github.io/TargetOsteoAnalysis/

target_huex_se() needs updated target URLs #1

Open danhtruong opened 2 years ago

danhtruong commented 2 years ago

I changed the URLs for the data and metadata files so that they point to the correct locations.

target_huex_se = function() {
  genedat = "https://target-data.nci.nih.gov/Public/OS/gene_expression_array/L3/gene_core_rma_summary_annot.txt"
  sdrf    = "https://target-data.nci.nih.gov/Public/OS/gene_expression_array/METADATA/TARGET_OS_GeneExpressionArray_20160812.sdrf.txt"

  dat = readr::read_tsv(genedat)
  dat2 = suppressWarnings(readr::read_tsv(sdrf))
  sample_map = as.vector(target_usi_to_samplename(dat2[[1]]))
  names(sample_map) = dat2$`Array Data File`

  # Make the assay matrix
  assay_mat = as.matrix(dat[,-c(1:2)])
  rownames(assay_mat) = dat[[1]]
  colnames(assay_mat) = unname(sample_map[match(colnames(assay_mat),names(sample_map))])

  # split transcripts, symbols, and pick most common symbol
  genes = stringr::str_split(dat[[2]],' // ')
  tx_list = lapply(genes,function(g) {
    if(length(g)<2) return(NA)
    return(g[seq(1,length(g),2)])
  })
  symbol_list = lapply(genes,function(g) {
    if(length(g)<2) return(NA)
    return(unique(g[seq(2,length(g),2)]))
  })
  symbol = unlist(lapply(genes,function(g) {
    if(length(g)<2) return(NA)
    tb = sort(table(g[seq(2,length(g),2)]),decreasing = TRUE)
    return(unlist(names(tb)[1]))
  }))

  # clinical/coldata
  cdata = target_load_clinical()
  cdata = as.data.frame(cdata)
  cdata[[1]] = target_usi_to_samplename(cdata[[1]])
  rownames(cdata) = make.unique(cdata[[1]])
  cdata = cdata[colnames(assay_mat),]

  # construct rowdata
  rowdata = S4Vectors::DataFrame(symbol = symbol, tx_list = S4Vectors::SimpleList(tx_list),
                                 symbol_list = S4Vectors::SimpleList(symbol_list),
                                 row.names = dat[[1]])

  return(SummarizedExperiment::SummarizedExperiment(assays = list(exprs = assay_mat), rowData = rowdata, colData = cdata))
}
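
For anyone reading the symbol-picking step in the function above, here is a minimal base-R sketch of the same idea on toy input. The annotation strings (`tx1`, `GENEA`, and so on) and the helper name `pick_symbol` are made up for illustration and are not taken from the TARGET file or the package. It assumes the annotation column alternates transcript and symbol fields separated by ` // `, as the function does.

```r
# Toy annotation strings in a "transcript // symbol // transcript // symbol" layout.
# These values are invented for illustration only.
annot <- c(
  "tx1 // GENEA // tx2 // GENEA // tx3 // GENEB",
  "tx4 // GENEC",
  "---"
)

# fixed = TRUE treats " // " as a literal separator, not a regular expression.
genes <- strsplit(annot, " // ", fixed = TRUE)

# Hypothetical helper: even positions hold gene symbols; return the most
# frequent one, or NA when the entry has no transcript/symbol pair.
pick_symbol <- function(g) {
  if (length(g) < 2) return(NA_character_)
  syms <- g[seq(2, length(g), by = 2)]
  tb <- sort(table(syms), decreasing = TRUE)
  names(tb)[1]
}

symbol <- vapply(genes, pick_symbol, character(1))
print(symbol)
```

Using `vapply` with `character(1)` keeps the result a plain character vector even when some entries are `NA`, which is one way to avoid the `unlist(lapply(...))` pattern; either approach should give the same symbols here.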

Here is the result. I didn't check the other data-loading functions.

target_os <- target_huex_se()

Rows: 22011 Columns: 91
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (1): gene_assignment_final
dbl (90): probeset_id, AE248-HuEx-1_0-st-v2-01-1_(PATKSS-01A-01R).CEL, AE249-HuEx-1_0-st-v2-01-1_(PAUTWB-01A-01R).C...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
* `Material Type` -> `Material Type...3`
* `Term Source REF` -> `Term Source REF...4`
* `Term Source REF` -> `Term Source REF...6`
* `Term Source REF` -> `Term Source REF...8`
* `Material Type` -> `Material Type...11`
* ...
Rows: 180 Columns: 44
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (38): Source Name, Provider, Material Type...3, Term Source REF...4, Characteristics[Organism], Term Source RE...
dbl   (4): Comment[Scanning Station No], Comment[OCG Data Level]...37, Comment[OCG Data Level]...40, Comment[OCG Da...
lgl   (1): Comment[Array Lot No]
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Warning in length.out :
  closing unused connection 5 (ftp://caftpd.nci.nih.gov/pub/OCG-DCC/TARGET/OS/gene_expression_array/L3/gene_core_rma_summary_annot.txt)

seandavi commented 2 years ago

Thx, @danhtruong. Looks like I need to do some cleanup. I'm really glad to see that someone is using this!