morris-lab / CellOracle

This is the alpha version of the CellOracle package
Other
297 stars 51 forks source link

add chicken genome to refgenomes #24

Closed wangmhan closed 4 years ago

wangmhan commented 4 years ago

Hi,

I wonder if it is possible to add chicken genome galGal6 to reference genomes? Although the latest version on Ensembl now is version100, what I used is version97. As we would like to keep consistent with other data from the lab.

It would be also helpful is there is some guideline so that make it possible for me to make reference genomes.

Thank you.

wangmhan commented 4 years ago

Hi,

A following question, as some people ask similar question for yeast, and seems there is a tutorial: https://github.com/morris-lab/CellOracle/blob/master/docs/notebooks/01_ATAC-seq_data_processing/option1_scATAC-seq_data_analysis_with_cicero/misc/make_tss_referenece_from_homer_data-S%20cerevisiae.ipynb.

I just tried it out for chicken reference. I wonder if it can also be generate by GTF file? If yes, then it is easy to make one's own. I tried the following script: library(GenomeInfoDb); library(ensembldb) anno_path <- "/user/references/genomes/ENS_g6/chicken_Gg6_atac_pre/" db <- ensDbFromGtf(gtf=paste0(anno_path,"Gg6_extended_200819.filter.gtf"), organism = "Gallus_gallus", genomeVersion = "GRCg6a", version = 97) edb <- EnsDb(db)

gene.ranges <- genes(edb); length(gene.ranges$gene_id) gene.ranges <- gene.ranges[gene.ranges$gene_biotype == 'protein_coding', ];length(gene.ranges$gene_id) tss.ranges <- data.frame(Chr = seqnames(gene.ranges), Start = 0, End = 0, GeneName = gene.ranges$symbol, DistancetoTSS = "-450", strand = strand(gene.ranges), GeneID = gene.ranges$gene_id, ps.start = start(gene.ranges), ps.end = end(gene.ranges) ) tss.ranges[tss.ranges$strand=="+","Start"]=tss.ranges[tss.ranges$strand=="+","ps.start"]-1100 tss.ranges[tss.ranges$strand=="+","End"]=tss.ranges[tss.ranges$strand=="+","ps.start"] tss.ranges[tss.ranges$strand=="-","Start"]=tss.ranges[tss.ranges$strand=="-","ps.end"] tss.ranges[tss.ranges$strand=="-","End"]=tss.ranges[tss.ranges$strand=="-","ps.end"]+1100

The reason why I would like to use GTF file is that: 1) I used Ensembl as reference, while the one from Homer seems from NCBI. 2) the one from Homer have much less genes(6981) than Ensembl (12511). 3) we changed the GTF file a bit, so it is better to keep consistent.

If it is only TSS, I guess it is fine. If take promoter-TSS into consideration, I don't know roughly add 1100 bp can be good enough or not.

Also I don't know after generate the galGal6_tss_info.bed how to input it in the following step: ma.get_tss_info(peak_str_list=peaks, ref_genome= )

wangmhan commented 4 years ago

Also I found an alternative way to get TSS annotation. The region define as promoter-TSS here I used is (-2000,2000) to TSS.

The logic behind is to give annotation to all peaks, than filter out all promoter-TSS peaks. Then, generate the format the same as peak_file.parquet.

Here is my code: library(ChIPseeker); library(GenomicFeatures); library(org.Gg.eg.db) anno_path <- "/GROUP/references/genomes/ENS_g6/chicken_Gg6_atac_pre/" data_path <- paste0("/GROUP/mapped_data/10x_atac_10042020/",sample.name,"/outs/") txdb = makeTxDbFromGFF(file= paste0(anno_path,"Gg6_extended_200819.filter.gtf"), format = "gtf", organism = "Gallus gallus") peakAnno = annotatePeak(peak=paste0(data_path,"filtered_peak_bc_matrix/peaks.bed"), tssRegion = c(-2000, 2000), TxDb = txdb,annoDb = "org.Gg.eg.db") tssanno.raw = data.frame(peakAnno) tssanno = tssanno.raw[,c("seqnames","start","end","annotation","geneId","SYMBOL")] tssanno = tssanno[grep("Promoter",tssanno$annotation),]; dim(tssanno) table(tssanno$annotation); dim(tssanno[is.na(tssanno$SYMBOL),]) tssanno = tssanno[!is.na(tssanno$SYMBOL),]; dim(tssanno) tss.anno=data.frame(peakid=paste0(tssanno$seqnames,"",tssanno$start,"_",tssanno$end), gene_short_name=tssanno$SYMBOL) tss.anno[1:3,] write.csv(x = tss.anno, file = paste0("/scicore/home/tschoppp/wang0007/scATAC/procss_scATAC/cellOracle", "/peak_file.parquet"))

KenjiKamimoto-ac commented 4 years ago

Hi wangmhan,

Thank you for questions and suggestions.

(1) Unfortunately, the current version of celloracle does not support chicken genome.

It is because we are using homer database to generate promoter-tss annotation and default homer annotation database does not include chicken annotation data. Available species are below. Human (hg18, hg19, hg38), Mouse (mm8, mm9, mm10), Rat (rn4, rn5, rn6), Frog (xenTro2, xenTro3), Zebrafish (danRer7), Drosophila (dm3), C elegans (ce6, ce10), S. cerevisiae (sacCer2, sacCer3), pombe (ASM294v1), Arabidopsis (tair10), Rice (msu6).

(2) Your suggestions for custom promoter-tss is great! Although currently celloracle is not designed to use custom promoter-tss reference, we will definitely develop celloracle to support your method in the near future. Thank you very much for your suggestions.

wangmhan commented 4 years ago

Hi Kenji,

Hope the develop celloracle would come soon! Thank you for the positive response.

best regards, Menghan

On Thu, 9 Jul 2020 at 17:06, KenjiKamimoto-wustl122 < notifications@github.com> wrote:

Hi wangmhan,

Thank you for questions and suggestions.

(1) Unfortunately, the current version of celloracle does not support chicken genome.

It is because we are using homer database to generate promoter-tss annotation and default homer annotation database does not include chicken annotation data. Available species are below. Human (hg18, hg19, hg38), Mouse (mm8, mm9, mm10), Rat (rn4, rn5, rn6), Frog (xenTro2, xenTro3), Zebrafish (danRer7), Drosophila (dm3), C elegans (ce6, ce10), S. cerevisiae (sacCer2, sacCer3), pombe (ASM294v1), Arabidopsis (tair10), Rice (msu6).

(2) Your suggestions for custom promoter-tss is great! Although currently celloracle is not designed to use custom promoter-tss reference, we will definitely develop celloracle to support your method in the near future. Thank you very much for your suggestions.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/morris-lab/CellOracle/issues/24#issuecomment-656183166, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJVYTVU3NYO3KNPVJWMZEBDR2XMG3ANCNFSM4OSYMRHQ .