morris-lab / CellOracle

This is the alpha version of the CellOracle package
Other
309 stars 56 forks source link

Request of axolotl genome #86

Open lengfei5 opened 2 years ago

lengfei5 commented 2 years ago

Hi there,

Appreciate the nice package, Could you please add the axolotl genome, one of the most amazing regenetive animals ? Both the genome assemble (AmexG_v6.0-DD) the gene annotation (AmexT_v47) can be found here: https://www.axolotl-omics.org/assemblies

Many thanks. Best, Jingkui

KenjiKamimoto-ac commented 2 years ago

@lengfei5

Sorry for the slow response! It is a great suggestion to add the Axolotol data to the celloracle package. Celloracle is using genomepy to convert a bed file to a fasta file, and the TSS annotation usually uses information obtained from Ensembl or homer database. If Axolotl data is not available through genomepy or Ensembl, we need to create some new functions to handle Axolotol data.

It may take some time, but I will try it.

lengfei5 commented 2 years ago

Many thanks @KenjiKamimoto-wustl122 If you just let me a list of files you need to put the axolotl in the CellOracle, I can share those files that I have already.

KenjiKamimoto-ac commented 2 years ago

@lengfei5

For the celloracle analysis, we need to get information about transcription start site (TSS) for every gene, and make pairs between TSS location in reference genome and gene symbol.

This is the example.

スクリーンショット 2022-07-15 午後2 40 33

I'm now attempting to create such data pairs for Axolotl reference genome. I'm now looking into the genome data (AmexG_v6.0-DD) and the gene annotation (AmexT_v47), and I got confused by its format.

In the annotation, there are mutiple gene symbols with [nr] or [hs] mark in the AmexT_v47 file. The below is an example.

chr10p ambMex60DD gene 313039 315424 1000 + . gene_id "AMEX60DD000001"; gene_name "ZFP37 [nr]|ZNF568 [hs]";

Sometimes it have single gene symbol assigned. chr10p ambMex60DD CDS 11068229 11068296 1000 - . gene_id "AMEX60DD000033"; transcript_id "GLT1D1|AMEX60DD301000033.1";

Which is the correct gene symbol should I use? Please teach me the meaning of the [nr] or [hs] mark?

lengfei5 commented 2 years ago

Dear @KenjiKamimoto-wustl122

Thank you for working on our genome and also apologize for the confusion. I would like to share the gene information file I have prepared. amexT_v47_hoxPatch.txt

lengfei5 commented 2 years ago

Actually our genome was just assembled a couple of years ago and the gene annotation is still far from perfect. In the table I have sent, you can find the gene ID with 'AMEX', and also gene symbols annotated by blasting axolotl transcripts to human protein (hs) or to non-redundant vertebrate protein (nr) from NCBI. That is why you saw the gene name with 'hs' or 'nr'.

In the table I shared, you will find many genes were still missing gene symbols.

Hope this is helpful and please let me know if you have any other questions.

lengfei5 commented 1 year ago

Dear @KenjiKamimoto-wustl122

Congrats to you for your paper finally out !! Please let me know if there is anything I can help to put our axolotl genome in the amazing cellOracle.

Many thanks in advance.

Best,

KenjiKamimoto-ac commented 1 year ago

@lengfei5

Thank you so much! And I'm so sorry for taking time! I will find time and do it by the end of the month! Sorry for the inconvenience!

lengfei5 commented 1 year ago

Dear @KenjiKamimoto-wustl122 No worries and thank you in advance for working on our axolotl genome. Just let me know if there is anything I can help.

Best, Jingkui

KenjiKamimoto-ac commented 1 year ago

@lengfei5

I'm sorry for the slow response! Finally, I could find enough time to make data for TSS annotation, which is required for CellOracle Axolotl analysis. We also need to make TF binding motif data. In most cases, we use the CIS-BP dataset to make motif data, and I was expecting that CIS-BP has Axolotl data. But now I realize they don't have it. Do you know about the TF binding motif database for Axolotl? We need a database that includes 1) TF binding Motif data and 2) TF gene symbol data that bind to the motif. I cannot do CellOracle analysis without this data.

lengfei5 commented 1 year ago

Dear @KenjiKamimoto-wustl122 ,

Thank you for your updating and your effort, and I am looking forwards so much.

Unfortunately there is no axolotl-specific motif database in CIS-BP.

1) The current best practice is to use JASPAR version 2022 CORE motifs of vertebrates https://jaspar.uio.no/search?q=&collection=CORE&tax_group=vertebrates

2) The axolotl TFs that binding to the motif can be associated by matching the gene symbols. In particular, the gene symbols from JASPAR should be uppercase first before matching to axolotl gene symbols, because their naming is not uniformly uppercase.

Many thanks and let me know if you have any other questions.

KenjiKamimoto-ac commented 1 year ago

@lengfei5 Got it! Thank you so much for your help! I will use JASPAR 2022 vertebrates data and I will incorporate everything. I estimate I can do that and update CellOracle by Friday. Thank you for your patience and help again.

lengfei5 commented 1 year ago

Dear @KenjiKamimoto-wustl122 ,

Soooooo exciting to hear that and many people in Tanaka group are really looking forwards to it. Many thanks again.

KenjiKamimoto-ac commented 1 year ago

@lengfei5

I'm sorry, but let me ask one more question. I'm testing new functions using actual Axolotl scRNA-seq data to make sure the new data and function for axolotl is correct.

For example, I downloaded one scRNA-seq data here for the test. https://zenodo.org/record/6390083#.ZAD8UC16NhE When I looked at gene names, I found genes like this.

スクリーンショット 2023-03-02 午後1 38 35

I tried to check that the gene id (or gene name) matches TSS data. It seems there is a gene_id discrepancy between the scRNA-seq and the TSS. For example, gene id in the scRNA-seq data, such as "AMEX60DD20~" is not found in amexT_v47_hoxPatch.txt nor gtf file in https://www.axolotl-omics.org/dl/AmexT_v47-AmexG_v6.0-DD.gtf.gz

Also, it seems the gene_id name in amexT_v47_hoxPatch.txt and gtf file in https://www.axolotl-omics.org/dl/AmexT_v47-AmexG_v6.0-DD.gtf.gz is not same.

Which version should I use? And If possible, can you share some scRNA-seq data to test the celloracle function?

Or, I think it is also possible to just ignore the unannotated gene, such as "AMEX60....", and just focus on the gene with the gene name. In that case, I will just make celloracle codes focusing on them, and the gene id version does not matter. It is easy. But in this case, we cannot make a GRN edge between unannotated genes. Do you think it is OK?

I'm sorry for keep asking questions. I'm not familiar with Axolotl gene analysis customs. I would appreciate you could give me instructions about that.

lengfei5 commented 1 year ago

Dear @KenjiKamimoto-wustl122 ,

I am glad that you shared the issue with me and I am sorry that you have to go through our confusing axolotl annotation, which is far from being optimized.

1) I have double checked the file (https://www.axolotl-omics.org/dl/AmexT_v47-AmexG_v6.0-DD.gtf.gz) you are using is the latest version of GTF file and there is no issue for this file.

The issue you have noticed was for many gene IDs, there are multiple gene symbols in the columns of gene_name and/or transcript_id.

In addition this GTF, transcript ids and gene ids that were annotated from contigs (previous version of genome assembly) were still kept together with the those annotated in the scaffolds.

2) I have double checked the amexT_v47_hoxPatch.txt which only only genes () in the scaffolds, whereas genes from contigs were discarded, because nowadays, people should map their data directly to the scaffold. So the gene_id in amexT_v47_hoxPatch should be a subset of gene_id in AmexT_v47-AmexG_v6.0-DD.gtf.gz.

One convention with axolotl genes is that we use the gene symbols if there are, otherwise we used gene.id. I just realized that probably I should do that for the TSS file, In addition, I just gave one TSS for each gene in amexT_v47_hoxPatch.txt; however, there are usually multiple transcripts, potentially multiple TSSs, should I extend the file of amexT_v47_hoxPatch.txt like that ?

3) I have quickly checked the "AMEX60DD20~" that seems to be transcript ID, instead of gene ID, could you tell me which rds file you are testing ? so that I can check it ?

I agree with you that without unannotated genes, we will lose lots of TF targets, because we have only ~15k gene symbols in total.

Many thanks again for your effort and we all appreciate it.

Best, Jingkui

lengfei5 commented 1 year ago

Dear @KenjiKamimoto-wustl122 ,

I am not if you have time to cope the axolotl gene annotation problem. But please let me know if there is anything I can help.