kunwang34 / PhyloVelo

PhyloVelo, Phylogeny-based transcriptomic velocity of single cells
https://phylovelo.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

C. elegans AB Lineage Using Pseudoembryo0 Data #12

Closed AvitalRodov closed 6 months ago

AvitalRodov commented 6 months ago

Hello! First, thank you for providing PhyloVelo! I would like to reproduce the phylogenetic tree of the C. elegans AB lineage from the pseudoembryo0 data before applying PhyloVelo (as depicted in figure 3a of the paper https://www.nature.com/articles/s41587-023-01887-5). I noticed the Trie class in elegans_util.py, but I couldn't find any usage examples or work out how to convert it to Newick format. Could you please explain how to reproduce a lineage tree in Newick format for the cells in pseudoembryo0?

Thank you in advance!

AvitalRodov commented 6 months ago

notebooks/Embryo1_all.ipynb, Embryo2_all.ipynb and Embryo3_all.ipynb reference trees/embryo_all.newick, located under data_path = '/data3/wangkun/phylovelo_datasets/embryo/'. Would it be possible to share these files along with the matching counts and metadata? :)

kunwang34 commented 6 months ago

Hi @AvitalRodov, for the lineage tree of C. elegans, you can build it with the Bio.Phylo.BaseTree module in the Biopython package (from Bio import Phylo). The naming convention for cells in the C. elegans dataset is that each cell inherits the name of its parent node and appends one letter. Therefore, you can create a Clade as the root and then add child nodes based on the cell names. For example, if you have three cells named aab, aba, and abb, you can first create a root clade, root = Phylo.BaseTree.Clade(name='a'), then add two child nodes, c1 = Phylo.BaseTree.Clade(name='aa') and c2 = Phylo.BaseTree.Clade(name='ab'), with root.clades = [c1, c2]. Similarly, add a child node named 'aab' to c1, and child nodes named 'aba' and 'abb' to c2. Finally, export the tree in Newick format using Phylo.write(root, save_path, format='newick').
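
The prefix-based construction described above can be sketched in plain Python, which makes the parent/child logic easy to check before building Clade objects. This is only an illustration under the stated naming convention: the function names build_children, to_newick, and lineage_newick are hypothetical and are not part of PhyloVelo or Biopython, and intermediate ancestors that are missing from the input are created automatically.

```python
def build_children(cell_names, root):
    """Map each node name to the set of its child names.

    Assumes every cell name is its parent's name plus one letter,
    so all ancestors of a cell are simply prefixes of its name.
    """
    children = {}
    for name in cell_names:
        if not name.startswith(root):
            raise ValueError(f"{name!r} does not descend from {root!r}")
        # Walk from the root down to the cell, one letter at a time.
        for end in range(len(root) + 1, len(name) + 1):
            parent, child = name[:end - 1], name[:end]
            children.setdefault(parent, set()).add(child)
    return children


def to_newick(node, children):
    """Serialize the subtree under `node` as a Newick fragment."""
    kids = sorted(children.get(node, ()))
    if not kids:
        return node  # leaf: just the label
    inner = ",".join(to_newick(kid, children) for kid in kids)
    return f"({inner}){node}"  # internal nodes keep their label


def lineage_newick(cell_names, root):
    """Return a labelled Newick string for the given cell names."""
    return to_newick(root, build_children(cell_names, root)) + ";"


# The three-cell example from the comment above.
newick = lineage_newick(["aab", "aba", "abb"], root="a")
print(newick)  # ((aab)aa,(aba,abb)ab)a;
```

The same children mapping can drive the Biopython version: create one Clade per key, attach the children via clade.clades, and write the root with Phylo.write.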

For the mouse embryonic development data, you can obtain normalized data from the original study (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE117542), which is also suitable for PhyloVelo. Alternatively, you can follow the original study (https://www.nature.com/articles/s41586-019-1184-5#Sec12) and process the RNA-seq data with cellranger to obtain the raw read counts we used in the notebook.

AvitalRodov commented 6 months ago

Hi @kunwang34, Thank you so much for the detailed and fast response! I understand your explanation about the cell lineage used to create the tree. However, in the pseudoembryo0 example, I encountered some discrepancies when trying to construct a tree from the provided data.

In pseudoembryo0, there are two cells at cell generation 5: {'TAAGAGATCATGCATG-r17': 'ABaxx', 'GACGGCTTCACATAGC-r17': 'ABpxp'}. However, at cell generation 6, the lineages are as follows:

{'CGATGGCGTAGTGAAT-b01': 'ABarpx',
 'CGAGCCAGTGCAACGA-r17': 'ABpxax',
 'GAAATGATCACGATGT-r17': 'ABpxpa',
 'ACGTCAAAGTGGAGTC-b01': 'ABpxpp'}

I can't understand how 'CGAGCCAGTGCAACGA-r17' with the lineage 'ABpxax' can be considered a child of 'TAAGAGATCATGCATG-r17' with the lineage 'ABaxx'.

Additionally, the next generation contains 16 cells instead of the expected 8, and their lineages do not always match those provided for generation 6.

Could you please provide further clarification on this? Thank you very much for your help!

kunwang34 commented 6 months ago

The scRNA-seq datasets for C. elegans represent a composite of cells from multiple individuals. As such, the pseudoembryo0 is an artificial construct that combines cells with distinct lineages from these varied datasets. It is crucial to recognize that the cells within a pseudoembryo do not all originate from a single organism; rather, they are randomly sampled from the collective pool of data.

Occasionally, some scRNA-seq data may be incomplete or missing due to experimental limitations. For example, the lineage ‘ABpxax’ appears to be a descendant of ‘ABpxa’, yet ‘ABpxa’ is absent from the scRNA-seq data. This type of discrepancy can also explain the unexpected variance in cell numbers across generations.

If your research aims to investigate the relationship between cell generation and gene expression, akin to the approach used in PhyloVelo, these inconsistencies may not significantly impact your analysis. However, if your goal is to reconstruct a complete cell lineage, further filtration of the cells may be necessary to account for the missing data.
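
As a rough illustration of the filtering mentioned above, the sketch below separates generation-6 cells whose immediate parent lineage is present in generation 5 from those whose parent is missing, using the barcodes quoted earlier in this thread. The helper name split_by_parent is hypothetical, and the comparison treats lineage names as literal strings: it does not try to interpret the 'x' characters, which may matter for real data.

```python
# Cells quoted earlier in this thread (barcode -> lineage name).
gen5 = {
    "TAAGAGATCATGCATG-r17": "ABaxx",
    "GACGGCTTCACATAGC-r17": "ABpxp",
}
gen6 = {
    "CGATGGCGTAGTGAAT-b01": "ABarpx",
    "CGAGCCAGTGCAACGA-r17": "ABpxax",
    "GAAATGATCACGATGT-r17": "ABpxpa",
    "ACGTCAAAGTGGAGTC-b01": "ABpxpp",
}


def split_by_parent(children, parents):
    """Split `children` into cells with and without a sampled parent.

    A cell's parent lineage is taken to be its own lineage name
    minus the last letter (literal string match, no wildcards).
    """
    parent_lineages = set(parents.values())
    kept, orphaned = {}, {}
    for barcode, lineage in children.items():
        target = kept if lineage[:-1] in parent_lineages else orphaned
        target[barcode] = lineage
    return kept, orphaned


kept, orphaned = split_by_parent(gen6, gen5)
print("parent present:", sorted(kept.values()))
print("parent missing:", sorted(orphaned.values()))
```

Here 'ABpxpa' and 'ABpxpp' are kept because 'ABpxp' was sampled, while 'ABpxax' is flagged because 'ABpxa' is absent, matching the example given above. Whether to drop flagged cells or insert placeholder ancestors depends on the analysis.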

AvitalRodov commented 6 months ago

Thank you very much!