How to access to trajectory and expression data for benchmarking

nathanmaulding commented 1 year ago

Thanks for the thorough tutorial! I've been developing a tool in a parallel space and I'm interested in accessing some of the data to use for benchmarking. Specifically, the expression data, cell's trajectory branch assignments, and the full dictys results for the same data.

I'm sure the data may be formatted differently than described, but pointing me in the right direction would help a lot! Great work! Thanks in advance!

lingfeiwang commented 1 year ago

Thank you for the interest! Could you let us know exactly what data you are looking for? For example, which figure or text in our publication would you quote?

nathanmaulding commented 1 year ago

I appreciate your quick response! For example, figure 5 in the paper, which I think corresponds to the analysis-blood tutorial. So the STREAM trajectory information on the cells and their lineage, the rna expression matrix that corresponds to those cells, and the full Dictys results for this dataset (not just the visualizations).

lingfeiwang commented 1 year ago

Sure. The data is available at https://zenodo.org/record/7226859. After downloading and decompressing analysis-blood.tar.xz, you can load the network object with d=dictys.net.dynamic_network.from_file('data/dynamic.h5').

Dictys inputs you requested:

d.traj: trajectory information. Containing branching and end points and their edges in the cell lineage.
d.point: cell (d.point['c']) and state (d.point['s']) information. Contains the edge and location on the edge of each cell and state. Each state is an intermediate, branching, or ending point on the trajectory where a static network is inferred from data of its nearby cells.
d.prop['sc']['w']: binary assignment of whether data of each cell is used to infer static network for each state.
d.?name: Names of each cell, node (gene), or state. Applicable to all other variables containing these axes. ?=c,n,s.
d.prop['nc']['readcount']: Expression read count matrix.

Dictys outputs you requested:

d.nids: Node (gene) indices in d.nname for candidate regulator TFs (d.nids[0]) and candidate target genes (d.nids[1]). Defines the dimensions in d.prop['es']['w'] below.
d.prop['es']['w']: GRN inferred for each state.

You can find the class definitions at:

Let me know if you have any further questions.

lingfeiwang commented 1 year ago

And here are the Dictys input of cell types and low dimensional coordinates:

d.prop['c']

nathanmaulding commented 1 year ago

Thanks! Is there also a list of genes that correspond to the ids I'm seeing in the expression readcount matrix and in d.nids? For the GRN inferred state in d.prop['es']['w'] the shape is (325, 10006, 143) does that correspond to (Regulator, targets, GRN at given time/state)? I'm also having trouble seeing how the trajectory object would correspond to pseudotime assignments for each cell? Thanks so much!

lingfeiwang commented 1 year ago

Sure. The gene/cell/state names are in d.?name with ?=n/c/s. See https://github.com/pinellolab/dictys/blob/6e0e003a3272963337736ed976428b95edfc376a/src/dictys/net/__init__.py#L27

For the GRN inferred state in d.prop['es']['w'] the shape is (325, 10006, 143) does that correspond to (Regulator, targets, GRN at given time/state)?

Yes. You can infer the meaning of each dimension by their length.

I'm also having trouble seeing how the trajectory object would correspond to pseudotime assignments for each cell?

Pseudotime is the distance from the differentiation starting point to the current point along the trajectory, and needs to be computed. You need to input the index of the starting point during trajectory inference, which is 0 here. Then you can do a minus operation to get the distance from the starting point to any point:

start=0
distance=d.traj.topoint()[[start]]-d.point['c']

See https://github.com/pinellolab/dictys/blob/6e0e003a3272963337736ed976428b95edfc376a/src/dictys/traj.py#L583

Note that the above steps only output raw GRNs at each point, but not the smoothed ones used in our analysis. If you just want the final smoothed GRNs from the start to the end of each differentiation path, you can export them to a folder. See the bottom of https://github.com/pinellolab/dictys/blob/master/doc/tutorials/analysis-blood/notebooks/dynamic/main.ipynb.

nathanmaulding commented 1 year ago

Using the final smoothed GRNs from the bottom of the https://github.com/pinellolab/dictys/blob/master/doc/tutorials/analysis-blood/notebooks/dynamic/main.ipynb notebook, how is the best way to access the top results predicted by Dictys? Or is there a better way altogether to access the top TF-target relationships predicted?

lingfeiwang commented 1 year ago

The best way is to export the dynamic GRN to folder at the bottom of the notebook. You can then access and analyze the whole GRN including the top TF-target relationships with any programming language you want.

nathanmaulding commented 1 year ago

I see. From d0.prop['c']['type'] it looks like these are the cell counts

(array(['B', 'CLP', 'Erythroid', 'GMP', 'Mono', 'PreB', 'Progenitor'], array([ 661, 1272, 2091, 3137, 2704, 709, 3680]))

And in the page in looks like B, erythroid, and Mono cells were used. Meaning that the other remaining cells, like progenitor for example. Aren't used in the analysis, correct? It looks that way from figure 5a too...

lingfeiwang commented 1 year ago

Actually all cell types were used in the analysis. The saved notebook only included the final frame for each video to keep it slim. I suggest to actually run the tutorial to see how it works.

nathanmaulding commented 1 year ago

Yeah I've gone through the tutorial. Only Mono, Erythroid, and B cells were mentioned in that. But maybe that is just to demonstrate what could be done on any of the cell types used.

lingfeiwang commented 1 year ago

If you generated and watched the videos, you should be able to see how the selected cells move continuously from progenitors to each of the three lineages through intermediate cell types.

nathanmaulding commented 1 year ago

How should I interpret the GRNs output from the end of the tutorial. I selected 10 GRN's, which I see. The scores have some negatives and positives, is this referring to the inhibition or activation? or something else? How would a strong relationship be defined?

I'm going to score how many annotated relationships the tool finds to measure its accuracy vs others. Should I use the absolute value as the level of confidence the algorithm places on the link?

lingfeiwang commented 1 year ago

Yes. We used absolute values for comparison.

pinellolab / dictys

How to access to trajectory and expression data for benchmarking #29