theislab / graph_abstraction

Generate cellular maps of differentiation manifolds with complex topologies.
https://github.com/theislab/paga
MIT License

minimal_examples not working (osx, both command line and jupyter notebook) #5

Closed - ktpolanski closed this issue 6 years ago

ktpolanski commented 6 years ago

Hello,

I've set up scanpy and tried to reproduce the minimal_examples notebook to begin with. This did not go well from the command line - sc.settings.set_figure_params(dpi=80) resulted in Python printing an In: prompt and refusing to cooperate, forcing a terminal restart. I moved over to the notebook, and the settings issue went away. However, the very next code block yielded a strange error:

    assuming last comment line stores variable names
    assuming first column in file stores row names
    read data into list of lists (0:00:15.332)
    constructed array from list of list (0:00:00.001)
... writing an h5 cache file to speedup reading next time
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-7-913ba7cbc7ec> in <module>()
----> 1 adata_krumsiek11 = sc.datasets.krumsiek11()
      2 X_blobs = sc.datasets.blobs(cluster_std=0.5, n_centers=2).X
      3 X_concatenated = np.r_[adata_krumsiek11.X, X_blobs]
      4 adata = sc.AnnData(X_concatenated)
      5 adata.var_names = adata_krumsiek11.var_names

~/anaconda3/envs/orig/lib/python3.5/site-packages/scanpy-0.2.9.1+45.g4c64b30.dirty-py3.5-macosx-10.6-x86_64.egg/scanpy/datasets/builtin.py in krumsiek11()
     76                   'by running `sc.tl.sim("krumsiek11")`'
     77                   .format(filename))
---> 78     adata = sc.read(filename, first_column_names=True, cache=True)
     79     adata.uns['iroot'] = 0
     80     fate_labels = {0: 'progenitor', 159: 'monocyte', 319: 'erythrocyte',

~/anaconda3/envs/orig/lib/python3.5/site-packages/scanpy-0.2.9.1+45.g4c64b30.dirty-py3.5-macosx-10.6-x86_64.egg/scanpy/readwrite.py in read(filename, sheet, ext, delimiter, first_column_names, backup_url, return_dict, cache)
     78                       first_column_names, backup_url, cache)
     79         if isinstance(data, dict):
---> 80             return data if return_dict else AnnData(data)
     81         elif isinstance(data, AnnData):
     82             return data._to_dict_fixed_width_arrays() if return_dict else data

~/anaconda3/envs/orig/lib/python3.5/site-packages/anndata-0.3.0.3-py3.5.egg/anndata/anndata.py in __init__(self, data, smp, var, uns, smpm, varm, dtype, single_col)
    298                 raise ValueError(
    299                     'If `data` is a dict no further arguments must be provided.')
--> 300             data, smp, var, uns, smpm, varm = self._from_dict(data)
    301 
    302         # check data type of data

~/anaconda3/envs/orig/lib/python3.5/site-packages/anndata-0.3.0.3-py3.5.egg/anndata/anndata.py in _from_dict(self, ddata)
    880         uns = ddata
    881 
--> 882         return data, smp, var, uns, smpm, varm
    883 
    884     def _clean_up_old_format(self, uns):

UnboundLocalError: local variable 'data' referenced before assignment

In spite of the error, the code appeared to do something - a directory called write appeared in my working directory (/Users/kp9), with a deeply nested set of folders leading to an .h5 file (/Users/kp9/write/data/Users/kp9/anaconda3/envs/orig/lib/python3.5/site-packages/scanpy-0.2.9.1+45.g4c64b30.dirty-py3.5-macosx-10.6-x86_64.egg/scanpy/datasets/krumsiek11.h5). For some reason, once that file got parked there, re-running the failed code block worked. Strange.

After this, the example works until it's time to plot the abstracted graph:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-b5f82ec1cee3> in <module>()
      3                 title='single-cell graph', save=True,
      4                 root=[7, 10, 11], layout='rt',
----> 5                 title_graph='abstracted graph')

~/anaconda3/envs/orig/lib/python3.5/site-packages/scanpy-0.2.9.1+45.g4c64b30.dirty-py3.5-macosx-10.6-x86_64.egg/scanpy/plotting/tools.py in aga(adata, basis, color, alpha, groups, components, projection, legend_loc, legend_fontsize, legend_fontweight, color_map, palette, size, title, right_margin, left_margin, show, save, ext, title_graph, groups_graph, color_graph, **aga_graph_params)
    472                 ax=axs[0],
    473                 show=False,
--> 474                 save=False)
    475     aga_graph(adata, ax=axs[1], show=False, save=False, title=title_graph,
    476               groups=groups_graph, color=color_graph, **aga_graph_params)

~/anaconda3/envs/orig/lib/python3.5/site-packages/scanpy-0.2.9.1+45.g4c64b30.dirty-py3.5-macosx-10.6-x86_64.egg/scanpy/plotting/tools.py in aga_scatter(adata, basis, color, alpha, groups, components, projection, legend_loc, legend_fontsize, legend_fontweight, color_map, palette, size, title, right_margin, show, save, ax)
    555                  title=title,
    556                  ax=ax,
--> 557                  show=False)
    558     utils.savefig_or_show('aga_' + basis, show=show, save=save)
    559     return ax

~/anaconda3/envs/orig/lib/python3.5/site-packages/scanpy-0.2.9.1+45.g4c64b30.dirty-py3.5-macosx-10.6-x86_64.egg/scanpy/plotting/ann_data.py in scatter(adata, x, y, color, alpha, basis, groups, components, projection, legend_loc, legend_fontsize, legend_fontweight, color_map, palette, right_margin, left_margin, size, title, show, save, ax)
    178                                  + ' specify valid sample annotation, one of '
    179                                  + str(adata.smp_keys()) + ' or a gene name '
--> 180                                  + str(adata.var_names))
    181             colorbars.append(True if continuous else False)
    182         if categorical: categoricals.append(ikey)

ValueError: "aga_groups" is invalid! specify valid sample annotation, one of ['X_diffmap0', 'louvain_groups', 'aga_pseudotime'] or a gene name ['Gata2' 'Gata1' 'Fog1' 'EKLF' 'Fli1' 'SCL' 'Cebpa' 'Pu.1' 'cJun' 'EgrNab'
 'Gfi1']

In case this is of relevance, the output of the previous code block (the one running Louvain clustering) differed slightly from what's shown in the notebook on GitHub, including a few warnings:

running Louvain clustering
    using data matrix X directly for building graph (no PCA)
    computing data graph with n_neighbors = 30 
    computing spectral decomposition ("diffmap") with 10 components
    eigenvalues of transition matrix
    [ 1.            1.            1.            0.998418808   0.9969582558
      0.9920811653  0.9909049273  0.9824624062  0.9651805162  0.9614249468]
    using the "louvain" package of Traag (2017)
    finished (0:00:00.902): found 15 clusters and added
    'louvain_groups', the cluster labels (adata.smp, dtype=category)
running Approximate Graph Abstraction (AGA)
    using data matrix X directly for building graph (no PCA)
    computing data graph with n_neighbors = 30 
    computing spectral decomposition ("diffmap") with 10 components
    eigenvalues of transition matrix
    [ 1.            1.            1.            0.998418808   0.9969582558
      0.9920811653  0.9909049273  0.9824624062  0.9651805162  0.9614249468]
    abstracted graph will have 15 nodes
    finished (0:00:00.800)
/Users/kp9/anaconda3/envs/orig/lib/python3.5/site-packages/scanpy-0.2.9.1+45.g4c64b30.dirty-py3.5-macosx-10.6-x86_64.egg/scanpy/tools/aga.py:614: RuntimeWarning: divide by zero encountered in true_divide
  self.segs_adjacency_full_attachedness = 1/segs_distances
/Users/kp9/anaconda3/envs/orig/lib/python3.5/site-packages/scanpy-0.2.9.1+45.g4c64b30.dirty-py3.5-macosx-10.6-x86_64.egg/scanpy/tools/aga.py:630: RuntimeWarning: divide by zero encountered in true_divide
  segs_distances = 1/full_attachedness
/Users/kp9/anaconda3/envs/orig/lib/python3.5/site-packages/scanpy-0.2.9.1+45.g4c64b30.dirty-py3.5-macosx-10.6-x86_64.egg/scanpy/tools/aga.py:215: RuntimeWarning: divide by zero encountered in true_divide
  1./aga.segs_adjacency_full_attachedness)

Any idea what may be going on and how to fix it? Thanks, and sorry for the trouble.

falexwolf commented 6 years ago

Sorry about that; the master branch of Scanpy was unfortunately unstable over the past few days... Scanpy underwent some core changes. This won't happen in the future.

Tonight, we released version 0.3, which is stable.

Also, the master branch will be kept stable, so you can just git pull and the Scanpy API, with all example notebooks here and in scanpy_usage, will work nicely. Only the command-line interface might still have some issues...

Cheers, Alex

ktpolanski commented 6 years ago

Yeah, I eventually managed to resolve that particular hurdle by pip-installing version 0.6 of louvain rather than installing it from source. I then got a weird tree that wasn't compatible with what was shown in the notebook, and it crashed out on the next step due to an internal package typo, so I fixed the typo (which is gone from scanpy by now, so no need to report it), and then it crashed out on the "gene changes along paths" step for matplotlib reasons. At that stage I decided to reinstall everything cleanly from pip for some reason, and now I bump into this:

... did not find original file /Users/kp9/anaconda3/envs/orig/lib/python3.5/site-packages/scanpy/datasets/krumsiek11.txt

I got the file off GitHub, rammed it in the specified location by hand (as it wasn't there for some reason) and continued. The whole notebook ran this time. However, the different output trees persisted:

[image: the tree with 15 clusters]

Thing is, they're consistent. That's the second time I've gotten this bizarro tree. Some assistance please? Sorry about this.

falexwolf commented 6 years ago

Damn, sorry; the dataset is not on PyPI... Grrr... Thanks for notifying me.

The tree looks strange because the root cells are picked in an unfortunate way. Choose the roots root=[0, 2, 11], and it will look nice.

Alternatively, you can also choose the 'fr' layout, for example.
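
To make that concrete, something along these lines should work - this is just a sketch reusing the sc.pl.aga call from the notebook (the one visible in your traceback), with the suggested roots and layout swapped in:

# the notebook's plotting call, with the suggested roots for the tree layout
sc.pl.aga(adata,
          title='single-cell graph', save=True,
          root=[0, 2, 11], layout='rt',
          title_graph='abstracted graph')

# or use a force-directed layout instead of the tree
sc.pl.aga(adata, layout='fr', title_graph='abstracted graph')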

falexwolf commented 6 years ago

Hope this solves it https://github.com/theislab/scanpy/commit/e3afd9e41137dc5c49dd1afc4e21e4efb25080d0

Uploaded version 0.3.1 to PyPI for this.

ktpolanski commented 6 years ago

Oh, that did the trick! It still has 15 clusters in it; the previous step still produces different output on my machine than in your notebook for some reason. Is this expected? Maybe in that case it'd be better to make the pseudotime plot before making the tree and instruct the user to spot the progenitor based on where the pseudotime "flows" from? Come to think of it, a bit of commentary would probably help a lot.

[image: the better-looking tree]
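
Something along these lines is what I had in mind for the pseudotime-first idea - just a sketch, using the 'aga_pseudotime' key that the earlier error message listed as a valid annotation:

# sketch: colour by pseudotime first, so the user can spot where it "flows" from
sc.pl.aga(adata, color='aga_pseudotime', title_graph='abstracted graph')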

So, seeing how I've graduated from replicating the demo to potentially being a user, here are a few potential questions (these may be of use when writing commentary too?):

Sorry for all the hassle, but this looks like a super exciting tool and I'd like to get to use it as well as possible.

falexwolf commented 6 years ago

Don't worry about the details of the clustering. These depend on the resolution and other parameters; the topology is the important thing. Also see https://github.com/theislab/graph_abstraction/tree/master/minimal_examples#zooming-into-particular-regions-of-the-data.
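
If you do want to play with it, something like the following lets you vary the clustering and recompute the abstraction - just a sketch; the resolution keyword follows the current documentation and I haven't checked it against 0.3:

# hypothetical sketch: adjust the Louvain resolution and re-run the abstraction;
# the 'resolution' keyword may not exist in the 0.3 wrapper
sc.tl.louvain(adata, resolution=0.5)
sc.tl.aga(adata)
sc.pl.aga(adata)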

All of these use cases start from scratch.

There is https://scanpy.readthedocs.io/en/latest/api/scanpy.api.tl.rank_genes_groups.html, which you can apply to terminal or intermediate clusters/partitions of the graph. You can also compare genes for different stages on the graph. Let me know if this works for you. See the standard Seurat clustering tutorial for how it's used.
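
A rough sketch of what I mean - the keyword names follow the linked documentation page, 'louvain_groups' is the cluster annotation computed above, and the particular group labels are just placeholders:

# hypothetical sketch: rank marker genes for one partition of the graph against another,
# e.g. for a 1 -> 2 transition on the abstracted graph
sc.tl.rank_genes_groups(adata, groupby='louvain_groups', groups=['2'], reference='1')
sc.pl.rank_genes_groups(adata)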

Of course, simply type

import scanpy.api as sc  # scanpy's top-level API
import pandas as pd

adata = sc.read('mydata...')  # read the expression matrix
adata.smp['my_time'] = pd.read_csv('annotations.csv')['my_time']  # attach a sample annotation
sc.tl.tsne(adata)
sc.pl.tsne(adata, color='my_time')  # colour the tSNE by the annotation added above

Please tell me where you'd like to find these things in the documentation...

ktpolanski commented 6 years ago

Sorry, I'm not trying to be difficult, and I hope I'm not being too ignorant. Yeah, I found the Seurat-like tutorial and enjoyed it quite a bit, hence the suggestion to possibly add some more description to the notebooks. I'll admit I didn't try the paul15 example, as I looked at the GitHub repository, didn't find the file there, and just assumed it was missing. I didn't think it'd be automatically downloaded upon calling - sorry about that.

So I guess what you're saying is that if I have, say, 1 -> 2 -> 3 in the graph, then I could use the Seurat-like marker identification to find 1 -> 2 and 2 -> 3 markers and use those for the heatmap? Yeah, that does make sense. Thanks for the little code snippet; I guess I can colour the FR plot in a similar manner once I have the data loaded up.
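
Something like this is what I have in mind for the colouring - just a sketch, with 'my_time' being the hypothetical annotation from your snippet above:

# sketch: colour the single-cell graph and abstracted graph by the loaded annotation
sc.pl.aga(adata, color='my_time')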

falexwolf commented 6 years ago

You're not difficult at all! Your feedback is very helpful and I will try to make the documentation better. All of this is so recent that we might have forgotten to mention bits and pieces.

Yes, you can color the FR plot and any other visualizations you might prefer in the same way.

And yes, try the marker gene strategy. In parallel, we are also thinking about how to do this in the best way...

falexwolf commented 6 years ago

Now, the nestorowa16 example makes much more sense, as the data is available. https://github.com/theislab/graph_abstraction/blob/master/nestorowa16/nestorowa16.ipynb

ktpolanski commented 6 years ago

Great, thanks for putting the data up! Now I know how to format my input file.

If the documentation on that pairwise comparison thing you linked me to is anything to go on, you're very very close to having a fantastic, user-friendly package on your hands. However, compare this to this. They both technically accomplish the same goal - they show you how to use a package. However, the first link explains the key points of what it's doing via some commentary, while the second link just spits out a figure, hiding half the necessary procedures in comments without a word and storing the results as similarly unmentioned precomputed fields in the data it loads.

Your notebooks are nowhere near as bad as the second example, mind you, but you could use some more narrative injected directly into one of them - look at Seurat: they have very verbose guides for basic usage and for dataset merging via CCA, and then more terse demonstration files for analyses of larger datasets. I found this repository from a link in your paper, loaded it up, and was greeted with a great readme page that detailed a number of different examples. Fantastic! Let's go to minimal_examples, as that sounds like a good starting point for a newbie like me. Hey wait, this is minimally commented. Let's check paul15 and nestorowa16 in that case - hey wait, those are minimally commented too. You have a great front end, and you seem to have great documentation for the advanced user too, but nothing that would transition me easily from front-end absolute beginner to advanced user managed to catch my eye.

Pretty much all you need to do is thoroughly annotate one of the notebooks, explaining or at least mentioning just about everything that may be of relevance to a potential user, and then advertise it heavily from the readmes that load up in the repository and from the other, less annotated notebooks. How can you format your input files, and how do you load them (minimal_examples is bad for that, as it just sneakily loads a dataset in the background)? What needs to be provided as input to the method - counts, log counts, TPMs, what have you? From what's been mentioned in here - how do you actually visually identify the roots, especially as the threat of the clustering turning out differently is real? An approach that's helped me when writing this sort of material is assuming the user knows nothing or next to nothing, and detailing the key points of everything to make sure they're on the right page at all times before continuing. Just hold their hand through everything, at least the first time through. If you can imagine a user even potentially asking about something, spell it out just in case.

falexwolf commented 6 years ago

Thank you for this super detailed remark! I will take some time as soon as I can to follow your suggestions. Everything you say makes a lot of sense!