maximilianh / cellBrowser

main repo: https://github.com/ucscGenomeBrowser/cellBrowser/ - Python pipeline and Javascript scatter plot library for single-cell datasets, http://cellbrowser.rtfd.org
https://github.com/ucscGenomeBrowser/cellBrowser/
GNU General Public License v3.0

Error when trying to recalculate markers #128

Closed matthewspeir closed 5 years ago

matthewspeir commented 5 years ago

When trying to recalculate the markers for a dataset, I get the following error:

INFO:root:Finding top markers for each cluster
Traceback (most recent call last):
  File "/cluster/home/mspeir/miniconda3/bin/cbScanpy", line 10, in <module>
    sys.exit(cbScanpyCli())
  File "/cluster/home/mspeir/miniconda3/lib/python3.6/site-packages/cellbrowser/cellbrowser.py", line 4680, in cbScanpyCli
    adata, params = cbScanpy(matrixFname, metaFname, inCluster, confFname, figDir, logFname)
  File "/cluster/home/mspeir/miniconda3/lib/python3.6/site-packages/cellbrowser/cellbrowser.py", line 4525, in cbScanpy
    sc.tl.rank_genes_groups(adata, clusterField)
  File "/cluster/home/mspeir/miniconda3/lib/python3.6/site-packages/scanpy/tools/_rank_genes_groups.py", line 119, in rank_genes_groups
    adata, groups_order, groupby)
  File "/cluster/home/mspeir/miniconda3/lib/python3.6/site-packages/scanpy/utils.py", line 649, in select_groups
    groups_order = adata.obs[key].cat.categories
  File "/cluster/home/mspeir/miniconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 4368, in __getattr__
    return object.__getattribute__(self, name)
  File "/cluster/home/mspeir/miniconda3/lib/python3.6/site-packages/pandas/core/accessor.py", line 133, in __get__
    accessor_obj = self._accessor(obj)
  File "/cluster/home/mspeir/miniconda3/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 2378, in __init__
    self._validate(data)
  File "/cluster/home/mspeir/miniconda3/lib/python3.6/site-packages/pandas/core/arrays/categorical.py", line 2387, in _validate
    raise AttributeError("Can only use .cat accessor with a "
AttributeError: Can only use .cat accessor with a 'category' dtype

Command:

cbScanpy -e exprMatrix.tsv.gz -m meta.tsv -o tms-bat-facs_marker-recalc -n tms-bat-facs --inCluster 'Louvain Cluster'

Files:

/hive/data/inside/cells/datasets/tabula-muris-senis/tms-bat-facs

(Also is it really using the input clusters to recalc the marker genes? I see this in the output which makes me think it's not: INFO:root:Found 22 louvain clusters...)

maximilianh commented 5 years ago

The Louvain message is a leftover from the original code. I’ll change it.

As far as the error is concerned, I’m afraid that this is yet another scanpy change. I’ll look into it tomorrow... unfortunately that’s a bit late for you...

maximilianh commented 5 years ago

Thanks for your email reminder. This also happens with my scanpy and I'm looking into it now. It didn't happen for the other datasets... looking...

maximilianh commented 5 years ago

The problem is the Louvain Cluster was auto-detected to be in "number" format instead of the usual "category" format (e.g. "Cluster 1" is "category", but just "1" is "number"). I'm pretty sure that this used to work, as I know I've had this problem before and even opened an issue over in the scanpy github and it went away. But I'll add the same fix I've used before to force it to category format now.
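To make the dtype problem concrete, here is a minimal pandas sketch. It is not the actual cellbrowser fix, only an illustration: the column name 'Louvain Cluster' comes from this dataset, while the cell IDs and values are made up. Purely numeric labels get an integer dtype, so the .cat accessor fails until the column is converted to strings and then to a categorical.

```python
import pandas as pd

# Toy metadata: purely numeric cluster labels (cell IDs and values are made up).
meta = pd.DataFrame({"Louvain Cluster": [1, 2, 2, 3]}, index=["c1", "c2", "c3", "c4"])

def has_cat_accessor(series):
    """Return True if the .cat accessor can be used on this series."""
    try:
        series.cat.categories
        return True
    except AttributeError:
        return False

# Integer dtype: .cat raises the AttributeError seen in the traceback.
before = has_cat_accessor(meta["Louvain Cluster"])

# Force the labels to strings, then to the 'category' dtype.
meta["Louvain Cluster"] = meta["Louvain Cluster"].astype(str).astype("category")

after = has_cat_accessor(meta["Louvain Cluster"])
categories = list(meta["Louvain Cluster"].cat.categories)
```

Converting to strings first keeps the labels as names rather than numbers, which matches the "Cluster 1" versus "1" distinction described above.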

maximilianh commented 5 years ago

Now that I've fixed the data type of the Louvain Cluster, it's complaining that the mean of some gene is 0. I think this could be considered a bug in scanpy, but let's ignore that for now. I think I need to run highly variable genes first before finding the markers and that function needs to know if the matrix has been log'ed before or not... sigh... not sure what to do.... I'll log the matrix for now I guess, and we'll think about it later...

matthewspeir commented 5 years ago

Thanks, Max!

When you say "it didn't happen for the other datasets...", did you use the 'Louvain Cluster' field as the cluster field? Looking at the FACS bladder dataset, it looks like you used the 'free_annotation' field rather than the 'Louvain Cluster' field. I thought we were going with the Louvain Cluster field for the label, not cell type.

maximilianh commented 5 years ago

No wait, what I wrote is true in general, but not for this particular dataset, as it's been log'ed already.

The problem, I think, is cluster 4: it contains only a single cell. That crashes the marker gene step (which is probably another scanpy bug). A cluster with a single cell doesn't make a lot of sense... I'll try something to remove this cell, but we should probably ask Angela whether something is wrong here...
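A rough sketch of how single-cell clusters could be spotted and set aside before the marker step. This is an assumption about the approach, not the code that was committed: the helper name drop_small_clusters, the min_cells threshold and the toy cell IDs are all hypothetical.

```python
from collections import Counter

def drop_small_clusters(cell_to_cluster, min_cells=2):
    """Split cells into (kept, dropped) by the size of the cluster they belong to."""
    sizes = Counter(cell_to_cluster.values())
    kept = {c: k for c, k in cell_to_cluster.items() if sizes[k] >= min_cells}
    dropped = {c: k for c, k in cell_to_cluster.items() if sizes[k] < min_cells}
    return kept, dropped

# Toy assignment (cell IDs are made up): cluster "4" has a single cell.
assignment = {"cellA": "1", "cellB": "1", "cellC": "2", "cellD": "2", "cellE": "4"}
kept, dropped = drop_small_clusters(assignment)
```

Returning the dropped cells separately makes it easy to report them to whoever produced the clustering.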

maximilianh commented 5 years ago

OK, I've committed something; it contains a lot of changes. Can you do the --pre upgrade (0.6a2)? It'll break something else, but it should at least get you past this. The Scanpy/Pandas combo is a nightmare for me to work with: they have very different conventions, and scanpy still breaks on so many things. It may also not work with your scanpy version, in which case you should probably upgrade your scanpy (it's a bit more mature right now).

matthewspeir commented 5 years ago

Seems to have worked well. Although the cluster names in the output markers file don't seem to match those in the 'Louvain Cluster' input column. The input 'Louvain Cluster' column contains names like 1, 2, 3, etc. whereas my output markers file now has names like '0_B cell'.

maximilianh commented 5 years ago

Wow that’s a weird bug. What is the name of the field that contains “0_B cell” ? “louvain” ?

matthewspeir commented 5 years ago

I think it's called 'cluster_names'

maximilianh commented 5 years ago

I can't reproduce this. In hive/data/inside/cells/datasets/tabula-muris-senis/facs/bat, I've run cbScanpy like this:

cbScanpy -e exprMatrix.tsv.gz -m meta.tsv -o tms-bat-facs_marker-recalc -n tms-bat-facs --inCluster 'Louvain Cluster' --copyMatrix

(note that you really need --copyMatrix otherwise the matrix is really small now, as it will only copy the highly variable genes)

the resulting markers.tsv has the numbers of the "Louvain Cluster" field, as expected.

maximilianh commented 5 years ago

The output looks like this:

cluster_name  gene     z_score
0             Cd79a    80.506996
0             H2-DMb2  41.33267

> When you say "it didn't happen for the other datasets...", did you use the 'Louvain Cluster' field as the cluster field? Looking at the FACS bladder dataset, it looks like you used the 'free_annotation' field rather than the 'Louvain Cluster' field. I thought we were going with the Louvain Cluster field for the label, not cell type.

Oh, right, we've changed the fields. Sorry. So, just to confirm that the new version does work as expected, I now ran this:

cbScanpy -e exprMatrix.tsv.gz -m meta.tsv -o tms-bat-facs_marker-recalc -n tms-bat-facs --inCluster 'cluster_names' --copyMatrix

And the resulting markers.tsv does have the new corrected cluster names in the first column:

cluster_name  gene     z_score
0_B cell      Cd79a    80.506996
0_B cell      H2-DMb2  41.33267
0_B cell      Cd79b    28.653046
0_B cell      Faim3    27.186304

So all seems to work fine. There were a few cellbrowser.conf files with the wrong fields in facs/, I've fixed them up now.
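A small sketch of how one might check that the cluster names in a markers file match the metadata field passed to --inCluster. The column names cluster_name and cluster_names are the ones shown in this thread; the cellId column, the helper name and the toy rows are assumptions for illustration.

```python
import csv
import io

def distinct_values(tsv_text, column):
    """Collect the distinct values of one column from tab-separated text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[column] for row in reader}

# Toy stand-ins for meta.tsv and markers.tsv (abbreviated; cell IDs are made up).
meta_tsv = "cellId\tcluster_names\ncellA\t0_B cell\ncellB\t0_B cell\n"
markers_tsv = (
    "cluster_name\tgene\tz_score\n"
    "0_B cell\tCd79a\t80.506996\n"
    "0_B cell\tH2-DMb2\t41.33267\n"
)

# Marker clusters that do not appear in the metadata field at all.
unknown = distinct_values(markers_tsv, "cluster_name") - distinct_values(meta_tsv, "cluster_names")
```

An empty result means every cluster in the markers file exists in the metadata column; anything left over points at a cellbrowser.conf that names the wrong field.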

Do you want to run a for loop over all directories now to calculate the markers or shall I do it?

It should be a command like this:

for i in $(find -maxdepth 1 -type d | cut -c3-); do cd $i; cbScanpy -e exprMatrix.tsv.gz -m meta.tsv -o recalc -n $i --inCluster 'cluster_names' --copyMatrix; cbMarkerAnnotate recalc/markers.tsv recalc/markers_annot.tsv; cd ..; done

You can then fix up the marker file pointers:

sed -i 's|markers.annotated.tsv|recalc/markers_annotated.tsv|' */cellbrowser.conf

(Actually I just did this, so no need to do this anymore for the facs datasets)

And rebuild all the cell browsers:

cbBuild -r

maximilianh commented 5 years ago

Also, I just noticed: I should probably rename the "clusterField" setting to "defaultColorField", don't you think? clusterField makes little sense.

(I'll stay backwards compatible, so it'll still look for "clusterField" if "defaultColorField" is not found)
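The backwards-compatible lookup could look roughly like this. It is a sketch, not the cellbrowser code: the function name is hypothetical, and the key names are the ones proposed in this comment.

```python
def get_default_color_field(conf):
    """Return the colouring field, preferring the new key over the legacy one."""
    for key in ("defaultColorField", "clusterField"):
        if key in conf:
            return conf[key]
    return None

# An old config with only the legacy key, and a newer one with both keys.
old_conf = {"clusterField": "Louvain Cluster"}
new_conf = {"defaultColorField": "cluster_names", "clusterField": "Louvain Cluster"}
```

Checking the new key first means existing cellbrowser.conf files keep working unchanged, while a file that sets both keys follows the new one.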

maximilianh commented 5 years ago

Never mind about running these commands, I've modified them a little to run them in parallel, that should be a lot faster. I'll report back how this went. It may be useful for other projects.

matthewspeir commented 5 years ago

Yeah, I think 'defaultColorField' makes more sense now, but I agree that it would be nice to keep it backward compatible.

maximilianh commented 5 years ago

I am accepting defColorField now, just need to update the documentation.

The final commands that I ran were:

for i in $(find -type d | cut -c3-); do echo cbScanpy -e $i/exprMatrix.tsv.gz -m $i/meta.tsv -o $i/recalc -n $i --inCluster 'cluster_names' --copyMatrix; done > commands.txt
parallel --jobs 10 < commands.txt

for i in $(find -maxdepth 1 -type d | cut -c3-); do echo cbMarkerAnnotate $i/recalc/markers.tsv $i/recalc/markers_annot.tsv; done > commands2.txt
parallel --jobs 10 < commands2.txt
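For projects where GNU parallel is not available, the same pattern (build one command per dataset directory, then run a fixed number at a time) can be sketched in Python. This is an illustrative alternative, not what was run here: the helper names and the example directory names are assumptions, and the cbScanpy flags are copied from the command above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def build_commands(dataset_dirs):
    """Build one cbScanpy argument list per dataset directory."""
    return [
        ["cbScanpy", "-e", f"{d}/exprMatrix.tsv.gz", "-m", f"{d}/meta.tsv",
         "-o", f"{d}/recalc", "-n", d, "--inCluster", "cluster_names", "--copyMatrix"]
        for d in dataset_dirs
    ]

def run_parallel(commands, jobs=10):
    """Run the commands with at most `jobs` processes at once; return exit codes in order."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))

# Example directory names are made up; nothing runs until run_parallel is called.
commands = build_commands(["bat", "bladder"])
```

Threads are enough here because each worker only waits on an external process; calling run_parallel(commands) would then play the role of the parallel --jobs 10 line.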

I did a lot of marker gene recalculation now and I think we can close this. Until it breaks again. :-)
