maximilianh / cellBrowser

main repo: https://github.com/ucscGenomeBrowser/cellBrowser/ - Python pipeline and Javascript scatter plot library for single-cell datasets, http://cellbrowser.rtfd.org
GNU General Public License v3.0

semicolons in cell names not properly handled #227

Closed dvera closed 2 years ago

dvera commented 3 years ago

I have semicolons in my cell names, and the vast majority of cells were removed because they did not "appear" in the metadata.

ubuntu@ip-172-31-24-90:~/creighton_wt$ sudo cbBuild -o /tmp/cb -p 80
INFO:root:dataRoot is not set in ~/.cellbrowser.conf or via $CBDATAROOT. Dataset hierarchies are not supported.
INFO:root:Determining if /tmp/cb/creighton_wt/exprMatrix.tsv.gz needs to be created
INFO:root:Reading headers from file /tmp/cb/creighton_wt/exprMatrix.tsv.gz
INFO:root:current input matrix looks identical to previously processed matrix, same file size, same sample names
INFO:root:/tmp/cb/creighton_wt/meta.tsv has the same md5 as in /tmp/cb/creighton_wt/dataset.json, no need to rebuild meta data
INFO:root:Reading sample names from /tmp/cb/creighton_wt/meta.tsv
INFO:root:Checking and reordering meta data to /tmp/cb/creighton_wt/meta.tsv
INFO:root:Reading sample names from /home/ubuntu/creighton_wt/meta.tsv
INFO:root:Reading headers from file /home/ubuntu/creighton_wt/exprMatrix.tsv.gz
WARNING:root:14576 sample names are in the expression matrix, but not in the meta data. Examples: ['wta;TATCGCCGTCGAACGA-1', 'wtb;GTTGAACAGACGCATG-1', 'wta;GGCACGTCACAACGCC-1', 'wta;GTTTGGACACATTCTT-1', 'wtb;AGGTCTACAGCCTTCT-1', 'wtb;CTACTATGTAGATTGA-1', 'wtb;GTCAAGTCATCGATGT-1', 'wta;TCGTAGACATCGGCCA-1', 'wta;TTCCTAATCTGGCCTT-1', 'wtb;TCAGCCTCAGTTAGAA-1']
WARNING:root:These samples will be removed from the expression matrix, if possible
INFO:root:Data contains 6 samples/cells
[...]

maximilianh commented 3 years ago

Oh! Semicolons in IDs, that's unusual. As far as I know, R doesn't allow this, so there is code that "cleans" the IDs. Do you think you can share the matrix and meta.tsv with me @.*** ? That would be easiest; otherwise I can dig into the code and look for the replace call where the "cleaning" of the meta IDs happens...


matthewspeir commented 2 years ago

@maximilianh did anything ever get changed here? Should we close this ticket?

maximilianh commented 2 years ago

I didn't do anything special, and Daniel didn't get back to me. However, I happened to change how the sample names are read for other reasons, and I believe (hope) that this is no longer a problem. There are two functions, readMatrixSampleNames and readMetaSampleNames, and neither does special-character replacement anymore, though I haven't tested it. Either way, it may be safe to close this ticket for now, because it's a rather unusual case.
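
For anyone who wants to check their own files before running cbBuild, here is a minimal sketch of the comparison described above: read the sample names from the matrix header and from the first column of meta.tsv, with no character replacement, and list the names that do not match. The function names below (open_text, read_matrix_ids, read_meta_ids, missing_from_meta) are hypothetical and are not the cellBrowser functions; the sketch also assumes tab-separated files where the matrix header holds one gene column followed by the cell IDs.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def read_matrix_ids(path):
    """Return the cell IDs from the matrix header line (all fields after the first).

    Assumption: the first header field names the gene column.
    """
    with open_text(path) as fh:
        header = fh.readline().rstrip("\r\n")
    return header.split("\t")[1:]


def read_meta_ids(path):
    """Return the cell IDs from the first column of the meta file, skipping its header."""
    with open_text(path) as fh:
        fh.readline()  # skip the header line
        return [line.rstrip("\r\n").split("\t")[0] for line in fh if line.strip()]


def missing_from_meta(matrix_path, meta_path):
    """List matrix cell IDs that have no exact match in the meta file.

    IDs are compared verbatim, so characters such as ';' are left untouched.
    """
    meta_ids = set(read_meta_ids(meta_path))
    return [cell_id for cell_id in read_matrix_ids(matrix_path) if cell_id not in meta_ids]
```

If missing_from_meta("exprMatrix.tsv.gz", "meta.tsv") returns an empty list, the IDs agree character for character, semicolons included.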


matthewspeir commented 2 years ago

I duplicated and edited the matrix, meta, and tsne files of a small dataset to include a semicolon, and everything seems to be working fine.
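
A test copy like this can be produced with a few lines of Python. The sketch below is only an illustration of the idea, not the script that was actually used: the helper names (prefix_matrix_header, prefix_first_column) and the "wta;" prefix are assumptions, and it assumes tab-separated files where the cell ID is the first column of the meta and coordinate files and the matrix header lists the cell IDs after one gene column.

```python
def prefix_matrix_header(header_line, prefix):
    """Add a prefix to every cell ID in a matrix header line.

    The first field (the gene column name) is left unchanged.
    """
    fields = header_line.rstrip("\r\n").split("\t")
    return "\t".join([fields[0]] + [prefix + name for name in fields[1:]])


def prefix_first_column(lines, prefix):
    """Add a prefix to the first column of every data row.

    The first line is treated as a header and returned unchanged
    (apart from stripping the line ending).
    """
    out = [lines[0].rstrip("\r\n")]
    for line in lines[1:]:
        fields = line.rstrip("\r\n").split("\t")
        fields[0] = prefix + fields[0]
        out.append("\t".join(fields))
    return out


# Example: give every cell ID a sample prefix that contains a semicolon.
new_header = prefix_matrix_header("gene\tAAA-1\tCCC-1", "wta;")
new_meta = prefix_first_column(["cellId\tcluster", "AAA-1\t1", "CCC-1\t2"], "wta;")
```

Applying the same prefix to the matrix header, the meta file, and the coordinates file keeps the IDs consistent across all three inputs.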