nemoarchive / analytics

Repository for the NeMO Analytics project.

3 datasets hanging at Gene Symbol loading step in view curator #153

Closed carlocolantuoni closed 3 years ago

carlocolantuoni commented 3 years ago

My collaborator Yash in Johns Hopkins Biomedical Engineering has successfully uploaded 3 new datasets to NeMO Analytics, but is encountering a problem when he tries to curate views of these datasets. Below I have included screenshots of the 3 datasets in the Dataset Explorer as well as a screenshot of where things stall for all 3 datasets in the curator. Basically, it never loads GeneSymbols for the datasets. We have looked at the "genes" column we uploaded and they look like properly formatted GeneSymbols to us. Is there a way to see what is causing the hangup?

image (1) image (2)

carlocolantuoni commented 3 years ago

The URLs for the 3 datasets are:
https://nemoanalytics.org/p?s=3ff685af
https://nemoanalytics.org/p?s=8352d6b5
https://nemoanalytics.org/p?s=79c7a289

carlocolantuoni commented 3 years ago

Yash and I have both tried to curate other datasets on the same computers/browsers and have no trouble getting GeneSymbols to come up for those datasets.

Let us know what else we can do to help.

thanks!

adkinsrs commented 3 years ago

The POST request to the API route that gets the dataset's genes is throwing an error. I'm about to go onto the server and find out why.

adkinsrs commented 3 years ago

The error is 'DataFrame' object has no attribute 'gene_symbol'

Looking at one of the datasets now.

>>> import anndata
>>> adata = anndata.read("./676783af-3879-7ba3-7574-08549d1a53a0.h5ad")
>>> adata.var
Empty DataFrame
Columns: []
Index: [RP11-560A15.3, RPS11, CREB3L1, RPL10P14, PNMA1, RP11-783L4.1, AC092634.2, RP11-798K23.4, TMEM216, TRAF3IP2-AS1, C10orf90, RP1-273G13.1, CTD-2240J17.4, ERCC5, RP11-96K19.5, RP11-201E8.1, APBB2, AC097724.3, KLHL13, RNU4ATAC2P, RP11-360F5.3, CADM4, MIR6500, XXbac-BPG157A10.21, CST2P1, SLC10A7, OR5H5P, CFHR5, OR2K2, LMAN1, RP11-6O2.3, CHD8, SUMO1, BOLA3-AS1, CTD-2193P3.1, IFNWP18, AC016561.1, AC012314.20, RP11-463J10.3, MMP7, MIR1976, RP11-335O4.3, CIR1P2, XAB2, Z85986.1, ADAM21P1, RP11-96B2.1, RN7SL499P, RP11-554L12.2, CTC-487M23.8, RNVU1-14, ZBTB12, UTY, CENPQ, RP4-754E20__A.5, DTNBP1, LINC00683, AC012065.4, RP11-70F11.11, ZG16, RP11-116N8.2, PRKAG2-AS1, MIR582, AC091178.2, AC006499.7, MIER1, RNA5SP93, RP11-384G23.1, ARID3C, RNU7-164P, RP1-39G22.7, WBP1LP6, RP11-271C24.2, TRMT112P4, LLNLR-284B4.1, MIR489, RP11-263I1.1, GRM2, MIR4511, PROSC, RNU1-124P, RP11-309L24.10, CXCL13, RP13-20L14.4, EHHADH-AS1, RP11-201K10.3, RNU6-332P, SYN3, LINC00210, SLC22A2, SERPINF1, WDR34, SUGCT, FAM8A6P, EPT1, BNIP3P5, KB-226F1.2, RP11-74J13.8, LHB, CTD-2515C13.2, ...]

The gene symbols were assigned to the index of the "anndata.var" dataframe, which is not what is expected. Typically the Ensembl ID is assigned to the index and the gene symbols are assigned to a separate "gene_symbol" column.

@carlocolantuoni is it possible to see one of the original files you and Yash uploaded?
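A minimal pandas sketch of why that lookup fails, assuming the server code reads the symbols through attribute access on the var dataframe (the thread does not show that code). The frames `var_bad` and `var_ok` and the helper `has_gene_symbol_column` are hypothetical stand-ins for illustration, not gEAR code:

```python
import pandas as pd

# Hypothetical stand-in for the problematic adata.var:
# gene symbols sit in the index and there are no columns at all.
var_bad = pd.DataFrame(index=["RPS11", "CREB3L1", "PNMA1"])

# Hypothetical stand-in for the expected layout:
# Ensembl IDs in the index, symbols in a "gene_symbol" column.
var_ok = pd.DataFrame(
    {"gene_symbol": ["Xkr4", "Rp1"]},
    index=["ENSMUSG00000051951", "ENSMUSG00000025900"],
)


def has_gene_symbol_column(var: pd.DataFrame) -> bool:
    """Return True when the frame carries a 'gene_symbol' column."""
    return "gene_symbol" in var.columns


# Attribute access on a missing column raises AttributeError,
# which is the kind of error reported above.
try:
    var_bad.gene_symbol
    message = ""
except AttributeError as exc:
    message = str(exc)

print(has_gene_symbol_column(var_bad), has_gene_symbol_column(var_ok))
print(message)
```

Checking `var.columns` before touching `var.gene_symbol` would let a caller report the layout problem explicitly instead of failing inside the request.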

carlocolantuoni commented 3 years ago

hey shaun, here is 1 of the tar balls yash uploaded

carlocolantuoni commented 3 years ago

looks like I attached nothing - Shaun, is there a way I can send a file here in GitHub or should I email it?

carlocolantuoni commented 3 years ago

humvchimp.tar.gz https://drive.google.com/file/d/19Frl_-CH2H7xHRcrErf5Ahwi0hjPk2UU/view?usp=drive_web here is 1 of the tar balls yash uploaded

carlocolantuoni commented 3 years ago

sent an email with google drive attachment - looks like it worked in the link above - let me know if u can get the file

adkinsrs commented 3 years ago

Received tarball... thanks!

adkinsrs commented 3 years ago

@carlocolantuoni

In the genes.tab file, both the Ensembl ID and the gene_symbol need to be provided, like so (in this random genes.tab file I had on hand).

ensembl_ID  gene_symbol
ENSMUSG00000051951  Xkr4
ENSMUSG00000089699  Gm1992
ENSMUSG00000102343  Gm37381
ENSMUSG00000025900  Rp1
ENSMUSG00000109048  Rp1
ENSMUSG00000025902  Sox17
ENSMUSG00000104328  Gm37323
ENSMUSG00000033845  Mrpl15
ENSMUSG00000025903  Lypla1

Lots of gEAR code relies on the "gene_symbol" column in the AnnData object, and if the "gene_symbol" column is the only column uploaded via the genes.tab file, then it is treated as the index column instead. Can you and Yash make this correction and resubmit?
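A rough pre-upload check along these lines could catch the problem before submission. This is only a sketch under stated assumptions: that genes.tab is tab-delimited with a header row, and that the first column is the one used as the index. The function name `check_genes_header` is hypothetical and not part of gEAR:

```python
import csv
import io

REQUIRED_COLUMN = "gene_symbol"


def check_genes_header(handle):
    """Return a list of problems found in the header of a tab-delimited genes file.

    Assumes the first column becomes the index, so "gene_symbol"
    must appear in a later column.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    if len(header) < 2:
        return ["only one column found; an ID column and a gene_symbol column are both needed"]
    if REQUIRED_COLUMN not in header[1:]:
        return ["no 'gene_symbol' column found after the first (ID) column"]
    return []


# Usage with in-memory examples; a real file would be opened with
# open(path, newline="") and passed in the same way.
good = io.StringIO("ensembl_ID\tgene_symbol\nENSMUSG00000051951\tXkr4\n")
bad = io.StringIO("gene_symbol\nXkr4\n")
print(check_genes_header(good))
print(check_genes_header(bad))
```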

carlocolantuoni commented 3 years ago

will do thnx!

carlocolantuoni commented 3 years ago

it worked shaun! thanks!
