linnarsson-lab / loompy

Python implementation of the Loom file format - http://loompy.org
BSD 2-Clause "Simplified" License

unable to connect to loom file #78

Closed orrzor closed 5 years ago

orrzor commented 5 years ago

Hi Loom team, I am having trouble reading in a loom file using loompy. I started a jupyter lab session in a conda environment and tried to read in a loom file from http://scope.aertslab.org/#/Bernard_Thienpont , under the User Uploaded -> Uncategorized tab (Thienpont_T-cell_v4_R_fixed.loom).

Here are the relevant parts of my conda environment

loompy                    2.0.15                    <pip>
python                    3.6.6                h5001a0f_0    conda-forge

Here is the error. I don't know whether this is an issue with the latest loompy version. Thanks so much for any help! -Orr

import loompy
with loompy.connect('./Thienpont_T-cell_v4_R_fixed.loom') as ds:
    ds.shape

Column attribute 'RegulonsAUC' dtype [('TBX21_extended (69g)', '<f8'), ('TBX21 (58g)', '<f8'), ('ELF1_extended (987g)', '<f8'), ('ELF1 (753g)', '<f8'), ('EOMES_extended (223g)', '<f8'), ('EOMES (168g)', '<f8'), ('RUNX3_extended (532g)', '<f8'), ('RUNX3 (414g)', '<f8'), ('PRDM1_extended (538g)', '<f8'), ('ETS1_extended (647g)', '<f8'), ('ETS1 (577g)', '<f8'), ('ZNF683_extended (75g)', '<f8'), ('IRF1_extended (662g)', '<f8'), ('IRF1 (617g)', '<f8'), ('JUN_extended (62g)', '<f8'), ('JUN (26g)', '<f8'), ('IRF7_extended (688g)', '<f8'), ('IRF7 (616g)', '<f8'), ('IRF9_extended (358g)', '<f8'), ('IRF9 (302g)', '<f8'), ('MAFB_extended (39g)', '<f8'), ('NR1H3_extended (208g)', '<f8'), ('NR1H3 (207g)', '<f8'), ('CEBPB_extended (1162g)', '<f8'), ('CEBPB (1045g)', '<f8'), ('ETS2_extended (932g)', '<f8'), ('ETS2 (688g)', '<f8'), ('YBX1_extended (212g)', '<f8'), ('SPI1_extended (1805g)', '<f8'), ('SPI1 (1756g)', '<f8'), ('ATF5 (94g)', '<f8'), ('TFEC_extended (694g)', '<f8'), ('TFEC (483g)', '<f8'), ('BHLHE41_extended (106g)', '<f8'), ('BHLHE41 (97g)', '<f8'), ('PPARG_extended (235g)', '<f8'), ('PPARG (90g)', '<f8'), ('MEF2C_extended (106g)', '<f8'), ('MEF2C (95g)', '<f8'), ('REL_extended (497g)', '<f8'), ('REL (434g)', '<f8'), ('SPIB_extended (207g)', '<f8'), ('SPIB (195g)', '<f8'), ('IRF8_extended (923g)', '<f8'), ('IRF8 (687g)', '<f8'), ('RORA (51g)', '<f8'), ('NR3C1_extended (94g)', '<f8'), ('NR3C1 (63g)', '<f8'), ('TCF7_extended (52g)', '<f8'), ('TCF7 (47g)', '<f8'), ('SF1 (234g)', '<f8'), ('GABPB1_extended (134g)', '<f8'), ('FOXP1_extended (334g)', '<f8'), ('FOXP1 (286g)', '<f8'), ('CREM_extended (507g)', '<f8'), ('CREM (442g)', '<f8'), ('FOS_extended (436g)', '<f8'), ('JUNB_extended (84g)', '<f8'), ('JUNB (35g)', '<f8'), ('IKZF2 (13g)', '<f8'), ('BATF_extended (35g)', '<f8'), ('BATF (21g)', '<f8'), ('FOXP3_extended (76g)', '<f8'), ('FOXP3 (23g)', '<f8'), ('PRDM1 (15g)', '<f8'), ('MAF_extended (119g)', '<f8'), ('MAF (22g)', '<f8'), ('UQCRB (573g)', '<f8'), ('LEF1 (20g)', 
'<f8'), ('GATA3_extended (25g)', '<f8'), ('GATA3 (18g)', '<f8'), ('IRF4_extended (49g)', '<f8'), ('IRF4 (20g)', '<f8'), ('KLF2 (32g)', '<f8'), ('CEBPD_extended (330g)', '<f8'), ('CEBPD (241g)', '<f8'), ('FOSL2 (109g)', '<f8'), ('FOSL2_extended (442g)', '<f8'), ('JUND_extended (789g)', '<f8'), ('JUND (274g)', '<f8'), ('IRF6_extended (14g)', '<f8'), ('NR2F6_extended (176g)', '<f8'), ('NR2F6 (121g)', '<f8'), ('MAFK_extended (280g)', '<f8'), ('MAFK (117g)', '<f8'), ('ELF3 (260g)', '<f8'), ('HTATIP2 (179g)', '<f8'), ('PIR (247g)', '<f8'), ('XBP1_extended (669g)', '<f8'), ('XBP1 (587g)', '<f8'), ('KLF5_extended (1521g)', '<f8'), ('KLF5 (1038g)', '<f8'), ('NR2F2_extended (36g)', '<f8'), ('NR2F2 (19g)', '<f8'), ('GLIS3_extended (87g)', '<f8'), ('MYLK (52g)', '<f8'), ('HOXA5 (16g)', '<f8'), ('FOXO1_extended (85g)', '<f8'), ('FOXO1 (72g)', '<f8'), ('ZEB1 (42g)', '<f8'), ('MSC_extended (56g)', '<f8'), ('MSC (30g)', '<f8'), ('ZNF503 (12g)', '<f8'), ('HOXB2_extended (50g)', '<f8'), ('HOXB2 (23g)', '<f8'), ('FOS (31g)', '<f8'), ('FOSB_extended (403g)', '<f8'), ('FOSB (207g)', '<f8'), ('EGR1_extended (284g)', '<f8'), ('EGR1 (199g)', '<f8'), ('ATF3_extended (333g)', '<f8'), ('ATF3 (265g)', '<f8'), ('NFKB2_extended (107g)', '<f8'), ('NFKB2 (95g)', '<f8'), ('NFKB1_extended (156g)', '<f8'), ('NFKB1 (146g)', '<f8'), ('GATA2_extended (103g)', '<f8'), ('GATA2 (99g)', '<f8'), ('RELB_extended (94g)', '<f8'), ('RELB (88g)', '<f8'), ('NKX2-1_extended (14g)', '<f8'), ('BCL3_extended (250g)', '<f8'), ('BCL3 (201g)', '<f8'), ('FOSL1_extended (360g)', '<f8'), ('FOSL1 (253g)', '<f8'), ('KLF4 (20g)', '<f8'), ('KLF10_extended (70g)', '<f8'), ('RFX2_extended (248g)', '<f8'), ('RFX2 (241g)', '<f8'), ('ETV1_extended (21g)', '<f8'), ('ETV1 (11g)', '<f8'), ('ETV5_extended (12g)', '<f8'), ('NFIX_extended (59g)', '<f8'), ('NFIX (57g)', '<f8'), ('KLF7_extended (88g)', '<f8'), ('KLF7 (42g)', '<f8'), ('FLI1_extended (111g)', '<f8'), ('FLI1 (51g)', '<f8'), ('HES1_extended (95g)', '<f8'), ('HES1 (45g)', 
'<f8'), ('TEAD1_extended (417g)', '<f8'), ('TEAD1 (413g)', '<f8'), ('ELK3_extended (186g)', '<f8'), ('ELK3 (65g)', '<f8'), ('MECOM (21g)', '<f8'), ('LUZP1 (111g)', '<f8'), ('SOX7_extended (39g)', '<f8'), ('BCL6_extended (18g)', '<f8'), ('BCL6 (16g)', '<f8'), ('NFIL3_extended (163g)', '<f8'), ('NFIL3 (47g)', '<f8'), ('CREB5_extended (139g)', '<f8'), ('CREB5 (21g)', '<f8'), ('NFE2L2_extended (253g)', '<f8'), ('HIF1A_extended (207g)', '<f8'), ('HIF1A (206g)', '<f8'), ('FOXO3_extended (191g)', '<f8'), ('FOXO3 (168g)', '<f8'), ('MYC_extended (128g)', '<f8'), ('MYC (53g)', '<f8'), ('TFAP2A (70g)', '<f8'), ('TFAP2C (32g)', '<f8'), ('HOXC9_extended (18g)', '<f8'), ('HOXC9 (14g)', '<f8'), ('SOX4_extended (320g)', '<f8'), ('SOX4 (260g)', '<f8'), ('ELF3_extended (1249g)', '<f8'), ('EHF_extended (1513g)', '<f8'), ('EHF (762g)', '<f8'), ('HES4 (19g)', '<f8'), ('HES4_extended (59g)', '<f8'), ('PBX1 (34g)', '<f8'), ('TFDP2_extended (141g)', '<f8'), ('HDAC2 (105g)', '<f8'), ('CREB3_extended (137g)', '<f8'), ('CREB3 (99g)', '<f8'), ('CREB3L2_extended (372g)', '<f8'), ('CREB3L2 (289g)', '<f8'), ('ATF4_extended (261g)', '<f8'), ('ATF4 (168g)', '<f8'), ('DDIT3_extended (23g)', '<f8'), ('DDIT3 (19g)', '<f8'), ('CEBPG_extended (40g)', '<f8'), ('CEBPG (32g)', '<f8'), ('BDP1_extended (70g)', '<f8'), ('SMARCC2_extended (91g)', '<f8'), ('SMARCC2 (90g)', '<f8'), ('RFX3_extended (391g)', '<f8'), ('RFX3 (380g)', '<f8'), ('BCLAF1_extended (109g)', '<f8'), ('YY1_extended (240g)', '<f8'), ('YY1 (188g)', '<f8'), ('ATF2_extended (80g)', '<f8'), ('ATF2 (58g)', '<f8'), ('ERF_extended (63g)', '<f8'), ('ERF (49g)', '<f8'), ('ELK4_extended (50g)', '<f8'), ('ELK4 (38g)', '<f8'), ('ELF2_extended (123g)', '<f8'), ('ELF2 (85g)', '<f8'), ('GABPA_extended (79g)', '<f8'), ('GABPA (67g)', '<f8'), ('TAF1_extended (39g)', '<f8'), ('TAF1 (20g)', '<f8'), ('KLF3_extended (102g)', '<f8'), ('KLF3 (16g)', '<f8'), ('FOXP4_extended (14g)', '<f8'), ('FOXP4 (11g)', '<f8'), ('NR2C2_extended (37g)', '<f8'), ('NR2C2 (25g)', 
'<f8'), ('CEBPA_extended (30g)', '<f8'), ('CEBPA (28g)', '<f8'), ('ZNF143_extended (22g)', '<f8'), ('ZNF143 (20g)', '<f8'), ('RUNX2_extended (19g)', '<f8'), ('RUNX2 (12g)', '<f8'), ('SOX2_extended (41g)', '<f8'), ('SOX2 (19g)', '<f8'), ('FOXJ1_extended (37g)', '<f8'), ('FOXJ1 (20g)', '<f8'), ('RARA_extended (16g)', '<f8'), ('RARA (10g)', '<f8'), ('RELA_extended (33g)', '<f8'), ('RELA (14g)', '<f8'), ('MAFF_extended (22g)', '<f8'), ('MAFF (13g)', '<f8'), ('MYBL2_extended (141g)', '<f8'), ('MYBL2 (140g)', '<f8'), ('FOXM1_extended (14g)', '<f8'), ('FOXM1 (11g)', '<f8'), ('POLE4_extended (86g)', '<f8'), ('POLE4 (83g)', '<f8'), ('TFDP1_extended (106g)', '<f8'), ('TFDP1 (93g)', '<f8'), ('NRF1_extended (37g)', '<f8'), ('NRF1 (30g)', '<f8'), ('ELK1_extended (72g)', '<f8'), ('ELK1 (63g)', '<f8'), ('SP1_extended (59g)', '<f8'), ('SP1 (31g)', '<f8'), ('RXRB_extended (11g)', '<f8'), ('RXRB (11g)', '<f8'), ('NFYC_extended (23g)', '<f8'), ('NFYC (21g)', '<f8'), ('TGIF2_extended (38g)', '<f8'), ('TGIF2 (36g)', '<f8'), ('ESRRA_extended (34g)', '<f8'), ('ESRRA (27g)', '<f8'), ('POU2F1_extended (13g)', '<f8'), ('POU2F1 (12g)', '<f8'), ('RXRA_extended (107g)', '<f8'), ('RXRA (78g)', '<f8'), ('MAFG_extended (64g)', '<f8'), ('MAFG (25g)', '<f8'), ('SRF_extended (14g)', '<f8'), ('SRF (13g)', '<f8'), ('HIVEP2_extended (40g)', '<f8'), ('HIVEP2 (28g)', '<f8'), ('ZBTB41 (12g)', '<f8'), ('SETDB1_extended (46g)', '<f8'), ('SETDB1 (41g)', '<f8'), ('ETV3_extended (50g)', '<f8'), ('ETV3 (35g)', '<f8'), ('HSF1_extended (47g)', '<f8'), ('HSF1 (14g)', '<f8'), ('E2F3_extended (42g)', '<f8'), ('E2F3 (34g)', '<f8'), ('USF2_extended (41g)', '<f8'), ('USF2 (33g)', '<f8'), ('MYB_extended (68g)', '<f8'), ('MYB (63g)', '<f8'), ('CREB1_extended (81g)', '<f8'), ('CREB1 (61g)', '<f8'), ('MAX_extended (83g)', '<f8'), ('MAX (80g)', '<f8'), ('KLF12_extended (36g)', '<f8'), ('KLF12 (31g)', '<f8'), ('RFX5_extended (14g)', '<f8'), ('RFX5 (13g)', '<f8'), ('CLOCK_extended (52g)', '<f8'), ('CLOCK (13g)', '<f8'), 
('SREBF2_extended (39g)', '<f8'), ('SREBF2 (38g)', '<f8'), ('ARNT_extended (52g)', '<f8'), ('ARNT (16g)', '<f8'), ('MLX_extended (53g)', '<f8'), ('MLX (21g)', '<f8'), ('SP2 (11g)', '<f8'), ('BACH1_extended (63g)', '<f8'), ('BACH1 (14g)', '<f8'), ('BHLHE40_extended (77g)', '<f8'), ('BHLHE40 (16g)', '<f8'), ('NFATC1_extended (13g)', '<f8'), ('NPDC1_extended (12g)', '<f8'), ('CREBZF (15g)', '<f8'), ('ZNF100 (15g)', '<f8'), ('PPARD_extended (52g)', '<f8'), ('PPARD (20g)', '<f8'), ('ELF4_extended (49g)', '<f8'), ('ELF4 (12g)', '<f8'), ('TEAD4 (33g)', '<f8'), ('NFIC (31g)', '<f8'), ('HDAC1 (10g)', '<f8'), ('CTCF (13g)', '<f8'), ('SMC3_extended (50g)', '<f8'), ('RAD21 (26g)', '<f8'), ('BRF1 (35g)', '<f8'), ('HINFP (16g)', '<f8'), ('MLXIP_extended (11g)', '<f8'), ('ATF6B_extended (61g)', '<f8'), ('ZNF260 (11g)', '<f8'), ('MEF2A (16g)', '<f8'), ('FOXK1_extended (22g)', '<f8'), ('FOXN3_extended (12g)', '<f8'), ('ASCL2_extended (16g)', '<f8'), ('MXI1_extended (23g)', '<f8'), ('SMAD3 (13g)', '<f8'), ('CENPB (29g)', '<f8'), ('NFE2L3_extended (15g)', '<f8'), ('ZNF664 (18g)', '<f8'), ('TBX3_extended (32g)', '<f8'), ('MEF2D_extended (10g)', '<f8'), ('BATF3_extended (45g)', '<f8'), ('TCF7L2 (27g)', '<f8'), ('FOXJ3_extended (12g)', '<f8'), ('ATF6_extended (47g)', '<f8'), ('ZNF254 (15g)', '<f8'), ('E2F6_extended (37g)', '<f8'), ('ZBTB33 (19g)', '<f8'), ('MYEF2_extended (32g)', '<f8'), ('THAP1_extended (19g)', '<f8'), ('IRF5_extended (27g)', '<f8'), ('STAT5A_extended (24g)', '<f8'), ('NR1D2_extended (13g)', '<f8'), ('HIVEP1_extended (11g)', '<f8'), ('TCF3 (12g)', '<f8'), ('MELK (11g)', '<f8'), ('TP53 (10g)', '<f8'), ('ETV6_extended (56g)', '<f8'), ('POLE3_extended (54g)', '<f8'), ('ZBTB7B_extended (26g)', '<f8'), ('MAZ_extended (96g)', '<f8'), ('ARNTL2_extended (33g)', '<f8'), ('KLF16_extended (36g)', '<f8'), ('FOXK2 (11g)', '<f8'), ('TFEB_extended (128g)', '<f8'), ('TFEB (92g)', '<f8'), ('POU2AF1_extended (64g)', '<f8'), ('POU2AF1 (14g)', '<f8'), ('POU2F2_extended (78g)', '<f8'), 
('POU2F2 (66g)', '<f8'), ('BCL11A_extended (207g)', '<f8'), ('BCL11A (142g)', '<f8'), ('ETV7_extended (90g)', '<f8'), ('ETV7 (63g)', '<f8'), ('STAT1_extended (1443g)', '<f8'), ('STAT1 (1212g)', '<f8'), ('IRF2_extended (196g)', '<f8'), ('IRF2 (179g)', '<f8'), ('STAT2_extended (287g)', '<f8'), ('STAT2 (234g)', '<f8')] is not allowed
For help, see http://linnarssonlab.org/loompy/format/
./Thienpont_T-cell_v4_R_fixed.loom does not appead to be a valid Loom file according to Loom spec version '2.0.1'

slinnarsson commented 5 years ago

Hi - sorry for your troubles. The latest versions of loompy try to be a bit stricter about the Loom format, in an attempt to better enforce compatibility between consumers of the format. By default, before a file is opened, it is validated against the latest Loom file specification and any deviations are reported.

I looked at the file you're trying to open, and it has at least two issues. First, strings are stored as variable-length UTF8, not HTML entity-encoded fixed-length ASCII as required by the spec. Loom requires strings to be stored as fixed-length ASCII only (but supports Unicode by using HTML entity encodings). We might consider relaxing this requirement, but that will complicate Loom readers a lot. You can store strings in HDF5 in many ways (fixed-length or variable, UTF8 or ASCII, null-terminated or not, null or space-padded, ...) but fixed-length ASCII is the one that is most broadly supported across platforms and languages.
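To illustrate what "HTML entity-encoded fixed-length ASCII" means in practice, here is a minimal sketch using only the Python standard library. The helper names `encode_fixed_ascii` and `decode_fixed_ascii` are hypothetical and are not part of loompy, and null-byte padding is an assumption made for this example; the spec remains the authority on the exact storage layout.

```python
import html


def encode_fixed_ascii(values):
    """Encode strings as equal-length ASCII byte strings.

    Non-ASCII characters become numeric HTML entities (e.g. 'ï' -> '&#239;').
    Padding with null bytes is an assumption made for this sketch.
    """
    encoded = [v.encode("ascii", "xmlcharrefreplace") for v in values]
    width = max((len(e) for e in encoded), default=0)
    return [e.ljust(width, b"\x00") for e in encoded]


def decode_fixed_ascii(records):
    """Strip the padding and turn HTML entities back into Unicode."""
    return [html.unescape(r.rstrip(b"\x00").decode("ascii")) for r in records]


names = ["TBX21", "naïve T cell", "γδ"]
stored = encode_fixed_ascii(names)
restored = decode_fixed_ascii(stored)
print(stored)
print(restored)
```

One limitation of this simple scheme: a string that already contains literal entity text such as `&amp;` will not round-trip unchanged, because decoding unescapes it.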

The second issue is that the file contains HDF5 enum datatypes (e.g. row attribute Regulons), which are not supported by Loom (because they are not supported by h5py). When such attributes are read, they come out as arrays of integers, and the string representation of the enumerated datatype is unfortunately lost. Interestingly, the error message reveals that h5py has actually correctly parsed out all the enum

You can turn off validation by adding the validate=False flag to the loompy.connect(...) call. This will bypass strict validation, but you may still encounter errors when reading non-standard Loom files. I've now added a fix that allows variable-length UTF8 strings to be read (I had already added support for variable-length ASCII, as sometimes produced by R), which I will push to PyPI shortly.
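As a rough sketch of the workaround, the call from the original report could be wrapped like this. The `read_shape` helper is hypothetical and exists only for illustration; `validate=False` is the flag described above, and opening with `mode="r"` is an assumption that read-only access is enough here.

```python
def read_shape(path, validate=False):
    """Return the (rows, columns) shape of a Loom file.

    With validate=False, loompy skips the strict spec check before opening.
    """
    # Imported lazily so this helper can be defined without loompy installed.
    import loompy

    with loompy.connect(path, mode="r", validate=validate) as ds:
        return ds.shape


# Example (requires the file from the report to be present locally):
# print(read_shape("./Thienpont_T-cell_v4_R_fixed.loom"))
```

Attributes that deviate from the spec may still raise errors when accessed, so reading them individually is a reasonable precaution.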

orrzor commented 5 years ago

Hi Sten, Thank you so much! I pulled your latest commit and used the validate=False flag and can now read in the loom file. I really appreciate your quick fix and support for the loom format. Thanks also for your clear explanation of why I was getting this problem.

Best, Orr