theislab / sfaira

data and model repository for single-cell data
https://sfaira.readthedocs.io
BSD 3-Clause "New" or "Revised" License
133 stars 13 forks source link

sfaira finalize-dataloader broken after cellxgene 3.0 schema introduction #700

Closed le-ander closed 1 year ago

le-ander commented 1 year ago

The CLI command finalize-dataloader seems to be broken on dev. this seems to be related to the recent introduction of the new cellxgene schema. this looks like it's simple to fix but I don't fully grasp the cellxgene schema handling yet, so @davidsebfischer can you take a look?

│ /opt/python/bin/sfaira:8 in <module>                                                                 │
│                                                                                                      │
│   5 from sfaira.cli import main                                                                      │
│   6 if __name__ == '__main__':                                                                       │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                             │
│ ❱ 8 │   sys.exit(main())                                                                             │
│   9                                                                                                  │
│ /lustre/groups/ml01/code/katelyn.li/sfaira/sfaira/cli.py:74 in main                                  │
│                                                                                                      │
│    71 │   # Is the latest sfaira version installed? Upgrade if not!                                  │
│    72 │   if not UpgradeCommand.check_sfaira_latest():                                               │
│    73 │   │   print('[bold blue]Run [green]sfaira upgrade [blue]to get the latest version.')         │
│ ❱  74 │   sfaira_cli()                                                                               │
│    75                                                                                                │
│    76                                                                                                │
│    77 @click.group()                                                                                 │
│                                                                                                      │
│ /opt/python/lib/python3.8/site-packages/click/core.py:1128 in __call__                               │
│                                                                                                      │
│   1125 │                                                                                             │
│   1126 │   def __call__(self, *args: t.Any, **kwargs: t.Any) -> t.Any:                               │
│   1127 │   │   """Alias for :meth:`main`."""                                                         │
│ ❱ 1128 │   │   return self.main(*args, **kwargs)                                                     │
│   1129                                                                                               │
│   1130                                                                                               │
│   1131 class Command(BaseCommand):                                                                   │
│                                                                                                      │
│ /opt/python/lib/python3.8/site-packages/click/core.py:1053 in main                                   │
│                                                                                                      │
│   1050 │   │   try:                                                                                  │
│   1051 │   │   │   try:                                                                              │
│   1052 │   │   │   │   with self.make_context(prog_name, args, **extra) as ctx:                      │
│ ❱ 1053 │   │   │   │   │   rv = self.invoke(ctx)                                                     │
│   1054 │   │   │   │   │   if not standalone_mode:                                                   │
│   1055 │   │   │   │   │   │   return rv                                                             │
│   1056 │   │   │   │   │   # it's not safe to `ctx.exit(rv)` here!                                   │
│                                                                                                      │
│ /opt/python/lib/python3.8/site-packages/click/core.py:1659 in invoke                                 │
│                                                                                                      │
│   1656 │   │   │   │   super().invoke(ctx)                                                           │
│   1657 │   │   │   │   sub_ctx = cmd.make_context(cmd_name, args, parent=ctx)                        │
│   1658 │   │   │   │   with sub_ctx:                                                                 │
│ ❱ 1659 │   │   │   │   │   return _process_result(sub_ctx.command.invoke(sub_ctx))                   │
│   1660 │   │                                                                                         │
│   1661 │   │   # In chain mode we create the contexts step by step, but after the                    │
│   1662 │   │   # base command has been invoked.  Because at that point we do not                     │
│                                                                                                      │
│ /opt/python/lib/python3.8/site-packages/click/core.py:1395 in invoke                                 │
│                                                                                                      │
│   1392 │   │   │   echo(style(message, fg="red"), err=True)                                          │
│   1393 │   │                                                                                         │
│   1394 │   │   if self.callback is not None:                                                         │
│ ❱ 1395 │   │   │   return ctx.invoke(self.callback, **ctx.params)                                    │
│   1396 │                                                                                             │
│   1397 │   def shell_complete(self, ctx: Context, incomplete: str) -> t.List["CompletionItem"]:      │
│   1398 │   │   """Return a list of completions for the incomplete value. Looks                       │
│                                                                                                      │
│ /opt/python/lib/python3.8/site-packages/click/core.py:754 in invoke                                  │
│                                                                                                      │
│    751 │   │                                                                                         │
│    752 │   │   with augment_usage_errors(__self):                                                    │
│    753 │   │   │   with ctx:                                                                         │
│ ❱  754 │   │   │   │   return __callback(*args, **kwargs)                                            │
│    755 │                                                                                             │
│    756 │   def forward(                                                                              │
│    757 │   │   __self, __cmd: "Command", *args: t.Any, **kwargs: t.Any  # noqa: B902                 │
│                                                                                                      │
│ /lustre/groups/ml01/code/katelyn.li/sfaira/sfaira/cli.py:228 in finalize_dataloader                  │
│                                                                                                      │
│   225 │   Formats .tsvs and runs a full data loader test.                                            │
│   226 │   """                                                                                        │
│   227 │   path_loader, path_data, _ = set_paths(loader=path_loader, data=path_data)                  │
│ ❱ 228 │   _full_test(path_loader=path_loader, path_data=path_data, doi=doi, schema=schema,           │
│       clean_tsvs=True, in_phase_3=True)                                                              │
│   229                                                                                                │
│   230                                                                                                │
│   231 @sfaira_cli.command()                                                                          │
│                                                                                                      │
│ /lustre/groups/ml01/code/katelyn.li/sfaira/sfaira/cli.py:188 in _full_test                           │
│                                                                                                      │
│   185 │   │   dataloader_validator = DataloaderValidator(path_loader=path_loader, doi=doi,           │
│       schema=schema)                                                                                 │
│   186 │   │   dataloader_validator.validate()                                                        │
│   187 │   │   dataloader_tester = DataloaderTester(path_loader, path_data, doi)                      │
│ ❱ 188 │   │   dataloader_tester.test_dataloader(clean_tsvs=clean_tsvs, in_phase_3=in_phase_3)        │
│   189 │   else:                                                                                      │
│   190 │   │   print('[bold red]The supplied DOI is malformed!')  # noqa: W605                        │
│   191                                                                                                │
│                                                                                                      │
│ /lustre/groups/ml01/code/katelyn.li/sfaira/sfaira/commands/test_dataloader.py:35 in test_dataloader  │
│                                                                                                      │
│    32 │   │   Runs a predefined unit test on a given dataloader.                                     │
│    33 │   │   """                                                                                    │
│    34 │   │   self.doi_sfaira_repr = clean_doi(self.doi)                                             │
│ ❱  35 │   │   self._test_dataloader(clean_tsvs=clean_tsvs, in_phase_3=in_phase_3)                    │
│    36 │                                                                                              │
│    37 │   def _get_ds(self):                                                                         │
│    38 │   │   return get_ds(doi_sfaira_repr=self.doi_sfaira_repr, path_data=self.path_data,          │
│       path_loader=self.path_loader)                                                                  │
│                                                                                                      │
│ /lustre/groups/ml01/code/katelyn.li/sfaira/sfaira/commands/test_dataloader.py:129 in                 │
│ _test_dataloader                                                                                     │
│                                                                                                      │
│   126 │   │   │   │   │   │   print(f'[bold red]Did not find column {val} for {x} in data set        │
│       {k}, found: '                                                                                  │
│   127 │   │   │   │   │   │   │     f'{v.adata.var.columns}.')                                       │
│   128 │   │   │   │   │   │   sys.exit()                                                             │
│ ❱ 129 │   │   │   v.streamline_var(match_to_release=None, schema="cellxgene:" + "2.0.0")             │
│   130 │   │   │   signal_proc = np.asarray(v.adata.X.sum()).sum()                                    │
│   131 │   │   │   if signal_proc < 0.01 * signal_raw and v.feature_type != "peak":                   │
│   132 │   │   │   │   print('[bold red]Mapping your feature space to a reference annotation          │
│       resulted in a heavy loss of '                                                                  │
│                                                                                                      │
│ /lustre/groups/ml01/code/katelyn.li/sfaira/sfaira/data/dataloaders/base/dataset.py:283 in            │
│ streamline_var                                                                                       │
│                                                                                                      │
│   280 │   │   :param verbose: Report feature transformation statistics.                              │
│   281 │   │   """                                                                                    │
│   282 │   │   self._assert_loaded()                                                                  │
│ ❱ 283 │   │   adata_target_ids = get_target_ids(schema=schema)                                       │
│   284 │   │   if isinstance(match_to_release, dict):                                                 │
│   285 │   │   │   match_to_release = match_to_release[self.organism]                                 │
│   286 │   │   if subset_layer is not None:                                                           │
│                                                                                                      │
│ /lustre/groups/ml01/code/katelyn.li/sfaira/sfaira/data/dataloaders/base/dataset.py:35 in             │
│ get_target_ids                                                                                       │
│                                                                                                      │
│    32 │   │   │   v = schema.split(":")[1]                                                           │
│    33 │   │   else:                                                                                  │
│    34 │   │   │   v = DEFAULT_SCHEMA                                                                 │
│ ❱  35 │   │   target_ids = AdataIdsCellxgeneVersions[v]                                              │
│    36 │   else:                                                                                      │
│    37 │   │   raise ValueError(f"did not recognize schema {schema}")                                 │
│    38 │   return target_ids                                                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: '2.0.0'
davidsebfischer commented 1 year ago

Could you please try the fix in the linked pull request? Could not run it but pretty sure it addresses this issue.

le-ander commented 1 year ago

Thanks David! This fixes the error, however a new one appeared (this is still from finalize-dataloader). Can you explain what this warning means? ALso not sure we should really raise this warning as the code aborts after raise?

│                                                                                                      │
│ /lustre/groups/ml01/code/katelyn.li/sfaira/sfaira/data/dataloaders/export_adaptors/cellxgene.py:323  │
│ in cellxgene_export_adaptor_3_0_0                                                                    │
│                                                                                                      │
│   320 │   # Suspension is a custom ontology and does not require an ID. The column itself must       │
│       be categorical.                                                                                │
│   321 │   # Add this annotation here if it was not set before.                                       │
│   322 │   if np.all(adata.obs[adata_ids.suspension_type].values ==                                   │
│       adata_ids.unknown_metadata_identifier):                                                        │
│ ❱ 323 │   │   adata = match_supsension_and_efo(adata=adata, adata_ids=adata_ids,                     │
│       efo_ontology=get_ontology(k="assay_sc"),                                                       │
│   324 │   │   │   │   │   │   │   │   │   │    valid_combinations=VALID_EFO_SUS["3_0_0"])            │
│   325 │   adata.obs[adata_ids.suspension_type] =                                                     │
│       pd.Categorical(adata.obs[adata_ids.suspension_type].values.tolist())                           │
│   326 │   del adata.obs[adata_ids.suspension_type + adata_ids.onto_id_suffix]                        │
│                                                                                                      │
│ /lustre/groups/ml01/code/katelyn.li/sfaira/sfaira/data/dataloaders/export_adaptors/cellxgene.py:239  │
│ in match_supsension_and_efo                                                                          │
│                                                                                                      │
│   236 │   │   for k, v in valid_combinations.items():                                                │
│   237 │   │   │   if efo_ontology.is_a(is_=efo, a_=k):                                               │
│   238 │   │   │   │   if efo in efo_map.keys():                                                      │
│ ❱ 239 │   │   │   │   │   raise Warning(f"found multiple suspension matches for EFO {efo}.")         │
│   240 │   │   │   │   efo_map[efo] = v[0]                                                            │
│   241 │   # Picks first match, ie prioritises cell if both cell and nucleus are possible.            │
│   242 │   adata.obs[adata_ids.suspension_type] = [                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────╯
Warning: found multiple suspension matches for EFO EFO:0009922.
davidsebfischer commented 1 year ago

I did not realize this warning would abort the CLI, yes in that case we can move this to a print statement. In brief, suspension type is a new required meta data that can be automatically added but this is ambiguous in a number of cases, this was triggered here, we then randomly default on choice if it was not separately supplied by the user. Could you change the warning to a print statement and then continue? If you are planning on using suspension further down it would make sense to manually set it here, too.

le-ander commented 1 year ago

Will try, thank you! what are the options for suspension type then? (this is 10x 3' v3)

davidsebfischer commented 1 year ago

"cell", "na", "nucleus", see also https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md#suspension_type

le-ander commented 1 year ago

Fixed :+1: