jkobject / scDataLoader

a dataloader to work with large single cell datasets from lamindb
https://www.jkobject.com/scDataLoader/
GNU General Public License v3.0

Trouble loading data #6

Open GuyAglionby opened 2 weeks ago

GuyAglionby commented 2 weeks ago

Describe the bug + reproduction steps

I'm trying to populate a database using the command in __main__.py. I'm using scDataLoader 1.1.3 and lamindb 0.76.3. A similar error occurs using scDataLoader 1.1.4, lamindb 0.76.9, and the 2024-07-01 cellxgene release.

I've run:

lamin login my_lamin_username
scdataloader --instance="laminlabs/cellxgene" --name="cellxgene-census" --version="2023-12-15" --description="preprocessed for scprint" --new_name="scprint main" --start_at=39

The logs are as follows

→ connected lamindb: laminlabs/cellxgene
! no run & transform get linked, consider calling ln.context.track()
using the dataset  Collection(uid='dMyEX3NTfKOEYXyMu591', version='2023-12-15', is_latest=False, name='cellxgene-census', hash='0NB32iVKG5ttaW5XILvG', visibility=1, created_by_id=1, transform_id=19, run_id=24, updated_at='2024-01-30 09:09:49 UTC')  of size  1113
! no run & transform get linked, consider calling ln.context.track()
0
Artifact(uid='wYiUe9hn4TJijpoXVMkL', version='2023-12-15', is_latest=False, description='All major cell types in adult human retina', key='cell-census/2023-12-15/h5ads/0129dbd9-a7d3-4f6b-96b9-1da155a93748.h5ad', suffix='.h5ad', size=18961419177, hash='GqiLQmtIygK1IwnR7noOyA', n_observations=244474, _hash_type='md5-n', _accessor='AnnData', visibility=1, _key_is_virtual=False, created_by_id=1, storage_id=2, transform_id=16, run_id=22, updated_at='2024-01-29 07:46:01 UTC')
! `.backed()` is deprecated, use `.open()`!'
<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/scdataloader/preprocess.py:179: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  adata.obs["nnz"] = np.array(np.sum(adata.X != 0, axis=1).flatten())[0]
<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/scdataloader/preprocess.py:241: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  data_utils.validate(adata, organism=adata.obs.organism_ontology_term_id[0])
AnnDataAccessor object with n_obs × n_vars = 244474 × 30933
  constructed for the AnnData object 0129dbd9-a7d3-4f6b-96b9-1da155a93748.h5ad
    obs: ['_index', 'assay', 'assay_ontology_term_id', 'author_cell_type', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'group', 'is_primary_data', 'library_uuid', 'mapped_reference_annotation', 'n_counts', 'organism', 'organism_ontology_term_id', 'sample_preservation_method', 'sample_uuid', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_enriched_cell_types', 'suspension_enrichment_factors', 'suspension_type', 'suspension_uuid', 'tissue', 'tissue_ontology_term_id']
    obsm: ['LVG embedding', 'X_umap', 'cluster memberships', 'embedding', 'precluster denoised', 'precluster embedding']
    raw: ['X', 'var', 'varm']
    uns: ['default_embedding', 'schema_version', 'title']
    var: ['_index', 'feature_biotype', 'feature_is_filtered', 'feature_name', 'feature_reference']
dividing the dataset as it is too large: 18Gb
num blocks  4
AnnData object with n_obs × n_vars = 67500 × 30933
    obs: 'donor_id', 'self_reported_ethnicity_ontology_term_id', 'organism_ontology_term_id', 'sample_uuid', 'sample_preservation_method', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'suspension_enriched_cell_types', 'suspension_enrichment_factors', 'suspension_uuid', 'suspension_type', 'library_uuid', 'assay_ontology_term_id', 'mapped_reference_annotation', 'is_primary_data', 'cell_type_ontology_term_id', 'author_cell_type', 'disease_ontology_term_id', 'sex_ontology_term_id', 'group', 'n_counts', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype'
    uns: 'default_embedding', 'schema_version', 'title'
    obsm: 'LVG embedding', 'X_umap', 'cluster memberships', 'embedding', 'precluster denoised', 'precluster embedding'
Dropping layers:  KeysView(Layers with keys: )
checking raw counts
removed 0 non primary cells, 67500 renamining
filtered out 0 cells, 67500 renamining
Removed 0 genes.
validating
startin QC
Seeing 8791 outliers (13.02% of total dataset):
done
! data is an AnnData, please use .from_anndata()
! `is_new_version_of` will be removed soon, please use `revises`
! didn't pass the latest version in `revises`, retrieved it: Artifact(uid='wYiUe9hn4TJijpoX90Mr', version='2024-07-01', is_latest=True, description='All major cell types in adult human retina', key='cell-census/2024-07-01/h5ads/0129dbd9-a7d3-4f6b-96b9-1da155a93748.h5ad', suffix='.h5ad', type='dataset', size=14638089351, hash='bXxaz_quQ4mIbVlarLZZKQ', n_observations=244474, _hash_type='md5-n', _accessor='AnnData', visibility=1, _key_is_virtual=False, created_by_id=1, storage_id=2, transform_id=22, run_id=27, updated_at='2024-07-12 12:40:43 UTC')
! no run & transform get linked, consider calling ln.context.track()
<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/anndata/_core/anndata.py:1209: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
  df[key] = c
<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/anndata/_core/anndata.py:1209: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
  df[key] = c
<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/anndata/_core/anndata.py:1209: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
  df[key] = c
<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/anndata/_core/anndata.py:1209: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
  df[key] = c
<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/anndata/_core/anndata.py:1209: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
  df[key] = c
<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/anndata/_core/anndata.py:1209: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
  df[key] = c
Traceback (most recent call last):
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/django/db/models/base.py", line 517, in __init__
    rel_obj = kwargs.pop(field.name)
KeyError: 'created_by'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/django/db/models/base.py", line 522, in __init__
    val = kwargs.pop(field.attname)
KeyError: 'created_by_id'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/lnschema_core/users.py", line 19, in query_user_id
    user_id = User.objects.get(uid=settings.user.uid).id
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/django/db/models/manager.py", line 87, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/django/db/models/query.py", line 649, in get
    raise self.model.DoesNotExist(
lnschema_core.models.User.DoesNotExist: User matching query does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/bin/scdataloader", line 8, in <module>
    sys.exit(main())
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/scdataloader/__main__.py", line 198, in main
    preprocessor(
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/scdataloader/preprocess.py", line 497, in __call__
    raise e
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/scdataloader/preprocess.py", line 458, in __call__
    myfile = ln.Artifact(
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/lamindb/_artifact.py", line 652, in __init__
    super(Artifact, artifact).__init__(**kwargs)
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/lamindb/_record.py", line 116, in __init__
    super(Record, record).__init__(**kwargs)
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/lnschema_core/models.py", line 103, in __init__
    super().__init__(*args, **kwargs)
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/lnschema_core/models.py", line 182, in __init__
    super().__init__(*args, **kwargs)
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/lnschema_core/models.py", line 212, in __init__
    super().__init__(*args, **kwargs)
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/django/db/models/base.py", line 524, in __init__
    val = field.get_default()
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/django/db/models/fields/related.py", line 1127, in get_default
    field_default = super().get_default()
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/django/db/models/fields/__init__.py", line 1016, in get_default
    return self._get_default()
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/lnschema_core/users.py", line 27, in current_user_id
    user_id_cache[settings.instance.slug] = query_user_id()
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/lnschema_core/users.py", line 22, in query_user_id
    user_id = User.objects.get(uid=settings.user.uid).id
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/django/db/models/manager.py", line 87, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "<snip-venv-dir>/scprint-H02B7vuc-py3.10/lib/python3.10/site-packages/django/db/models/query.py", line 649, in get
    raise self.model.DoesNotExist(
lnschema_core.models.User.DoesNotExist: User matching query does not exist.

I'm assuming that this is because my username isn't associated with the cellxgene collection on lamin

>>> ln.User.df()
         uid     handle            name                       updated_at
id
7     <snip>  testuser1      Test User1 2024-09-25 11:35:18.103302+00:00
6     <snip>    Koncopd  Sergei Rybakov 2024-09-19 10:17:12.841885+00:00
1     <snip>  sunnyosun       Sunny Sun 2023-12-13 16:23:44.195541+00:00
2     <snip>  falexwolf       Alex Wolf 2023-10-19 11:14:17.050814+00:00
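To make the suspected failure path concrete, here is a minimal sketch of what the traceback suggests is happening: the default for `created_by` looks up the logged-in user's uid in the instance's User table and raises when no row matches. It uses plain Python stand-ins rather than lamindb or Django. The function names mirror the ones in the traceback, but the bodies are simplified guesses, and the uids and the `UserDoesNotExist` class are hypothetical placeholders.

```python
# Stand-in for the instance's User table, keyed by uid -> id.
# The ids come from the ln.User.df() output above; the uids are
# hypothetical placeholders because the real ones are snipped.
REGISTERED_USERS = {
    "uid-of-testuser1": 7,
    "uid-of-Koncopd": 6,
    "uid-of-sunnyosun": 1,
    "uid-of-falexwolf": 2,
}


class UserDoesNotExist(LookupError):
    """Hypothetical stand-in for lnschema_core.models.User.DoesNotExist."""


# Simplified version of the per-instance cache seen in the traceback.
user_id_cache = {}


def query_user_id(current_uid, users):
    """Return the row id for current_uid, or raise if it is not registered."""
    try:
        return users[current_uid]
    except KeyError:
        raise UserDoesNotExist("User matching query does not exist.") from None


def current_user_id(instance_slug, current_uid, users):
    """Look up (and cache) the current user's id for one instance."""
    if instance_slug not in user_id_cache:
        user_id_cache[instance_slug] = query_user_id(current_uid, users)
    return user_id_cache[instance_slug]


# A user who is logged in but has no row in this instance's User table.
try:
    current_user_id("laminlabs/cellxgene", "uid-of-my_lamin_username", REGISTERED_USERS)
except UserDoesNotExist as err:
    outcome = f"failed: {err}"
else:
    outcome = "ok"
print(outcome)
```

If this reading is right, any record whose `created_by` default is evaluated against an instance where the logged-in user has no row would fail in the same way, independent of the other arguments passed to `ln.Artifact`.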

The relevant line of code in preprocess.py is:

myfile = ln.Artifact(
    block,
    is_new_version_of=file,
    description=description,
    version=str(version) + "_s" + str(i),
)
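For context, a rough sketch of how the per-block version labels in that call might come out for this dataset. It assumes fixed-size blocks of 67,500 cells (the size of the first block in the log) and a zero-based block index `i`. The `block_bounds` helper and the value of `version` are hypothetical, not taken from scDataLoader.

```python
import math


def block_bounds(n_obs, block_size):
    """Return (start, stop) row ranges that cover n_obs rows in fixed-size blocks."""
    n_blocks = math.ceil(n_obs / block_size)
    return [
        (i * block_size, min((i + 1) * block_size, n_obs))
        for i in range(n_blocks)
    ]


version = "2"  # hypothetical value; the real one is whatever the preprocessor was given
bounds = block_bounds(n_obs=244474, block_size=67500)

# Same string construction as in the ln.Artifact(...) call above.
labels = [str(version) + "_s" + str(i) for i in range(len(bounds))]

for (start, stop), label in zip(bounds, labels):
    print(label, start, stop, stop - start)
```

Under these assumptions the 244,474 cells fall into four blocks, which agrees with the `num blocks  4` line in the log.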

edit: I thought the problem occurred while processing is_new_version_of, but I tried commenting it out and I still get the "user not found" error.

(Also, I'm running this to try to construct the data needed to re-train scPRINT. Hopefully not barking up the wrong tree here but please let me know if so!)

Thanks in advance for any advice

jkobject commented 1 week ago

Hello Guy,

Sorry about that. Were you able to run scPRINT at least on a small test example?

Did the lamin login step work? Lamin often asks for a password at first login.

Be careful about start_at=39 too. It means scdataloader will start at the 39th dataset instead of the first (this is useful when you restart a failed run).
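A minimal sketch of the skipping behaviour described above, assuming `start_at` is a zero-based offset into the collection (whether scdataloader counts from 0 or 1 is an assumption here). `datasets_to_process` is a hypothetical helper, not part of scdataloader.

```python
def datasets_to_process(datasets, start_at=0):
    """Yield (index, dataset) pairs, skipping everything before start_at."""
    for index, dataset in enumerate(datasets):
        if index < start_at:
            continue  # already handled in an earlier, interrupted run
        yield index, dataset


# The collection in the log above reports a size of 1113.
collection = [f"dataset_{n}" for n in range(1113)]
remaining = list(datasets_to_process(collection, start_at=39))
print(len(remaining), remaining[0])
```

With `start_at=39` the first 39 datasets are never touched, so a fresh run that is meant to cover the whole collection should leave the flag out.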

jkobject commented 1 week ago

> I'm assuming that this is because my username isn't associated with the cellxgene collection on lamin

I have never had this issue on my end, and I am not part of the lamindb team, so I don't think this is the reason.

jkobject commented 1 week ago

I am thinking about one thing... you might need to create a lamin instance first: https://docs.lamin.ai/introduction#quickstart

# store artifacts in local directory `./lamin-intro`
!lamin init --storage ./lamin-intro --schema bionty
# (optional) make Django's unnecessary functionality private for clean auto-complete
!lamin settings set private-django-api true