laminlabs / lamindb

A data framework for biology.
https://docs.lamin.ai
Apache License 2.0

error when trying to save a file #1349

Closed jkobject closed 4 months ago

jkobject commented 10 months ago

Getting this error when trying to save a file

IntegrityError: FOREIGN KEY constraint failed

It seems I initially had 2 processes saving files when I got this error.

I have since stopped one of them, but I still get it.

jkobject commented 10 months ago

The path seems to already exist in "/home/ml4ig1/.cache/lamindb/".

@falexwolf would you know how to solve this?

jkobject commented 10 months ago

Also, sometimes I would like to directly access an h5ad file in the cache, but sc.read_h5ad doesn't seem to work on these h5ads...

Koncopd commented 10 months ago

Are you sure you didn't upload on the files? What is the full error text?

falexwolf commented 10 months ago

Yes, we need the full traceback, Jeremie.

sc.read_h5ad should work on the cached h5ads, they are unchanged.

jkobject commented 10 months ago

Yes, I figure I did something wrong. I only get it for about a dozen files. Here is the full stack trace:

IntegrityError                            Traceback (most recent call last)
File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/django/db/utils.py:98, in DatabaseErrorWrapper.__call__.<locals>.inner(*args, **kwargs)
     97 with self:
---> 98     return func(*args, **kwargs)

IntegrityError: FOREIGN KEY constraint failed

The above exception was the direct cause of the following exception:

IntegrityError                            Traceback (most recent call last)
Cell In[58], line 1
----> 1 preprocessed_dataset = do_preprocess(cx_dataset, start_at=104)

File ~/Documents code/scPRINT/scprint/dataset/preprocess.py:177, in Preprocessor.__call__(self, data, name, description, start_at)
    170             print("the old file is already in the local")
    171             myfile = ln.File(
    172                 adata,
    173                 is_new_version_of=ln.File.filter(uid=file.uid)[0],
    174                 description="preprocessed by scprint",
    175             )
--> 177     myfile.save()
    178     files.append(myfile)
    179 dataset = ln.Dataset(files, name=name, description=description)

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/lamindb/_file.py:922, in save(self, *args, **kwargs)
    921 def save(self, *args, **kwargs) -> None:
--> 922     self._save_skip_storage(*args, **kwargs)
    923     from lamindb._save import check_and_attempt_clearing, check_and_attempt_upload
    925     exception = check_and_attempt_upload(self)

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/lamindb/_file.py:936, in _save_skip_storage(file, *args, **kwargs)
    934 def _save_skip_storage(file, *args, **kwargs) -> None:
    935     save_feature_sets(file)
--> 936     super(File, file).save(*args, **kwargs)
    937     save_feature_set_links(file)

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/lamindb/_registry.py:465, in save(self, *args, **kwargs)
    463     init_self_from_db(self, result)
    464 else:
--> 465     super(Registry, self).save(*args, **kwargs)
    466 if db is not None and db != "default":
    467     if hasattr(self, "labels"):

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/django/db/models/base.py:814, in Model.save(self, force_insert, force_update, using, update_fields)
    811     if loaded_fields:
    812         update_fields = frozenset(loaded_fields)
--> 814 self.save_base(
    815     using=using,
    816     force_insert=force_insert,
    817     force_update=force_update,
    818     update_fields=update_fields,
    819 )

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/django/db/models/base.py:877, in Model.save_base(self, raw, force_insert, force_update, using, update_fields)
    875     if not raw:
    876         parent_inserted = self._save_parents(cls, using, update_fields)
--> 877     updated = self._save_table(
    878         raw,
    879         cls,
    880         force_insert or parent_inserted,
    881         force_update,
    882         using,
    883         update_fields,
    884     )
    885 # Store the database on which the object was saved
    886 self._state.db = using

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/django/db/models/base.py:1020, in Model._save_table(self, raw, cls, force_insert, force_update, using, update_fields)
   1017     fields = [f for f in fields if f is not meta.auto_field]
   1019 returning_fields = meta.db_returning_fields
-> 1020 results = self._do_insert(
   1021     cls._base_manager, using, fields, returning_fields, raw
   1022 )
   1023 if results:
   1024     for value, field in zip(results[0], returning_fields):

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/django/db/models/base.py:1061, in Model._do_insert(self, manager, using, fields, returning_fields, raw)
   1056 def _do_insert(self, manager, using, fields, returning_fields, raw):
   1057     """
   1058     Do an INSERT. If returning_fields is defined then this method should
   1059     return the newly created data for the model.
   1060     """
-> 1061     return manager._insert(
   1062         [self],
   1063         fields=fields,
   1064         returning_fields=returning_fields,
   1065         using=using,
   1066         raw=raw,
   1067     )

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/django/db/models/manager.py:87, in BaseManager._get_queryset_methods.<locals>.create_method.<locals>.manager_method(self, *args, **kwargs)
     85 @wraps(method)
     86 def manager_method(self, *args, **kwargs):
---> 87     return getattr(self.get_queryset(), name)(*args, **kwargs)

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/django/db/models/query.py:1805, in QuerySet._insert(self, objs, fields, returning_fields, raw, using, on_conflict, update_fields, unique_fields)
   1798 query = sql.InsertQuery(
   1799     self.model,
   1800     on_conflict=on_conflict,
   1801     update_fields=update_fields,
   1802     unique_fields=unique_fields,
   1803 )
   1804 query.insert_values(fields, objs, raw=raw)
-> 1805 return query.get_compiler(using=using).execute_sql(returning_fields)

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/django/db/models/sql/compiler.py:1833, in SQLInsertCompiler.execute_sql(self, returning_fields)
   1830 elif self.connection.features.can_return_columns_from_insert:
   1831     assert len(self.query.objs) == 1
   1832     rows = [
-> 1833         self.connection.ops.fetch_returned_insert_columns(
   1834             cursor,
   1835             self.returning_params,
   1836         )
   1837     ]
   1838 else:
   1839     rows = [
   1840         (
   1841             self.connection.ops.last_insert_id(
   (...)
   1846         )
   1847     ]

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/django/db/backends/base/operations.py:213, in BaseDatabaseOperations.fetch_returned_insert_columns(self, cursor, returning_params)
    208 def fetch_returned_insert_columns(self, cursor, returning_params):
    209     """
    210     Given a cursor object that has just performed an INSERT...RETURNING
    211     statement into a table, return the newly created data.
    212     """
--> 213     return cursor.fetchone()

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/django/db/utils.py:97, in DatabaseErrorWrapper.__call__.<locals>.inner(*args, **kwargs)
     96 def inner(*args, **kwargs):
---> 97     with self:
     98         return func(*args, **kwargs)

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/django/db/utils.py:91, in DatabaseErrorWrapper.__exit__(self, exc_type, exc_value, traceback)
     89 if dj_exc_type not in (DataError, IntegrityError):
     90     self.wrapper.errors_occurred = True
---> 91 raise dj_exc_value.with_traceback(traceback) from exc_value

File ~/miniconda3/envs/scprint/lib/python3.10/site-packages/django/db/utils.py:98, in DatabaseErrorWrapper.__call__.<locals>.inner(*args, **kwargs)
     96 def inner(*args, **kwargs):
     97     with self:
---> 98         return func(*args, **kwargs)

IntegrityError: FOREIGN KEY constraint failed

jkobject commented 10 months ago

The files are in .cache but not in my instance folder (INSTANCENAME/.lamindb/files.h5ad).

falexwolf commented 10 months ago

So, you'll get a foreign key error if you haven't yet saved a dependent record. Because it errors directly on file save, it has to be a storage, a user, a transform, or a run. The File record doesn't depend on anything else. 🤔
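
To make the mechanism above concrete, here is a minimal sketch that reproduces the same error message with Python's built-in sqlite3 module. The `run` and `file` tables are hypothetical stand-ins chosen for illustration and are not lamindb's actual schema; the point is only that inserting a row whose foreign key points at an unsaved parent fails, and that saving the parent first makes the same insert succeed.

```python
import sqlite3


def demo_foreign_key_failure() -> str:
    """Insert a child row before its parent exists and return the error text."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    # Hypothetical tables: a "file" row references a "run" row.
    conn.execute("CREATE TABLE run (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE file ("
        "id INTEGER PRIMARY KEY, "
        "run_id INTEGER REFERENCES run(id))"
    )
    try:
        # The parent run 42 has not been saved yet, so this insert is rejected.
        conn.execute("INSERT INTO file (id, run_id) VALUES (1, 42)")
    except sqlite3.IntegrityError as exc:
        message = str(exc)
    else:
        message = "insert succeeded"
    # Saving the dependent (parent) record first makes the same insert valid.
    conn.execute("INSERT INTO run (id) VALUES (42)")
    conn.execute("INSERT INTO file (id, run_id) VALUES (1, 42)")
    conn.close()
    return message


print(demo_foreign_key_failure())
```

In the traceback above, the failing statement is the INSERT for the File record, which is consistent with one of its referenced records not being present in the target database.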

Koncopd commented 10 months ago

I would say that it is not safe to save the same file from different processes. What probably happened is that several files were corrupted when they were written to the same cache from memory.

jkobject commented 10 months ago

I am getting files from cellxgene's instance and processing them, then creating a file record and saving them locally on my instance. I had hoped to do it in parallel, giving different chunks to each process, but it seems that dataset.file.all() returns a list with a different order in different processes...

If it were corruption, I would have expected only one file to be corrupted, but so far I have around 5 problematic files: after I process them, I call file.save() and it raises this error...

But it does not happen for all my files. I restarted a run and, for example, files 1, 2, 3 gave an issue, 4 and 5 worked, then 6 gave an issue. Now I am doing 7. The files that gave an issue are in ~/.cache/lamin, but like all lamin-saved h5ads, I cannot open them with scanpy...

Koncopd commented 10 months ago

@falexwolf this looks like a problem with inter-instance transfer.

falexwolf commented 10 months ago

dataset.file.all()

This returns a QuerySet, which isn't ordered.
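
Since the query result has no guaranteed order, splitting work by position only gives disjoint chunks if every process first imposes the same order. The sketch below shows one way to do that in plain Python by sorting on a stable key before slicing. The function name `chunk_for_worker` and its parameters are hypothetical; with a Django QuerySet, the usual alternative is to call `.order_by(...)` on a unique field before slicing, though whether that fits this particular script is an assumption.

```python
from typing import List, Sequence


def chunk_for_worker(
    keys: Sequence[str], worker_index: int, n_workers: int
) -> List[str]:
    """Return the keys assigned to one worker, independent of input order."""
    if n_workers < 1 or not 0 <= worker_index < n_workers:
        raise ValueError("worker_index must be in range(n_workers)")
    # Sorting gives every process the same sequence, whatever order it received.
    ordered = sorted(keys)
    # Round-robin assignment: worker k takes items k, k + n, k + 2n, ...
    return ordered[worker_index::n_workers]


# Two processes that see the same keys in different orders still get
# disjoint chunks that together cover every key.
seen_by_process_a = ["c", "a", "d", "b"]
seen_by_process_b = ["b", "d", "a", "c"]
print(chunk_for_worker(seen_by_process_a, 0, 2))
print(chunk_for_worker(seen_by_process_b, 1, 2))
```

Sorting on a unique identifier (for example a record's uid or key) is what makes the assignment reproducible across processes and reruns.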

If it were corruption, I would have expected only one file to be corrupted, but so far I have around 5 problematic files: after I process them, I call file.save() and it raises this error... But it does not happen for all my files. I restarted a run and, for example, files 1, 2, 3 gave an issue, 4 and 5 worked, then 6 gave an issue. Now I am doing 7. The files that gave an issue are in ~/.cache/lamin, but like all lamin-saved h5ads, I cannot open them with scanpy...

I agree with Sergei that this seems all due to inter-instance transfer.

The foreign key issue is likely due to a bug that we somehow haven't covered in tests.

Could you privately share the script that you're running to transfer the data? I'll debug it.

h5ads I cannot open them with scanpy...

I don't understand this one as we don't do anything to the h5ads, but I'm happy to debug.

jkobject commented 10 months ago

I don't understand this one as we don't do anything to the h5ads, but I'm happy to debug.

Here is the issue when loading an h5ad with scanpy:

[Image: Screenshot 2023-12-12 at 17 31 12]

[Image: Screenshot 2023-12-12 at 17 31 25]

falexwolf commented 10 months ago

Let me know if you're free to get on a call - ping me on Slack! This looks like a file not found error. 😅
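
One quick way to confirm or rule out a file-not-found problem is to check the path before handing it to a reader such as sc.read_h5ad. This is only a sketch under assumptions: `resolve_existing_file` is a hypothetical helper, and the screenshots do not show which path was actually passed.

```python
from pathlib import Path


def resolve_existing_file(path_str: str) -> Path:
    """Expand a leading "~" and fail early if no regular file is at the path."""
    path = Path(path_str).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"no file at {path}")
    return path
```

Passing the returned path to the reader separates "the path is wrong" from "the file content cannot be read".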

jkobject commented 10 months ago

Here is the code:

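import lamindb as ln

# do_cache, MYDESC, NAME, DESC, cx_dataset, and some_preprocess are defined elsewhere in the script.
# The two containers below are assumed to be initialized like this earlier in the script.
all_ready_processed_keys = set()  # keys of source files that were already preprocessed
files = []  # newly saved File records, collected for the final Dataset

if do_cache:
    # collect the keys of files processed in a previous run so they can be skipped
    for i in ln.File.filter(description=MYDESC):
        all_ready_processed_keys.add(i.initial_version.key)
# note: the iteration order of cx_dataset.files.all() is not guaranteed
for i, file in enumerate(cx_dataset.files.all()):
    # use the counts matrix
    print(i)
    if file.key in all_ready_processed_keys:
        print(f"{file.key} is already processed")
        continue
    print(file)
    # skip files that contain no primary-data cells
    if file.backed().obs.is_primary_data.sum() == 0:
        print(f"{file.key} only contains non primary cells")
        continue
    adata = file.load(stream=True)
    print(adata)
    adata = some_preprocess(adata)
    # register the preprocessed data as a new version of the source file
    myfile = ln.File(
        adata,
        is_new_version_of=file,
        description=MYDESC,
    )
    myfile.save()
    files.append(myfile)
dataset = ln.Dataset(files, name=NAME, description=DESC)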
dataset.save()

falexwolf commented 4 months ago

I'm pretty sure we simply forgot to close this.