isamplesorg / isamples_inabox

Provides functionality intermediate to a collection and central
0 stars 1 forks source link

Row in OpenContext iSB failing to be ingested in iSC #343

Open dannymandel opened 8 months ago

dannymandel commented 8 months ago

Was going through the logs looking at the OpenContext sitemap ingest from iSB to iSC and saw this error:

2024-01-03T03:13:58 root:INFO: About to process 4571 things
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1880, in _execute_context
    self.dialect.do_executemany(
  File "/usr/local/lib/python3.11/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 1000, in do_executemany
    cursor.executemany(statement, parameters)
psycopg2.errors.NotNullViolation: null value in column "resolved_url" of relation "thing" violates not-null constraint
DETAIL:  Failing row contains (6088737, {"id": "http://opencontext.org/subjects/f233b497-13fd-4107-894f-..., ["ark:/28722/k2vt23934", "ark:/28722/k2vt23934"], ark:/28722/k2vt23934, 2024-01-03 03:13:59.130918, 2023-11-10 18:30:17, sample, OPENCONTEXT, null, 200, 2023-11-15 00:58:32.62224, 0, application/json, 8f2db3c142c30e9).

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/scripts/consume_sitemaps.py", line 289, in <module>
    main()
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/scripts/consume_sitemaps.py", line 101, in main
    fetch_sitemap_files(
  File "/app/scripts/consume_sitemaps.py", line 276, in fetch_sitemap_files
    db_session.bulk_update_mappings(
  File "/usr/local/lib/python3.11/site-packages/sqlalchemy/orm/session.py", line 3822, in bulk_update_mappings
    self._bulk_save_mappings(
  File "/usr/local/lib/python3.11/site-packages/sqlalchemy/orm/session.py", line 3861, in _bulk_save_mappings
    with util.safe_reraise():
  File "/usr/local/lib/python3.11/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/usr/local/lib/python3.11/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/usr/local/lib/python3.11/site-packages/sqlalchemy/orm/session.py", line 3842, in _bulk_save_mappings
    persistence._bulk_update(
  File "/usr/local/lib/python3.11/site-packages/sqlalchemy/orm/persistence.py", line 182, in _bulk_update
    _emit_update_statements(
  File "/usr/local/lib/python3.11/site-packages/sqlalchemy/orm/persistence.py", line 1001, in _emit_update_statements
    c = connection._execute_20(
        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1705, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sqlalchemy/sql/elements.py", line 333, in _execute_on_connection
    return connection._execute_clauseelement(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1572, in _execute_clauseelement
    ret = self._execute_context(
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1943, in _execute_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 2124, in _handle_dbapi_exception
    util.raise_(
  File "/usr/local/lib/python3.11/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/usr/local/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1880, in _execute_context
    self.dialect.do_executemany(
  File "/usr/local/lib/python3.11/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 1000, in do_executemany
    cursor.executemany(statement, parameters)
sqlalchemy.exc.IntegrityError: (psycopg2.errors.NotNullViolation) null value in column "resolved_url" of relation "thing" violates not-null constraint
DETAIL:  Failing row contains (6088737, {"id": "http://opencontext.org/subjects/f233b497-13fd-4107-894f-..., ["ark:/28722/k2vt23934", "ark:/28722/k2vt23934"], ark:/28722/k2vt23934, 2024-01-03 03:13:59.130918, 2023-11-10 18:30:17, sample, OPENCONTEXT, null, 200, 2023-11-15 00:58:32.62224, 0, application/json, 8f2db3c142c30e9).
dannymandel commented 8 months ago

Debugging in the db:

isb_1=# select * from thing where id='ark:/28722/k2vt23934';
          id          |            tstamp            |        tcreated        | item_type | authority_id | related | log | resolved_url | resolved_status |          tresolved           | resolve_elapsed |      

                                                                                                  resolved_content                                                                                                

        | resolved_media_type |   _id   |       identifiers        |       h3        
----------------------+------------------------------+------------------------+-----------+--------------+---------+-----+--------------+-----------------+------------------------------+-----------------+------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------+---------------------+---------+--------------------------+-----------------
 ark:/28722/k2vt23934 | 2023-11-15 00:58:32.62224+00 | 2023-11-10 18:30:17+00 | sample    | OPENCONTEXT  |         |     |              |             200 | 2023-11-15 00:58:32.62224+00 |               0 | {"id"
: "http://opencontext.org/subjects/f233b497-13fd-4107-894f-8776f4205d59", "uri": "http://opencontext.org/subjects/f233b497-13fd-4107-894f-8776f4205d59", "href": "https://staging.opencontext.org/subjects/f233b49
7-13fd-4107-894f-8776f4205d59", "icon": "https://staging.opencontext.org/static/oc/icons-v2/object-icon-draft-2.svg", "label": "A080114", "Creator": [{"id": "http://opencontext.org/persons/fc2a17fc-6edb-41d8-89
1e-f231bfd8c23f", "label": "Douglas R Clark"}, {"id": "http://opencontext.org/persons/43220a42-501d-4961-a01c-28d6ec145da7", "label": "Larry G Herr"}], "License": [{"id": "https://creativecommons.org/licenses/b
y/4.0", "label": "Attribution 4.0 International (CC BY 4.0)"}], "context": {"id": "http://opencontext.org/subjects/8c770f83-ca04-b90c-9fa8-dc8548902474", "label": "Asia/Jordan/Tall al-Umayri/Field H/Square 7K22
/Locus H-7K22-064"}, "project": {"id": "http://opencontext.org/projects/2015946b-da6d-4a0f-b676-03bee1efad21", "label": "Madaba Plains Project-`Umayri"}, "updated": "2023-11-10T18:30:17Z", "latitude": 31.868494
612961918, "longitude": 35.887987861855535, "published": "2007-01-01T00:00:00Z", "late bce/ce": -950.0, "citation uri": "https://n2t.net/ark:/28722/k2vt23934", "early bce/ce": -1200.0, "item category": "Object"
, "isam:SamplingSite": {"id": "https://opencontext.org/subjects/38023c67-71a3-44b9-ac47-af7d5672790b", "label": "Tall al-Umayri", "identifier": "https://opencontext.org/subjects/38023c67-71a3-44b9-ac47-af7d5672
790b"}} | application/json    | 7766318 | ["ark:/28722/k2vt23934"] | 8f2db3c142c30e9
(1 row)

isb_1=# select resolved_url from thing where id='ark:/28722/k2vt23934';
 resolved_url 
--------------

(1 row)
dannymandel commented 8 months ago
isb_1=# select count(*) from thing where resolved_url is NULL;
 count 
-------
   571
(1 row)
dannymandel commented 8 months ago

It looks like this code is the problem:

            # TODO: this is a hack and should ideally be replaced on the OC side.
            if next_url is not None:
                self.url = next_url.replace("ALL-STANDARD-LD", "iSamples")
            else:
                self.url = None

In this case, I think we should just keep the current url. The other thing we'll need to double check, however, is to make sure that we properly exit the iterator if we get rid of that condition.

dannymandel commented 8 months ago

In the meantime, since this field is just for bookkeeping, I think it's safe to set it to a dummy value.

dannymandel commented 8 months ago

On further analysis, I'm not so sure that that code is problematic. We should have exhausted all the records in the iterator by the time we hit the self.url = None line.

dannymandel commented 8 months ago

I've dropped the not null constraint on the resolved_url column on iSC. The update on iSC should complete tonight. Note that the database on iSB didn't have the not null constraint on the column, which is the source of the problem.

dannymandel commented 8 months ago

I'll leave this open and check tomorrow to ensure the update completes as expected.

dannymandel commented 8 months ago

I still think there is a bug in the record harvester, however. There are 571 records with no resolved_url, and look at this:

isb_1=# select max (_id) from thing;
   max   
---------
 7766888
(1 row)

isb_1=# select max(_id) from thing where resolved_url is null;
   max   
---------
 7766888
(1 row)

isb_1=# select min(_id) from thing where resolved_url is null;
   min   
---------
 7766318
(1 row)

7766888 - 7766318 = 570

Certainly feels like the last chunk of records we got just didn't have a resolved_url associated with them.