nih-cfde / cfde-deriva

Collaboration point for miscellaneous CFDE-deriva scripts
Other
2 stars 3 forks source link

Submission.ingest triggers a sqlite syntax error in some python environments #372

Closed DavidKelly-Praedictus closed 1 year ago

DavidKelly-Praedictus commented 2 years ago

When running on Ubuntu 20.04, with either the python3.8 or python3.9 packages from the ubuntu repo, submission.ingest throws a sqlite3 syntax error.

The same data and cfde-deriva version works fine when using miniconda 4.10.3 (python 3.8). It seems like this is triggered by something in the python environment, but I haven't been able to narrow it down beyond that. This was tested using cfde-deriva commit 51f99c841506d25a5488eadb74351500e0055836.

Using miniconda solves everything for me, but I thought I should mention this as a potential issue.

DEBUG:cfde_deriva.submission: Traceback (most recent call last): 

File "/home/funcx/cfde-deriva/cfde_deriva/submission.py", line 419, in ingest
self.prepare_sqlite_derived_data(self.portal_prep_sqlite_filename, attach={"submission": self.ingest_sqlite_filename})

File "/home/funcx/cfde-deriva/cfde_deriva/submission.py", line 1029, in prepare_sqlite_derived_data
prep_dp.sqlite_do_etl(conn, submission_dp, \'submission\', progress=progress)

File "/home/funcx/cfde-deriva/cfde_deriva/datapackage.py", line 1623, in sqlite_do_etl
conn.executescript(sql)
sqlite3.OperationalError: near "FROM": syntax error

ERROR:cfde_deriva.submission:Failed with exception near "FROM": syntax error in ingest sequence with
next_error_state=cfde_registry_dp_status:content-error for datapackage 3c28563a-0c92-11ed-854e-2bc4ee7d7b12
DEBUG:cfde_deriva.submission:Updating datapackage 3c28563a-0c92-11ed-854e-2bc4ee7d7b12
status=cfde_registry_dp_status:content-error diagnostics=An unknown operational error has occurred....

And below is a list of the python modules in the standard 3.8 python virtual environment where it throws an error:

attrs==21.4.0
bagit==1.8.1
bagit-profile==1.3.1
bdbag==1.6.3
boto3==1.24.38
botocore==1.27.38
certifi==2022.6.15
-e git+https://github.com/nih-cfde/cfde-deriva.git@51f99c841506d25a5488eadb74351500e0055836#egg=cfde_deriva
# Editable install with no version control (cfde-ingest==1.0.0)
-e /home/funcx/cfde-ingest
cffi==1.15.1
chardet==5.0.0
charset-normalizer==2.1.0
click==8.1.3
colorama==0.4.5
commonmark==0.9.1
cryptography==37.0.4
decorator==5.1.1
deriva==1.5.3
fair-identifiers-client==0.5.0
fair-research-login==0.3.0
frictionless==4.40.5
globus-sdk==3.10.1
idna==3.3
importlib-resources==5.9.0
isodate==0.6.1
Jinja2==3.1.2
jmespath==1.0.1
jsonschema==4.7.2
marko==1.2.1
MarkupSafe==2.1.1
packaging==21.3
petl==1.7.10
pika==1.3.0
portalocker==2.5.1
pycparser==2.21
Pygments==2.12.0
PyJWT==2.4.0
pyparsing==3.0.9
pyrsistent==0.18.1
python-dateutil==2.8.2
python-slugify==6.1.2
pytz==2022.1
PyYAML==6.0
requests==2.28.1
rfc3986==2.0.0
rich==12.5.1
s3transfer==0.6.0
setuptools-scm==5.0.2
shellingham==1.4.0
simpleeval==0.9.12
six==1.16.0
stringcase==1.2.0
tableschema-to-template==0.0.12
tabulate==0.8.10
text-unidecode==1.3
typer==0.6.1
typing-extensions==4.3.0
tzlocal==2.1
urllib3==1.26.11
validators==0.20.0
watchtower==3.0.0
XlsxWriter==3.0.3
zipp==3.8.1
karlcz commented 1 year ago

This is probably an issue of an older embedded sqlite version having more limited SQL syntax support in an older Python runtime. However, it is not an issue we face in production and there isn't enough diagnostic information captured here to really narrow down the offending query. The reported code is in a loop that generates and processes many different ETL queries across the whole portal model.

Closing because we aren't likely to investigate this further unless we encounter an easily reproducible failure with future test systems.