oda-hub / nb2workflow

GNU General Public License v3.0

Backend container doesn't start with volume provisioned through NFS #182

Closed dsavchenko closed 3 months ago

dsavchenko commented 3 months ago

On staging, even after https://github.com/oda-hub/nb2workflow/pull/180

@volodymyrss does it deploy well in prod?

volodymyrss commented 3 months ago

It works, yes. Although it takes a bit of time.

dsavchenko commented 3 months ago

That's very weird: in my deployment the startupProbe fails constantly with "connection refused". But if I kubectl edit the corresponding deployment in place, changing even something unrelated to the startup probe (e.g. the failureThreshold of the livenessProbe), a new pod is created and starts without any problem.

Any ideas what it could be, or where to look?
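For what it's worth, any in-place edit of a pod-template field (probe settings included) changes the template hash and triggers a new rollout, which is probably what unsticks the pod. The same effect can be had explicitly, without touching probe fields by hand (deployment name below is hypothetical):

```shell
# Trigger a fresh rollout of the backend without editing the spec
# (replace the deployment name with the real one).
kubectl rollout restart deployment/nb2workflow-backend
kubectl rollout status deployment/nb2workflow-backend
```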

dsavchenko commented 3 months ago

Hm, I finally managed to extract at least some logs from the unhealthy pod:

Traceback (most recent call last):
  File "/opt/conda/bin/nb2service", line 5, in <module>
    from nb2workflow.service import main
  File "/opt/conda/lib/python3.9/site-packages/nb2workflow/service.py", line 14, in <module>
    from nb2workflow import ontology, publish, schedule
  File "/opt/conda/lib/python3.9/site-packages/nb2workflow/ontology.py", line 6, in <module>
    import nb2workflow.nbadapter as nbadapter
  File "/opt/conda/lib/python3.9/site-packages/nb2workflow/nbadapter.py", line 46, in <module>
    from nb2workflow import workflows
  File "/opt/conda/lib/python3.9/site-packages/nb2workflow/workflows.py", line 15, in <module>
    cache = Cache('.nb2workflow/cache')
  File "/opt/conda/lib/python3.9/site-packages/diskcache/core.py", line 478, in __init__
    self.reset(key, value, update=False)
  File "/opt/conda/lib/python3.9/site-packages/diskcache/core.py", line 2431, in reset
    ((old_value,),) = sql(
sqlite3.OperationalError: database is locked

Since I use NFS volumes with ReadWriteMany as the workdir, the service refuses to start because the cache is locked by the previously running container. This doesn't explain why it starts after editing the deployment, though.
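The "database is locked" error from the traceback can be reproduced locally with the standard sqlite3 module (a minimal sketch, not nb2workflow code): two connections to the same database file stand in for two pods sharing one cache directory on a ReadWriteMany volume. On NFS the situation is worse, because fcntl-based locking is unreliable there and the lock can appear held even after the original container is gone.

```python
import os
import sqlite3
import tempfile

# Two connections to one database file, simulating two pods
# sharing a single cache directory on a shared volume.
path = os.path.join(tempfile.mkdtemp(), "cache.db")

writer = sqlite3.connect(path)
writer.execute("CREATE TABLE settings (key TEXT, value TEXT)")
writer.execute("BEGIN IMMEDIATE")  # take the write lock, like the old pod
writer.execute("INSERT INTO settings VALUES ('a', '1')")

# The "new" pod: with a zero timeout it cannot wait for the lock.
other = sqlite3.connect(path, timeout=0)
try:
    other.execute("INSERT INTO settings VALUES ('b', '2')")
except sqlite3.OperationalError as e:
    msg = str(e)
    print(msg)  # database is locked
```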

dsavchenko commented 3 months ago

> As I use nfs volumes with ReadWriteMany as workdir, it refuses to start because the cache is locked by the previously running container. This doesn't explain why it starts after editing deployment, though

Not this...

dsavchenko commented 3 months ago

Well, I think the problem is that in my installation the persistent volumes are NFS, and SQLite doesn't work well on it.

@volodymyrss what is this cache for, and what is the workflows module in general? It seems we only use serialize_workflow_exception from it, so it could probably be moved somewhere else.

volodymyrss commented 3 months ago

It provides a homogeneous way to run workflows/tools, either as local notebook files or as requests to different services. It was used e.g. in tests and when notebooks call other notebooks. This sort of functionality is needed, but it may not need to work this way or live here. If this cache is the source of the issue, you can disable it.
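One way to act on that suggestion without disabling the cache entirely (a sketch only; the NB2WORKFLOW_CACHE_DIR variable and the helper are hypothetical, not existing nb2workflow configuration): resolve the cache directory from an environment variable, falling back to node-local temporary storage, so the SQLite file never lands on the NFS-backed workdir.

```python
import os
import tempfile

def cache_dir(env_var: str = "NB2WORKFLOW_CACHE_DIR") -> str:
    # Hypothetical helper: prefer an explicit override, otherwise use
    # node-local temporary storage instead of the (possibly NFS) workdir,
    # since SQLite locking is unreliable on NFS.
    base = os.environ.get(env_var) or os.path.join(
        tempfile.gettempdir(), "nb2workflow-cache"
    )
    os.makedirs(base, exist_ok=True)
    return base

print(cache_dir())
```

The same idea would also let a Kubernetes deployment point the cache at an emptyDir volume per pod instead of the shared NFS volume.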