overhangio / tutor

The Docker-based Open edX distribution designed for peace of mind
https://docs.tutor.overhang.io/
GNU Affero General Public License v3.0

k8s deployment fails #295

Closed ChloeOCB closed 3 years ago

ChloeOCB commented 4 years ago

Bug description

I am not able to deploy Open edX with Tutor on a Kubernetes cluster.

How to reproduce

Running tutor k8s quickstart twice produces two different behaviors.

First try

$ tutor k8s quickstart
==================================================
        Interactive platform configuration
==================================================
Your website domain name for students (LMS) [www.myopenedx.com]
Your website domain name for teachers (CMS) [studio.www.myopenedx.com]
Your platform name/title [My Open edX]
Your public contact email address [contact@www.myopenedx.com]
The default language code for the platform [en]
Activate SSL/TLS certificates for HTTPS access? Important note: this will NOT work in a development environment. [y/N]
Configuration saved to /home/cloud/.local/share/tutor/config.yml
================================================
        Updating the current environment
================================================
Environment generated in /home/cloud/.local/share/tutor/env
=====================================
        Starting the platform
=====================================
kubectl apply --kustomize /home/cloud/.local/share/tutor/env --wait --selector app.kubernetes.io/component=namespace
namespace/openedx unchanged
kubectl apply --kustomize /home/cloud/.local/share/tutor/env --wait --selector app.kubernetes.io/component=volume
persistentvolumeclaim/elasticsearch created
persistentvolumeclaim/minio created
persistentvolumeclaim/mongodb created
....
....
================================================
        Database creation and migrations
================================================
Waiting for a mysql pod to be ready...
kubectl wait --namespace openedx --selector=app.kubernetes.io/instance=openedx-mQBKhLDUq8mj2hVCKnNtxvDi,app.kubernetes.io/name=mysql --for=condition=ContainersReady --timeout=600s pod
error: no matching resources found
Error: Command failed with status 1: kubectl wait --namespace openedx --selector=app.kubernetes.io/instance=openedx-mQBKhLDUq8mj2hVCKnNtxvDi,app.kubernetes.io/name=mysql --for=condition=ContainersReady --timeout=600s pod

Second try

==================================================
        Interactive platform configuration
==================================================
Your website domain name for students (LMS) [www.myopenedx.com]
Your website domain name for teachers (CMS) [studio.www.myopenedx.com]
Your platform name/title [My Open edX]
Your public contact email address [contact@www.myopenedx.com]
The default language code for the platform [en]
Activate SSL/TLS certificates for HTTPS access? Important note: this will NOT work in a development environment. [y/N]
Configuration saved to /home/cloud/.local/share/tutor/config.yml
================================================
        Updating the current environment
================================================
Environment generated in /home/cloud/.local/share/tutor/env
=====================================
        Starting the platform
=====================================
kubectl apply --kustomize /home/cloud/.local/share/tutor/env --wait --selector app.kubernetes.io/component=namespace
namespace/openedx unchanged
kubectl apply --kustomize /home/cloud/.local/share/tutor/env --wait --selector app.kubernetes.io/component=volume
persistentvolumeclaim/elasticsearch unchanged
persistentvolumeclaim/minio unchanged
persistentvolumeclaim/mongodb unchanged
persistentvolumeclaim/mysql unchanged
persistentvolumeclaim/rabbitmq unchanged
....
....
================================================
        Database creation and migrations
================================================
Waiting for a mysql pod to be ready...
kubectl wait --namespace openedx --selector=app.kubernetes.io/instance=openedx-mQBKhLDUq8mj2hVCKnNtxvDi,app.kubernetes.io/name=mysql --for=condition=ContainersReady --timeout=600s pod
pod/mysql-657b8df849-qphqb condition met
....
....
Initialising MySQL...
Warning: Using a password on the command line interface can be insecure.
MySQL is up and running
Warning: Using a password on the command line interface can be insecure.
Warning: Using a password on the command line interface can be insecure.
Plugin minio: running pre-init for service minio...
Waiting for a minio pod to be ready...
kubectl wait --namespace openedx --selector=app.kubernetes.io/instance=openedx-mQBKhLDUq8mj2hVCKnNtxvDi,app.kubernetes.io/name=minio --for=condition=ContainersReady --timeout=600s pod
pod/minio-bf486dd9d-6gmhj condition met
Finding pod name for minio deployment...
....
....
Added `minio` successfully.
....
....
kubectl exec --namespace openedx lms-54bc469496-82qpk -- sh -e -c dockerize -wait tcp://mysql:3306 -timeout 20s

export DJANGO_SETTINGS_MODULE=$SERVICE_VARIANT.envs.$SETTINGS
echo "Loading settings $DJANGO_SETTINGS_MODULE"

./manage.py lms migrate

./manage.py lms create_oauth2_client \
    "http://androidapp.com" "http://androidapp.com/redirect" public \
    --client_id android --client_secret VbdN477qHCuYpd6gpTziyJBF \
    --trusted

# Fix incorrect uploaded file path
if [ -d /openedx/data/uploads/ ]; then
  if [ -n "$(ls -A /openedx/data/uploads/)" ]; then
    echo "Migrating LMS uploaded files to shared directory"
    mv /openedx/data/uploads/* /openedx/media/
    rm -rf /openedx/data/uploads/
  fi
fi
2020/02/18 11:16:08 Waiting for: tcp://mysql:3306
2020/02/18 11:16:08 Connected to tcp://mysql:3306
Loading settings lms.envs.tutor.production
WARNING:py.warnings:/openedx/edx-platform/lms/djangoapps/courseware/__init__.py:5: DeprecationWarning: Importing 'lms.djangoapps.courseware' as 'courseware' is no longer supported
  warnings.warn("Importing 'lms.djangoapps.courseware' as 'courseware' is no longer supported", DeprecationWarning)

2020-02-18 11:16:12,602 WARNING 56 [enterprise.utils] utils.py:50 - Could not import Registry from third_party_auth.provider
2020-02-18 11:16:12,602 WARNING 56 [enterprise.utils] utils.py:51 - cannot import name _LTI_BACKENDS
Operations to perform:
  Apply all migrations: admin, api_admin, assessment, auth, badges, block_structure, bookmarks, branding, bulk_email, catalog, celery_utils, certificates, commerce, completion, consent, content_type_gating, contentserver, contenttypes, cors_csrf, course_action_state, course_duration_limits, course_goals, course_groups, course_modes, course_overviews, courseware, crawlers, credentials, credit, dark_lang, database_fixups, degreed, django_comment_common, django_notify, django_openid_auth, djcelery, edx_oauth2_provider, edx_proctoring, edxval, email_marketing, embargo, enterprise, entitlements, experiments, external_auth, grades, instructor_task, integrated_channel, lms_xblock, microsite_configuration, milestones, mobile_api, notes, oauth2, oauth2_provider, oauth_dispatch, oauth_provider, organizations, programs, redirects, rss_proxy, sap_success_factors, schedules, self_paced, sessions, shoppingcart, site_configuration, sites, social_django, splash, static_replace, status, student, submissions, survey, teams, theming, third_party_auth, track, user_api, user_authn, util, verified_track_content, verify_student, video_config, video_pipeline, waffle, waffle_utils, wiki, workflow, xapi, xblock_django
Running migrations:
  Applying contenttypes.0001_initial... OK
  Applying auth.0001_initial... OK
  Applying admin.0001_initial... OK
  ....
  ....
  Applying certificates.0003_data__default_modes...Traceback (most recent call last):
  File "./manage.py", line 123, in <module>
    execute_from_command_line([sys.argv[0]] + django_args)
  File "/openedx/venv/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
    utility.execute()
  File "/openedx/venv/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 356, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/openedx/venv/local/lib/python2.7/site-packages/django/core/management/base.py", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/openedx/venv/local/lib/python2.7/site-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/openedx/venv/local/lib/python2.7/site-packages/django/core/management/commands/migrate.py", line 204, in handle
    fake_initial=fake_initial,
  File "/openedx/venv/local/lib/python2.7/site-packages/django/db/migrations/executor.py", line 115, in migrate
    state = self._migrate_all_forwards(state, plan, full_plan, fake=fake, fake_initial=fake_initial)
  File "/openedx/venv/local/lib/python2.7/site-packages/django/db/migrations/executor.py", line 145, in _migrate_all_forwards
    state = self.apply_migration(state, migration, fake=fake, fake_initial=fake_initial)
  File "/openedx/venv/local/lib/python2.7/site-packages/django/db/migrations/executor.py", line 244, in apply_migration
    state = migration.apply(state, schema_editor)
  File "/openedx/venv/local/lib/python2.7/site-packages/django/db/migrations/migration.py", line 126, in apply
    operation.database_forwards(self.app_label, schema_editor, old_state, project_state)
  File "/openedx/venv/local/lib/python2.7/site-packages/django/db/migrations/operations/special.py", line 193, in database_forwards
    self.code(from_state.apps, schema_editor)
  File "/openedx/edx-platform/lms/djangoapps/certificates/migrations/0003_data__default_modes.py", line 24, in forwards
    File(open(settings.PROJECT_ROOT / 'static' / 'images' / 'default-badges' / file_name))
  File "/openedx/venv/local/lib/python2.7/site-packages/django/db/models/fields/files.py", line 94, in save
    self.name = self.storage.save(name, content, max_length=self.field.max_length)
  File "/openedx/venv/local/lib/python2.7/site-packages/django/core/files/storage.py", line 54, in save
    return self._save(name, content)
  File "/openedx/venv/local/lib/python2.7/site-packages/storages/backends/s3boto.py", line 409, in _save
    key = self.bucket.get_key(encoded_name)
  File "/openedx/venv/local/lib/python2.7/site-packages/boto/s3/bucket.py", line 193, in get_key
    key, resp = self._get_key_internal(key_name, headers, query_args_l)
  File "/openedx/venv/local/lib/python2.7/site-packages/boto/s3/bucket.py", line 200, in _get_key_internal
    query_args=query_args)
  File "/openedx/venv/local/lib/python2.7/site-packages/boto/s3/connection.py", line 665, in make_request
    retry_handler=retry_handler
  File "/openedx/venv/local/lib/python2.7/site-packages/boto/connection.py", line 1071, in make_request
    retry_handler=retry_handler)
  File "/openedx/venv/local/lib/python2.7/site-packages/boto/connection.py", line 1030, in _mexe
    raise ex
socket.gaierror: [Errno -2] Name or service not known
command terminated with exit code 1
Error: Command failed with status 1: kubectl exec --namespace openedx lms-54bc469496-82qpk -- sh -e -c dockerize -wait tcp://mysql:3306 -timeout 20s

export DJANGO_SETTINGS_MODULE=$SERVICE_VARIANT.envs.$SETTINGS
echo "Loading settings $DJANGO_SETTINGS_MODULE"

./manage.py lms migrate

./manage.py lms create_oauth2_client \
    "http://androidapp.com" "http://androidapp.com/redirect" public \
    --client_id android --client_secret VbdN477qHCuYpd6gpTziyJBF \
    --trusted

# Fix incorrect uploaded file path
if [ -d /openedx/data/uploads/ ]; then
  if [ -n "$(ls -A /openedx/data/uploads/)" ]; then
    echo "Migrating LMS uploaded files to shared directory"
    mv /openedx/data/uploads/* /openedx/media/
    rm -rf /openedx/data/uploads/
  fi
fi

The first time, the deployment fails immediately. The second time, without any change, the deployment goes further but then stops again and throws a Python exception.
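The first failure is consistent with kubectl wait returning "no matching resources found" right away when the selector does not match any pod yet, instead of waiting for one to appear. A minimal Python sketch of a retry wrapper around the same command is shown here. The name wait_for_pod, the retry count, the delay, and the injectable run and sleep parameters are all hypothetical; none of this is part of Tutor.

```python
import subprocess
import time


def wait_for_pod(selector, namespace="openedx", attempts=10, delay=5.0,
                 run=subprocess.run, sleep=time.sleep):
    """Retry `kubectl wait` while the selector matches no pod yet.

    Hypothetical helper, not part of Tutor. `run` and `sleep` are
    injectable so the retry logic can be exercised without a cluster.
    """
    command = [
        "kubectl", "wait",
        "--namespace", namespace,
        "--selector=" + selector,
        "--for=condition=ContainersReady",
        "--timeout=600s",
        "pod",
    ]
    for attempt in range(attempts):
        result = run(command, capture_output=True, text=True)
        if result.returncode == 0:
            return True
        if "no matching resources found" not in result.stderr:
            # A different failure (for example a timeout): do not retry.
            return False
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling wait_for_pod("app.kubernetes.io/name=mysql") would then keep retrying for a short while after the deployments are applied, which may explain why a second quickstart run gets past this step.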

Environment

OS: Ubuntu 16.04
Tutor version: 3.11.4
k8s server version: 1.14.1
k8s client version: 1.14.3

regisb commented 4 years ago

Hi @ChloeOCB! Are all your deployments up and running? You can view them in the K8s dashboard.

ChloeOCB commented 4 years ago

Hi @regisb,

Deployments are all up

$ kubectl get deployment -n openedx
NAME            READY   UP-TO-DATE   AVAILABLE   AGE
cms             1/1     1            1           90m
cms-worker      1/1     1            1           90m
elasticsearch   1/1     1            1           90m
forum           1/1     1            1           90m
lms             1/1     1            1           90m
lms-worker      1/1     1            1           90m
memcached       1/1     1            1           90m
minio           1/1     1            1           90m
mongodb         1/1     1            1           90m
mysql           1/1     1            1           90m
nginx           1/1     1            1           90m
rabbitmq        1/1     1            1           90m
smtp            1/1     1            1           90m

Pods are up and running, although several have restarted or keep restarting.

$ kubectl get pods -n openedx
NAME                             READY   STATUS    RESTARTS   AGE
cms-bd7ccb69b-lftz7              1/1     Running   0          79m
cms-worker-559656496c-q4rbd      1/1     Running   4          79m
elasticsearch-799bbc7f4d-k9snv   1/1     Running   0          79m
forum-64d4598cbf-2gt7k           1/1     Running   7          79m
lms-54bc469496-hsszm             1/1     Running   3          79m
lms-worker-5dc86f6f46-2cngg      1/1     Running   4          79m
memcached-7d68b8875-fw98z        1/1     Running   0          79m
minio-bf486dd9d-rc4n5            1/1     Running   0          79m
mongodb-7df787764-jtskf          1/1     Running   0          79m
mysql-657b8df849-m2dlg           1/1     Running   0          79m
nginx-57d56f4fdb-j5lp4           1/1     Running   0          79m
rabbitmq-76f9f4844b-z86km        1/1     Running   0          79m
smtp-96dbf5995-dpplj             1/1     Running   0          79m

But services are not all available.

I can access the MinIO interface but not the Open edX interface.

The most meaningful logs are from the cms and lms pods:

$ kubectl logs cms-bd7ccb69b-lftz7 -n openedx
2020-02-27 10:40:12,818 ERROR 12 [root] signals.py:21 - Uncaught exception from None
Traceback (most recent call last):
  File "/openedx/venv/local/lib/python2.7/site-packages/django/core/handlers/exception.py", line 41, in inner
    response = get_response(request)
  File "/openedx/venv/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 244, in _legacy_get_response
    response = middleware_method(request)
  File "/openedx/venv/local/lib/python2.7/site-packages/edx_django_utils/monitoring/middleware.py", line 119, in process_request
    if self._is_enabled():
  File "/openedx/venv/local/lib/python2.7/site-packages/edx_django_utils/monitoring/middleware.py", line 208, in _is_enabled
    return waffle.switch_is_active(u'edx_django_utils.monitoring.enable_memory_middleware')
  File "/openedx/venv/local/lib/python2.7/site-packages/waffle/__init__.py", line 23, in switch_is_active
    switch = Switch.get(switch_name)
  File "/openedx/venv/local/lib/python2.7/site-packages/waffle/models.py", line 50, in get
    obj = cls.objects.get(name=name)
  File "/openedx/venv/local/lib/python2.7/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/openedx/venv/local/lib/python2.7/site-packages/django/db/models/query.py", line 374, in get
    num = len(clone)
  File "/openedx/venv/local/lib/python2.7/site-packages/django/db/models/query.py", line 232, in __len__
    self._fetch_all()
  File "/openedx/venv/local/lib/python2.7/site-packages/django/db/models/query.py", line 1121, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/openedx/venv/local/lib/python2.7/site-packages/django/db/models/query.py", line 53, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch)
  File "/openedx/venv/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 899, in execute_sql
    raise original_exception
ProgrammingError: (1146, "Table 'openedx.waffle_switch' doesn't exist")
2020-02-27 10:40:12,819 ERROR 12 [django.request] exception.py:135 - Internal Server Error: /
$ kubectl logs lms-54bc469496-hsszm -n openedx
WARNING:py.warnings:/openedx/edx-platform/lms/djangoapps/courseware/__init__.py:5: DeprecationWarning: Importing 'lms.djangoapps.courseware' as 'courseware' is no longer supported
  warnings.warn("Importing 'lms.djangoapps.courseware' as 'courseware' is no longer supported", DeprecationWarning)

WARNING:py.warnings:/openedx/edx-platform/lms/djangoapps/courseware/__init__.py:5: DeprecationWarning: Importing 'lms.djangoapps.courseware' as 'courseware' is no longer supported
  warnings.warn("Importing 'lms.djangoapps.courseware' as 'courseware' is no longer supported", DeprecationWarning)

2020-02-27 09:44:34,405 WARNING 10 [enterprise.utils] utils.py:50 - Could not import Registry from third_party_auth.provider
2020-02-27 09:44:34,405 WARNING 10 [enterprise.utils] utils.py:51 - cannot import name _LTI_BACKENDS
2020-02-27 09:44:34,429 WARNING 12 [enterprise.utils] utils.py:50 - Could not import Registry from third_party_auth.provider
2020-02-27 09:44:34,430 WARNING 12 [enterprise.utils] utils.py:51 - cannot import name _LTI_BACKENDS
2020-02-27 10:07:52,264 ERROR 12 [django.security.DisallowedHost] exception.py:80 - Invalid HTTP_HOST header: 'XX.XX.XXX.XXX:32153'. You may need to add u'XX.XX.XXX.XXX' to ALLOWED_HOSTS.
10.233.100.0 - - [27/Feb/2020:10:07:52 +0000] "GET / HTTP/1.1" 400 26 "android-app://com.google.android.googlequicksearchbox" "Mozilla/5.0 (Linux; Android 8.0.0; WAS-LX1A) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.62 Mobile Safari/537.36"
2020-02-27 10:07:52,855 ERROR 10 [django.security.DisallowedHost] exception.py:80 - Invalid HTTP_HOST header: 'XX.XX.XXX.XXX:32153'. You may need to add u'XX.XX.XXX.XXX' to ALLOWED_HOSTS.
10.233.100.0 - - [27/Feb/2020:10:07:53 +0000] "GET /favicon.ico HTTP/1.1" 400 26 "http://XX.XX.XXX.XXX:32153/" "Mozilla/5.0 (Linux; Android 8.0.0; WAS-LX1A) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.62 Mobile Safari/537.36"

As deployment failed during database creation and migration, it seems that a table was not created.

Is there any workaround?

Thanks.

regisb commented 4 years ago

Hi @ChloeOCB! Sorry about the late answer.

"As deployment failed during database creation and migration, it seems that a table was not created."

Indeed, this is probably what happened. I suggest you delete the volumes attached to the mysql pod and run tutor k8s init again.

ChloeOCB commented 4 years ago

Hi @regisb,

I have never run tutor k8s init as it is not mentioned in the documentation. Actually, I only ran tutor k8s quickstart. Am I missing something?

I tried your suggestion but I still get the same errors.

Are you able to reproduce this problem?

regisb commented 4 years ago

@ChloeOCB I think I found the issue: the MinIO service cannot be found as indicated by: socket.gaierror: [Errno -2] Name or service not known. This is probably due to the fact that the MinIO host is set to "minio.www.myopenedx.com", and you most certainly do not own the "myopenedx.com" domain name.

Are you deploying to a live production environment? Then you need to configure your DNS records to point at your Kubernetes cluster and you need to configure your platform to actually use these domain names (during quickstart).
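One way to test this hypothesis is to check whether the configured host names resolve at all. The Python sketch below is only an illustration: the function name unresolved_hosts and the injectable resolve parameter are hypothetical, and the check is only meaningful when run from the same network context as the failing job (for instance inside the lms pod).

```python
import socket


def unresolved_hosts(hosts, resolve=socket.getaddrinfo):
    """Return the host names that fail DNS resolution.

    `resolve` defaults to socket.getaddrinfo and can be replaced
    in tests. A failed lookup raises socket.gaierror, the same
    exception seen in the migration traceback.
    """
    missing = []
    for host in hosts:
        try:
            resolve(host, None)
        except socket.gaierror:
            missing.append(host)
    return missing
```

For example, unresolved_hosts(["www.myopenedx.com", "minio.www.myopenedx.com"]) returns the subset of names that cannot be resolved from where the script runs. An empty list would suggest the problem lies elsewhere.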

regisb commented 4 years ago

@ChloeOCB can I close this?

ChloeOCB commented 4 years ago

Hi @regisb sorry for the delay.

Unfortunately, I tried with another domain name that I own and the issue is still the same.

It is not for a live production environment, only for testing.

I cannot investigate further for now.

Do you have any other ideas about this issue?

regisb commented 4 years ago

@ChloeOCB just wanted to say I did not forget about you. I'm currently in the process of improving the k8s stack and it should address some of your issues.

regisb commented 3 years ago

This should be working now that jobs wait for services to become available.
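As an illustration of what waiting for a service can look like, here is a small Python sketch that polls a TCP port until it accepts connections, similar in spirit to the dockerize -wait tcp://mysql:3306 -timeout 20s call in the logs above. It is not Tutor's actual implementation; the name wait_for_tcp and its default values are assumptions.

```python
import socket
import time


def wait_for_tcp(host, port, timeout=20.0, interval=0.5):
    """Poll host:port until a TCP connection succeeds or `timeout` elapses.

    Returns True as soon as a connection is accepted, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers refused connections, timeouts and DNS failures.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A job could call wait_for_tcp("mysql", 3306) before running migrations and abort with a clear message when it returns False.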

maitrungduc1410 commented 3 years ago

As of Feb 2021 this problem still persists. Here are my logs:

...
================================================
        Database creation and migrations
================================================
Waiting for a mysql pod to be ready...
kubectl wait --namespace openedx --selector=app.kubernetes.io/instance=openedx-laLjoz1coNV3xcJjWJDEQuAM,app.kubernetes.io/name=mysql --for=condition=ContainersReady --timeout=600s pod
error: no matching resources found
Error: Command failed with status 1: kubectl wait --namespace openedx --selector=app.kubernetes.io/instance=openedx-laLjoz1coNV3xcJjWJDEQuAM,app.kubernetes.io/name=mysql --for=condition=ContainersReady --timeout=600s pod

This looks the same as the logs from @ChloeOCB.

If I try

tutor -r . k8s stop # stop current deployments
tutor k8s quickstart # run quickstart again

It seems to work: the mysql job starts running, but then it fails because it can't connect to mysql. I tried but couldn't log in to mysql; the password seems to be broken somehow.

regisb commented 3 years ago

@maitrungduc1410 please post your questions on the forums: https://docs.tutor.overhang.io/troubleshooting.html Try to add as much information as possible, including logs from your mysql container (not the mysql-job container).