rockstor / rockstor-core

Linux/BTRFS based Network Attached Storage (NAS)
http://rockstor.com/docs/contribute_section.html
GNU General Public License v3.0

dev name change breaking mounts #1043

Closed: phillxnet closed this issue 7 years ago

phillxnet commented 8 years ago

Corner case where a drive name change breaks mount points. While looking into reports of samba not starting even after pull request https://github.com/rockstor/rockstor-core/pull/1032 (which dealt with a bootstrap fail), I have reproduced a very low frequency issue of mount point failures due to a drive name change. In my instance samba still started, however the share was empty. This does not fit with a report of 3.8-9.07 still having a failed samba start by @kimbl on the forum (http://forum.rockstor.com/t/pool-not-mounted-on-reboot/731/4), but may be related, as the pool is not mounted as expected on boot.

Example of this issue:-

btrfs fi show
Label: 'rockstor_rockstor'  uuid: 0faa780f-a339-43b7-a49f-59cfa98548b5
        Total devices 1 FS bytes used 1.74GiB
        devid    1 size 25.47GiB used 3.27GiB path /dev/sda3

Label: 'time_machine_pool'  uuid: 8f363c7d-2546-4655-b81b-744e06336b07
        Total devices 3 FS bytes used 29.92GiB
        devid    1 size 149.05GiB used 19.00GiB path /dev/sdb
        devid    2 size 153.38GiB used 24.01GiB path /dev/sdc
        devid    3 size 149.05GiB used 19.01GiB path /dev/sdd

btrfs-progs v4.2.1

In the above, the only non-rockstor_rockstor pool consists of sdb, sdc, and sdd.

However Rockstor has attempted to mount our samba share (which is in the time_machine_pool) via a reference to sda; as can be seen, sda is the system drive and as such doesn't contain a whole-disk btrfs, only the standard partitioning.

This results in many mount errors for the various mount points, e.g.:-

[04/Dec/2015 15:38:34] ERROR [storageadmin.views.command:80] Exception while mounting a share(samba_share) during bootstrap: Error running a command. cmd = ['/bin/mount', '-t', 'btrfs', '-o', u'subvol=samba_share', u'/dev/sda', u'/mnt2/samba_share']. rc = 32. stdout = ['']. stderr = ['mount: /dev/sda is already mounted or /mnt2/samba_share busy', '']

This is soon thereafter followed by:-

[04/Dec/2015 15:38:35] DEBUG [storageadmin.views.command:126] Bootstrap operations completed
[04/Dec/2015 16:19:53] DEBUG [smart_manager.data_collector:276] Sysinfo has been initialized
[04/Dec/2015 16:19:53] DEBUG [smart_manager.data_collector:280] Sysinfo has connected
[04/Dec/2015 16:19:53] DEBUG [smart_manager.data_collector:131] network stats connected
[04/Dec/2015 16:19:54] DEBUG [smart_manager.data_collector:352] Disk state updated successfully
[04/Dec/2015 16:19:55] ERROR [storageadmin.middleware:35] Exception occured while processing a request. Path: /api/commands/refresh-pool-state method: POST
[04/Dec/2015 16:19:55] ERROR [storageadmin.middleware:36] Error running a command. cmd = ['/sbin/btrfs', 'fi', 'show', u'/dev/sda']. rc = 1. stdout = ['']. stderr = ['ERROR: No btrfs on /dev/sda', '']

As can be seen, the refresh-pool-state API POST is asking btrfs fi show to work on a non-btrfs sda, which is currently our boot drive with traditional partitioning where only the third partition is btrfs.
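For illustration only, a minimal sketch (not Rockstor's actual code; the helper name is hypothetical) of how a pool's current member devices could be resolved from its boot-to-boot stable label via btrfs fi show, rather than from a device name remembered in the db from a previous boot:

    import re
    import subprocess

    def pool_member_devices(label):
        """Return the current /dev paths of a btrfs pool's members, looked up
        by the pool's stable label rather than by a remembered kernel name."""
        out = subprocess.check_output(['btrfs', 'fi', 'show', label])
        # Device lines look like:
        #   devid  2 size 153.38GiB used 24.01GiB path /dev/sdc
        return re.findall(r'path\s+(\S+)', out.decode())

    # e.g. pool_member_devices('time_machine_pool')
    # -> ['/dev/sdb', '/dev/sdc', '/dev/sdd'] on the boot shown above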

Traceback (most recent call last):
  File "/opt/rockstor/eggs/Django-1.6.11-py2.7.egg/django/core/handlers/base.py", line 112, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/opt/rockstor/eggs/Django-1.6.11-py2.7.egg/django/views/decorators/csrf.py", line 57, in wrapped_view
    return view_func(*args, **kwargs)
  File "/opt/rockstor/eggs/Django-1.6.11-py2.7.egg/django/views/generic/base.py", line 69, in view
    return self.dispatch(request, *args, **kwargs)
  File "/opt/rockstor/eggs/djangorestframework-3.1.1-py2.7.egg/rest_framework/views.py", line 452, in dispatch
    response = self.handle_exception(exc)
  File "/opt/rockstor/eggs/djangorestframework-3.1.1-py2.7.egg/rest_framework/views.py", line 449, in dispatch
    response = handler(request, *args, **kwargs)
  File "/opt/rockstor/eggs/Django-1.6.11-py2.7.egg/django/db/transaction.py", line 371, in inner
    return func(*args, **kwargs)
  File "/opt/rockstor/src/rockstor/storageadmin/views/command.py", line 232, in post
    pool_info = get_pool_info(fd.name)
  File "/opt/rockstor/src/rockstor/fs/btrfs.py", line 70, in get_pool_info
    o, e, rc = run_command(cmd)
  File "/opt/rockstor/src/rockstor/system/osi.py", line 89, in run_command
    raise CommandException(cmd, out, err, rc)
CommandException: Error running a command. cmd = ['/sbin/btrfs', 'fi', 'show', u'/dev/sda']. rc = 1. stdout = ['']. stderr = ['ERROR: No btrfs on /dev/sda', '']
[04/Dec/2015 16:19:55] ERROR [smart_manager.data_collector:354] Failed to update pool state.. exception: Internal Server Error: No JSON object could be decoded
[04/Dec/2015 16:19:56] DEBUG [smart_manager.data_collector:352] Share state updated successfully
[04/Dec/2015 16:19:56] DEBUG [smart_manager.data_collector:352] Snapshot state updated successfully
[04/Dec/2015 16:19:57] DEBUG [smart_manager.data_collector:331] Updated Rock-on metadata.
[04/Dec/2015 16:19:59] DEBUG [smart_manager.data_collector:104] disconnect received
[04/Dec/2015 16:19:59] DEBUG [smart_manager.data_collector:136] network stats disconnected

Note the "Failed to update pool state.. exception: Internal Server Error: No JSON object could be decoded.

I think this issue is the root cause of a few sporadic mount / service problems.

This may relate to #897, which suspects previous drive names of disturbing subsequent "on boot" mounts.

phillxnet commented 8 years ago

From the client machine's perspective, in the above scenario the samba share is exported but empty. In my instance the samba service was started automatically, however it does have some cups / printer related issues:-

systemctl status smb -l
smb.service - Samba SMB Daemon
   Loaded: loaded (/etc/systemd/system/smb.service; enabled)
   Active: active (running) since Fri 2015-12-04 15:38:38 GMT; 1h 41min ago
 Main PID: 3173 (smbd)
   Status: "smbd: ready to serve connections..."
   CGroup: /system.slice/smb.service
           ├─3173 /usr/sbin/smbd
           ├─3174 /usr/sbin/smbd
           └─3182 /usr/sbin/smbd

Dec 04 16:42:40 rockstor smbd[3174]: [2015/12/04 16:42:40.708630, 0] ../source3/printing/print_cups.c:528(cups_async_callback)
Dec 04 16:42:40 rockstor smbd[3174]: failed to retrieve printer list: NT_STATUS_UNSUCCESSFUL
Dec 04 16:55:41 rockstor smbd[8217]: [2015/12/04 16:55:41.187834, 0] ../source3/printing/print_cups.c:151(cups_connect)
Dec 04 16:55:41 rockstor smbd[8217]: Unable to connect to CUPS server localhost:631 - Transport endpoint is not connected
Dec 04 16:55:41 rockstor smbd[3174]: [2015/12/04 16:55:41.188390, 0] ../source3/printing/print_cups.c:528(cups_async_callback)
Dec 04 16:55:41 rockstor smbd[3174]: failed to retrieve printer list: NT_STATUS_UNSUCCESSFUL
Dec 04 17:08:41 rockstor smbd[9931]: [2015/12/04 17:08:41.710182, 0] ../source3/printing/print_cups.c:151(cups_connect)
Dec 04 17:08:41 rockstor smbd[9931]: Unable to connect to CUPS server localhost:631 - Transport endpoint is not connected
Dec 04 17:08:41 rockstor smbd[3174]: [2015/12/04 17:08:41.710732, 0] ../source3/printing/print_cups.c:528(cups_async_callback)
Dec 04 17:08:41 rockstor smbd[3174]: failed to retrieve printer list: NT_STATUS_UNSUCCESSFUL

And the rockstor-bootstrap service:-

systemctl status rockstor-bootstrap -l
rockstor-bootstrap.service - Rockstor bootstrapping tasks
   Loaded: loaded (/etc/systemd/system/rockstor-bootstrap.service; enabled)
   Active: active (exited) since Fri 2015-12-04 15:38:38 GMT; 1h 43min ago
 Main PID: 3020 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/rockstor-bootstrap.service

Dec 04 15:38:37 rockstor bootstrap[3020]: BTRFS device scan complete
Dec 04 15:38:37 rockstor bootstrap[3020]: Exception occured while bootstrapping. This could be because rockstor.service is still starting up. will wait 2 seconds and try again. Exception: Exception while setting access_token for url(https://192.168.1.203): HTTPSConnectionPool(host='192.168.1.203', port=443): Max retries exceeded with url: /o/token/ (Caused by <class 'socket.error'>: [Errno 111] Connection refused). content: None
Dec 04 15:38:37 rockstor bootstrap[3020]: Bootstrapping complete
Dec 04 15:38:37 rockstor bootstrap[3020]: Running qgroup cleanup. /opt/rockstor/bin/qgroup-clean
Dec 04 15:38:37 rockstor bootstrap[3020]: Running qgroup limit maxout. /opt/rockstor/bin/qgroup-maxout-limit

Hope this helps to narrow down the issue.

phillxnet commented 8 years ago

In the failed state the mount point /mnt2/samba_share exists but is unused / empty.

phillxnet commented 8 years ago

N.B. on power cycling to a working config where the shares are mounted, the following is the arrangement:-

btrfs fi show
Label: 'rockstor_rockstor'  uuid: 0faa780f-a339-43b7-a49f-59cfa98548b5
        Total devices 1 FS bytes used 1.74GiB
        devid    1 size 25.47GiB used 3.27GiB path /dev/sdd3

Label: 'time_machine_pool'  uuid: 8f363c7d-2546-4655-b81b-744e06336b07
        Total devices 3 FS bytes used 29.92GiB
        devid    1 size 149.05GiB used 19.00GiB path /dev/sda
        devid    2 size 153.38GiB used 24.01GiB path /dev/sdb
        devid    3 size 149.05GiB used 19.01GiB path /dev/sdc

With this arrangement there is an attempt to mount the /home and /root subvolumes where they were last, i.e. on sda3:-

[04/Dec/2015 18:06:55] DEBUG [smart_manager.data_collector:276] Sysinfo has been initialized
[04/Dec/2015 18:06:55] DEBUG [smart_manager.data_collector:280] Sysinfo has connected
[04/Dec/2015 18:06:55] ERROR [storageadmin.views.command:80] Exception while mounting a share(home) during bootstrap: Error running a command. cmd = ['/bin/mount', '-t', 'btrfs', '-o', u'subvol=home', u'/dev/sda3', u'/mnt2/home']. rc = 32. stdout = ['']. stderr = ['mount: special device /dev/sda3 does not exist', '']
[04/Dec/2015 18:06:56] ERROR [storageadmin.views.command:80] Exception while mounting a share(root) during bootstrap: Error running a command. cmd = ['/bin/mount', '-t', 'btrfs', '-o', u'subvol=root', u'/dev/sda3', u'/mnt2/root']. rc = 32. stdout = ['']. stderr = ['mount: special device /dev/sda3 does not exist', '']
[04/Dec/2015 18:06:57] DEBUG [storageadmin.views.command:126] Bootstrap operations completed
[04/Dec/2015 18:06:58] DEBUG [smart_manager.data_collector:352] Disk state updated successfully

This however is not a problem as these subvolumes are already mounted.

phillxnet commented 8 years ago

A successful share mount boot's rockstor.log:-

[04/Dec/2015 19:20:43] DEBUG [smart_manager.data_collector:410] Listening on port http://127.0.0.1:8080 and on port 10843 (flash policy server)
[04/Dec/2015 19:20:46] DEBUG [storageadmin.views.command:126] Bootstrap operations completed

A failed share mount boot's rockstor.log:-

[04/Dec/2015 19:23:32] DEBUG [smart_manager.data_collector:410] Listening on port http://127.0.0.1:8080 and on port 10843 (flash policy server)
[04/Dec/2015 19:23:34] ERROR [storageadmin.views.command:80] Exception while mounting a share(home) during bootstrap: Error running a command. cmd = ['/bin/mount', '-t', 'btrfs', '-o', u'subvol=home', u'/dev/sdb3', u'/mnt2/home']. rc = 32. stdout = ['']. stderr = ['mount: special device /dev/sdb3 does not exist', '']
[04/Dec/2015 19:23:34] ERROR [storageadmin.views.command:80] Exception while mounting a share(root) during bootstrap: Error running a command. cmd = ['/bin/mount', '-t', 'btrfs', '-o', u'subvol=root', u'/dev/sdb3', u'/mnt2/root']. rc = 32. stdout = ['']. stderr = ['mount: special device /dev/sdb3 does not exist', '']
[04/Dec/2015 19:23:34] ERROR [storageadmin.views.command:80] Exception while mounting a share(MacBackups) during bootstrap: Error running a command. cmd = ['/bin/mount', '-t', 'btrfs', '-o', u'subvol=MacBackups', u'/dev/sda', u'/mnt2/MacBackups']. rc = 32. stdout = ['']. stderr = ['mount: /dev/sda is already mounted or /mnt2/MacBackups busy', '']
[04/Dec/2015 19:23:34] ERROR [storageadmin.views.command:80] Exception while mounting a share(samba_share) during bootstrap: Error running a command. cmd = ['/bin/mount', '-t', 'btrfs', '-o', u'subvol=samba_share', u'/dev/sda', u'/mnt2/samba_share']. rc = 32. stdout = ['']. stderr = ['mount: /dev/sda is already mounted or /mnt2/samba_share busy', '']
[04/Dec/2015 19:23:34] ERROR [storageadmin.views.command:80] Exception while mounting a share(Backups) during bootstrap: Error running a command. cmd = ['/bin/mount', '-t', 'btrfs', '-o', u'subvol=Backups', u'/dev/sda', u'/mnt2/Backups']. rc = 32. stdout = ['']. stderr = ['mount: /dev/sda is already mounted or /mnt2/Backups busy', '']
[04/Dec/2015 19:23:34] DEBUG [storageadmin.views.command:126] Bootstrap operations completed

The subsequent successful share mount boot's rockstor.log:-

[04/Dec/2015 19:34:28] DEBUG [smart_manager.data_collector:410] Listening on port http://127.0.0.1:8080 and on port 10843 (flash policy server)
[04/Dec/2015 19:34:30] ERROR [storageadmin.views.command:80] Exception while mounting a share(home) during bootstrap: Error running a command. cmd = ['/bin/mount', '-t', 'btrfs', '-o', u'subvol=home', u'/dev/sdb3', u'/mnt2/home']. rc = 32. stdout = ['']. stderr = ['mount: special device /dev/sdb3 does not exist', '']
[04/Dec/2015 19:34:30] ERROR [storageadmin.views.command:80] Exception while mounting a share(root) during bootstrap: Error running a command. cmd = ['/bin/mount', '-t', 'btrfs', '-o', u'subvol=root', u'/dev/sdb3', u'/mnt2/root']. rc = 32. stdout = ['']. stderr = ['mount: special device /dev/sdb3 does not exist', '']
[04/Dec/2015 19:34:31] DEBUG [storageadmin.views.command:126] Bootstrap operations completed

phillxnet commented 8 years ago

Attaching a text file containing the above rockstor.log boots. success-fail-success-share-mounts.txt

phillxnet commented 8 years ago

I suspect this issue is related to the incorrect labelling of removed (possibly moved) devices, so I am linking to a more easily reproducible example of this in issue #937 "removed a defective disk, storage disk showing wrong disk details".

Shifting my focus to that issue in the hope that it sheds light / helps with this issue also.

phillxnet commented 8 years ago

I have reproduced this issue without there being any removed / offline devices in the db, and via logging added to prove the function of the fix for #937 in #1045, which is included below to hopefully shed light on this issue. A 3 data drive system with an mSATA system drive was used to demonstrate this.

Share mounts can fail (intermittently, on boot) by using out-of-date info on which drive to mount. I.e., with the share mounted and working:-

sdd port has drive serial 5MT10L2C (a pool member) attached
sda port has drive serial 50026B724C085CD1 (system disk) attached

(system is rebooted and the drives re-arrange, with no phantom missing drives due to #1045 in play)

Share mounts are empty, i.e. the service (smb) has started but the mounts were unsuccessful.

On boot Rockstor attempts to mount the shares. The drives are now attached to the following device ports:-

sdd port has drive serial 50026B724C085CD1 (the system disk) attached
sdc port has drive serial 5MT10L2C (a pool member) attached

From the below log, at 17:57:35 samba_share fails to mount, and that is because it is attempting to mount via sdd; however sdd is now actually the system drive (sdd used to be attached to a data / pool member drive). The system drive is partitioned with /boot, swap, and then btrfs on the 3rd partition: nothing to do with the data / pool / shares in this case, hence the fail. At 17:58:14 we have an automated disk re-scan that now identifies and updates the db on what is where. However this is too late, as the share mounts have already failed by using an old db reference for the last location of the drives, not their current location.

The above synopsis was taken from the following log of a working share mount at boot and then a failed share mount on a subsequent boot.

Hence share mounts fail due to drive re-arrangements not being taken into account, as _update_disk_state() has yet to be called to update the db; it is only called 39 seconds later.
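For illustration only, a minimal sketch (not Rockstor's code; the helper name is hypothetical) of how a subvolume mount could instead be issued against the filesystem UUID, which stays the same however the kernel device names are shuffled between boots:

    import subprocess

    def mount_share_by_uuid(fs_uuid, subvol, mnt_pt):
        """Mount a btrfs subvolume by filesystem UUID rather than by a
        remembered /dev/sdX name, which can change from boot to boot."""
        cmd = ['/bin/mount', '-t', 'btrfs',
               '-o', 'subvol=%s' % subvol,
               'UUID=%s' % fs_uuid, mnt_pt]
        subprocess.check_call(cmd)

    # e.g. mount_share_by_uuid('8f363c7d-2546-4655-b81b-744e06336b07',
    #                          'samba_share', '/mnt2/samba_share')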

3.8-9.09 reproduced the blank share (failed mount) without duplicate serial entries in the db. The disks re-arranged but aren't re-scanned soon enough, hence incorrect mount points. N.B. the additional _update_disk_state() logging helps to demonstrate what is happening and indicates the drive port re-assignments from one boot to the next. The following rockstor log is a demo of this:-

[12/Dec/2015 17:55:21] DEBUG [smart_manager.data_collector:280] Sysinfo has connected
[12/Dec/2015 17:55:21] DEBUG [storageadmin.views.disk:67] Number of entries in db = 4
[12/Dec/2015 17:55:21] DEBUG [storageadmin.views.disk:68] INITIAL DB CLEAN UP
[12/Dec/2015 17:55:21] DEBUG [storageadmin.views.disk:70] PRE-PROCESS  db entry by name of sdb 5JS1HNX9
[12/Dec/2015 17:55:21] DEBUG [storageadmin.views.disk:98] POST-PROCESS db entry by name of eb0eda057c2745a8987842b399c90ae3 5JS1HNX9
[12/Dec/2015 17:55:21] DEBUG [storageadmin.views.disk:70] PRE-PROCESS  db entry by name of sdc PVE300Z5T1U26M
[12/Dec/2015 17:55:21] DEBUG [storageadmin.views.disk:98] POST-PROCESS db entry by name of 28156d69337f4676b8705aad301ace6b PVE300Z5T1U26M
[12/Dec/2015 17:55:21] DEBUG [storageadmin.views.disk:70] PRE-PROCESS  db entry by name of sdd 5MT10L2C
[12/Dec/2015 17:55:21] DEBUG [storageadmin.views.disk:98] POST-PROCESS db entry by name of 0013cac394df4fca8d653f648d8fe75e 5MT10L2C
[12/Dec/2015 17:55:21] DEBUG [storageadmin.views.disk:70] PRE-PROCESS  db entry by name of sda3 50026B724C085CD1
[12/Dec/2015 17:55:21] DEBUG [storageadmin.views.disk:98] POST-PROCESS db entry by name of 456ca63bccc04a78a5fbdfda49179a85 50026B724C085CD1
[12/Dec/2015 17:55:21] DEBUG [storageadmin.views.disk:99] Number of entries in db = 4
[12/Dec/2015 17:55:21] DEBUG [storageadmin.views.disk:150] FINAL DB CONTENTS
[12/Dec/2015 17:55:21] DEBUG [storageadmin.views.disk:152] FINAL db entry by name of sdd 5MT10L2C
(here we have a final db entry for sdd as serial 5MT10L2C )
[12/Dec/2015 17:55:21] DEBUG [storageadmin.views.disk:152] FINAL db entry by name of sda3 50026B724C085CD1

(the system drive is sda3)

[12/Dec/2015 17:55:22] DEBUG [storageadmin.views.disk:152] FINAL db entry by name of sdb 5JS1HNX9
[12/Dec/2015 17:55:22] DEBUG [storageadmin.views.disk:152] FINAL db entry by name of sdc PVE300Z5T1U26M
[12/Dec/2015 17:55:22] DEBUG [smart_manager.data_collector:352] Disk state updated successfully
[12/Dec/2015 17:55:22] DEBUG [smart_manager.data_collector:352] Pool state updated successfully
[12/Dec/2015 17:55:23] DEBUG [smart_manager.data_collector:352] Share state updated successfully
[12/Dec/2015 17:55:24] DEBUG [smart_manager.data_collector:352] Snapshot state updated successfully
[12/Dec/2015 17:55:25] DEBUG [smart_manager.data_collector:362] sent update information ('3.8-9.09', '3.8-9.09', [])
[12/Dec/2015 17:55:25] DEBUG [smart_manager.data_collector:291] Sysinfo has disconnected
[12/Dec/2015 17:55:25] DEBUG [smart_manager.data_collector:331] Updated Rock-on metadata.
[12/Dec/2015 17:55:28] DEBUG [smart_manager.data_collector:362] sent update information ('3.8-9.09', '3.8-9.09', [])
[12/Dec/2015 17:55:29] DEBUG [smart_manager.data_collector:291] Sysinfo has disconnected

(rebooted and drives re-arranged)

[12/Dec/2015 17:57:32] DEBUG [smart_manager.data_collector:410] Listening on port http://127.0.0.1:8080 and on port 10843 (flash policy server)
[12/Dec/2015 17:57:35] ERROR [storageadmin.views.command:80] Exception while mounting a share(home) during bootstrap: Error running a command. cmd = ['/bin/mount', '-t', 'btrfs', '-o', u'subvol=home', u'/dev/sda3', u'/mnt2/home']. rc = 32. stdout = ['']. stderr = ['mount: special device /dev/sda3 does not exist', '']
[12/Dec/2015 17:57:35] ERROR [storageadmin.views.command:80] Exception while mounting a share(root) during bootstrap: Error running a command. cmd = ['/bin/mount', '-t', 'btrfs', '-o', u'subvol=root', u'/dev/sda3', u'/mnt2/root']. rc = 32. stdout = ['']. stderr = ['mount: special device /dev/sda3 does not exist', '']
[12/Dec/2015 17:57:35] ERROR [storageadmin.views.command:80] Exception while mounting a share(MacBackups) during bootstrap: Error running a command. cmd = ['/bin/mount', '-t', 'btrfs', '-o', u'subvol=MacBackups', u'/dev/sdd', u'/mnt2/MacBackups']. rc = 32. stdout = ['']. stderr = ['mount: /dev/sdd is already mounted or /mnt2/MacBackups busy', '']
[12/Dec/2015 17:57:35] ERROR [storageadmin.views.command:80] Exception while mounting a share(samba_share) during bootstrap: Error running a command. cmd = ['/bin/mount', '-t', 'btrfs', '-o', u'subvol=samba_share', u'/dev/sdd', u'/mnt2/samba_share']. rc = 32. stdout = ['']. stderr = ['mount: /dev/sdd is already mounted or /mnt2/samba_share busy', '']
[12/Dec/2015 17:57:35] ERROR [storageadmin.views.command:80] Exception while mounting a share(Backups) during bootstrap: Error running a command. cmd = ['/bin/mount', '-t', 'btrfs', '-o', u'subvol=Backups', u'/dev/sdd', u'/mnt2/Backups']. rc = 32. stdout = ['']. stderr = ['mount: /dev/sdd is already mounted or /mnt2/Backups busy', '']
[12/Dec/2015 17:57:36] DEBUG [storageadmin.views.command:126] Bootstrap operations completed
[12/Dec/2015 17:58:13] DEBUG [smart_manager.data_collector:276] Sysinfo has been initialized
[12/Dec/2015 17:58:13] DEBUG [smart_manager.data_collector:280] Sysinfo has connected
[12/Dec/2015 17:58:13] DEBUG [smart_manager.data_collector:131] network stats connected

(note that the drives are only now scanned and re-assigned, which is too late for the mounts)

[12/Dec/2015 17:58:14] DEBUG [storageadmin.views.disk:67] Number of entries in db = 4
[12/Dec/2015 17:58:14] DEBUG [storageadmin.views.disk:68] INITIAL DB CLEAN UP
[12/Dec/2015 17:58:14] DEBUG [storageadmin.views.disk:70] PRE-PROCESS  db entry by name of sdd 5MT10L2C
[12/Dec/2015 17:58:14] DEBUG [storageadmin.views.disk:98] POST-PROCESS db entry by name of fe533e7868d346549d3d12f6cee831be 5MT10L2C
[12/Dec/2015 17:58:14] DEBUG [storageadmin.views.disk:70] PRE-PROCESS  db entry by name of sda3 50026B724C085CD1
[12/Dec/2015 17:58:14] DEBUG [storageadmin.views.disk:98] POST-PROCESS db entry by name of 715bae572f784860b88f1cf62e9fe522 50026B724C085CD1
[12/Dec/2015 17:58:14] DEBUG [storageadmin.views.disk:70] PRE-PROCESS  db entry by name of sdb 5JS1HNX9
[12/Dec/2015 17:58:14] DEBUG [storageadmin.views.disk:98] POST-PROCESS db entry by name of b069a022e5174a3892c99691a6844a21 5JS1HNX9
[12/Dec/2015 17:58:14] DEBUG [storageadmin.views.disk:70] PRE-PROCESS  db entry by name of sdc PVE300Z5T1U26M
[12/Dec/2015 17:58:14] DEBUG [storageadmin.views.disk:98] POST-PROCESS db entry by name of a9d8f067f5d042ed83ed70e5ac64dc7c PVE300Z5T1U26M
[12/Dec/2015 17:58:14] DEBUG [storageadmin.views.disk:99] Number of entries in db = 4
[12/Dec/2015 17:58:14] DEBUG [storageadmin.views.disk:150] FINAL DB CONTENTS
[12/Dec/2015 17:58:14] DEBUG [storageadmin.views.disk:152] FINAL db entry by name of sdd3 50026B724C085CD1

(the system drive is now sdd which is why the shares fail to mount)

[12/Dec/2015 17:58:14] DEBUG [storageadmin.views.disk:152] FINAL db entry by name of sda 5JS1HNX9
[12/Dec/2015 17:58:14] DEBUG [storageadmin.views.disk:152] FINAL db entry by name of sdb PVE300Z5T1U26M
[12/Dec/2015 17:58:14] DEBUG [storageadmin.views.disk:152] FINAL db entry by name of sdc 5MT10L2C

(so now 5MT10L2C is sdc)

[12/Dec/2015 17:58:14] DEBUG [smart_manager.data_collector:352] Disk state updated successfully
[12/Dec/2015 17:58:16] DEBUG [smart_manager.data_collector:352] Pool state updated successfully

phillxnet commented 8 years ago

Currently working on this issue again.

phillxnet commented 8 years ago

We need to execute _update_disk_state() in order to refresh the db with the new drive and pool info before we execute storageadmin.views.command with the "bootstrap" command, which mounts all the pools and shares using the db entries as guides.

My understanding is that rockstor-bootstrap.service (after rockstor.service) is the caller of the above bootstrap command, however something else is calling it first.

I can currently only find a single "bootstrap" call, in scripts/bootstrap.py, where I was intending to first insert a call to invoke _update_disk_state() so that the subsequent "bootstrap" call would have up-to-date db info on the drives / pools etc. The following is how scripts/bootstrap.py is calling commands/bootstrap via a post:-

            print('PROPOSED LOCATION for _update_disk_state CALL')
            aw = APIWrapper()
            aw.api_call('network')
            aw.api_call('commands/bootstrap', calltype='post')

scripts/bootstrap.py is invoked by rockstor-bootstrap.service via its execution of /opt/rockstor/bin/bootstrap.
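For illustration, a minimal sketch (extending the excerpt above, not the final implementation) of the kind of insertion intended: a POST that triggers _update_disk_state() before the bootstrap command runs. Treat the 'disks/scan' endpoint name as an assumption here:

            aw = APIWrapper()
            # Assumed endpoint: re-scan the attached disks so the db reflects
            # their current names before any pool / share mounts are attempted.
            aw.api_call('disks/scan', calltype='post')
            aw.api_call('network')
            aw.api_call('commands/bootstrap', calltype='post')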

However I first need to find where else commands/bootstrap is being invoked, as therein lies the cause of the premature share mount, i.e. before the db is updated.

I.e.:-

[19/Dec/2015 17:11:54] DEBUG [storageadmin.views.command:126] Bootstrap operations completed

So the "Bootstrap operations completed" above is logged before scripts/bootstrap.py reaches the proposed call location:

Dec 19 17:11:57 rockstor bootstrap: BTRFS device scan complete
Dec 19 17:11:57 rockstor bootstrap: PROPOSED LOCATION for _update_disk_state CALL
Dec 19 17:11:57 rockstor bootstrap: Bootstrapping complete

I will try and continue on with this tomorrow (UK). Noting progress thus far in case anyone wants to chip in with any pointers.

phillxnet commented 8 years ago

Still exists in 3.8-10.08 but much less frequently:-

[29/Dec/2015 18:58:14] DEBUG [smart_manager.data_collector:395] Listening on port http://127.0.0.1:8080 and on port 10843 (flash policy server)
[29/Dec/2015 18:58:17] ERROR [storageadmin.views.command:80] Exception while mounting a share(home) during bootstrap: Error running a command. cmd = ['/bin/mount', '-t', 'btrfs', '-o', u'subvol=home', u'/dev/sdb3', u'/mnt2/home']. rc = 32. stdout = ['']. stderr = ['mount: special device /dev/sdb3 does not exist', '']
[29/Dec/2015 18:58:17] ERROR [storageadmin.views.command:80] Exception while mounting a share(root) during bootstrap: Error running a command. cmd = ['/bin/mount', '-t', 'btrfs', '-o', u'subvol=root', u'/dev/sdb3', u'/mnt2/root']. rc = 32. stdout = ['']. stderr = ['mount: special device /dev/sdb3 does not exist', '']
[29/Dec/2015 18:58:17] ERROR [storageadmin.views.command:80] Exception while mounting a share(MacBackups) during bootstrap: Error running a command. cmd = ['/bin/mount', '-t', 'btrfs', '-o', u'subvol=MacBackups', u'/dev/sdd', u'/mnt2/MacBackups']. rc = 32. stdout = ['']. stderr = ['mount: /dev/sdd is already mounted or /mnt2/MacBackups busy', '']
[29/Dec/2015 18:58:17] ERROR [storageadmin.views.command:80] Exception while mounting a share(samba_share) during bootstrap: Error running a command. cmd = ['/bin/mount', '-t', 'btrfs', '-o', u'subvol=samba_share', u'/dev/sdd', u'/mnt2/samba_share']. rc = 32. stdout = ['']. stderr = ['mount: /dev/sdd is already mounted or /mnt2/samba_share busy', '']
[29/Dec/2015 18:58:17] ERROR [storageadmin.views.command:80] Exception while mounting a share(Backups) during bootstrap: Error running a command. cmd = ['/bin/mount', '-t', 'btrfs', '-o', u'subvol=Backups', u'/dev/sdd', u'/mnt2/Backups']. rc = 32. stdout = ['']. stderr = ['mount: /dev/sdd is already mounted or /mnt2/Backups busy', '']
[29/Dec/2015 18:58:18] DEBUG [storageadmin.views.command:126] Bootstrap operations completed
[29/Dec/2015 18:58:29] INFO [storageadmin.views.disk:69] update_disk_state() Called
[29/Dec/2015 18:58:29] DEBUG [smart_manager.data_collector:338] Disk state updated successfully

Note the attempt to mount the shares using info in the db before the db is updated by update_disk_state() (and its call of scan_disks).

Note the mounts fail on this boot due to sdd now referring to the system drive.

schakrava commented 8 years ago

Thanks for all the details @phillxnet. I believe a few recent commits collectively reduced this issue, the main one being https://github.com/schakrava/rockstor-core/commit/b6081dc5ec62fed36ce58685789641416f7039b4

I've tried to reproduce it just now as follows:

  1. I have 2 pools of raid1 profile with 2 disks each
  2. There is one share per pool and they are exported via samba
  3. I powered down the machine, shuffled the disks around and rebooted. Disks came up with new names.
  4. But storage state refresh happened successfully, pools and shares got mounted properly.

So effectively, I am not able to reproduce the problem, or at least what I think to be the problem. One problem with this issue is that, while it has a lot of information, it lacks clear reproducible steps, so I am not sure if I am reproducing correctly. Since you (@phillxnet) have reproduced the problem, could you try to do the same with the latest master and report your findings?

phillxnet commented 8 years ago

@schakrava OK, I'll have another go. I suspect that the more drives there are, the less likely it is to happen, given any one of the drives in a pool can be used for the mount, so by accident it can still work. However the system I used to demonstrate via the attached log segments had 3 data drives and it still happened. I don't think any recent commit has addressed the issue directly, but it did seem to happen less often during the course of this issue. I actually had it happen today on a system with a single system drive and a single data drive, and I think it's the remaining cause of our sporadic empty share reports.

Maybe the best arrangement to reproduce this is with just 2 drives, one system and one data, and then swap them, as then the data drive db entries will fail (as they can only be pointing at the system drive if switched), while the system drive does seem to sort itself out either way. Hence empty shares.

Tricky, as you say, because the drive names have to switch, which isn't guaranteed even if one swaps cables between the two. I'll have another go myself with the minimum number of drives and report back. I'd kept the system that reproduced this as my other machines are more stable re drive name changes.

Re the indicated commit, I didn't think that it refreshed the drives themselves, i.e. it doesn't seem to invoke scan_disks, which is what would be needed prior to using any db entry, as otherwise the entries are from a previous boot. I have most probably missed something, but I'll have another stab at reproducing this in a more reliable manner.

Cheers.

ajkWare commented 8 years ago

Just for the record, I see this problem regularly. Currently I use this system to secure back-ups (well, synchronised copies really) weekly, so I switch the machine (an old HP Microserver N54L) on to do this. The shares don't show up about 50% of the time. (The first time it happened, I had a minor panic attack!) There are three discs plugged into the system. If there's anything you'd like to know to help with diagnosis, I'm happy to give it a (noobie) go.
Cheers Andy

phillxnet commented 8 years ago

Linking to a recent pr that looks like it may help with this, introduced in 3.8-12.06.

#1234

phillxnet commented 8 years ago

For context on possible development with this issue: it may be that the pending changes in pr #1338 / #1357 for issue #1320 address the root cause of this issue, which is now rather difficult to reproduce. After the above pr has been reviewed and, if appropriate, committed, I can have another look at this issue, as then all db-maintained names will have moved to by-id (boot-to-boot stable) names. This I think will either fix this issue or at least provide richer boot-to-boot stable names in the logs for all devices with by-id type names available.
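As an aside on why by-id names are boot-to-boot stable, a minimal sketch (not the pr's code; the helper name is hypothetical) that maps each whole-disk by-id symlink, which udev builds from the device serial, to the kernel name it currently points at:

    import os

    def by_id_to_kernel_names():
        """Map udev by-id names (serial based, stable across boots) to the
        kernel names (sda, sdb, ...) they currently resolve to."""
        by_id = '/dev/disk/by-id'
        mapping = {}
        for name in sorted(os.listdir(by_id)):
            if '-part' in name:  # skip partition symlinks, keep whole disks
                continue
            target = os.path.realpath(os.path.join(by_id, name))
            mapping[name] = os.path.basename(target)
        return mapping

    # The by-id keys survive a reboot even when the kernel names get shuffled,
    # e.g. {'ata-..._5MT10L2C': 'sdc', ...} after the re-arrangement above.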

phillxnet commented 8 years ago

@ajkajk Sorry not to have replied here earlier; it would seem you have quite a choice setup where this issue is concerned. Do you still have such a high incidence rate? And are you running Stable channel updates or Testing channel? Thanks for your offer to help with this one. If in fact pr #1338 / #1357 bears positively on this issue then it would be great if you could let us know if things improve for you after that on your problematic system. The pull request is due to be reviewed and hopefully included towards the beginning of the next testing channel cycle, i.e. after the release of the version 3.8-14 stable channel update. I will update this issue on its successful merge and release, if all goes well, so that we might re-address this issue thereafter. Thanks.

phillxnet commented 8 years ago

Linking to a recent forum post where @Stefan may have a system that reliably reproduces what looks like this issue. Please update the following forum thread with significant developments on this issue: https://forum.rockstor.com/t/rockstor-3-8-14-not-mounting-shares-subvolume/1693

ajkWare commented 8 years ago

@phillxnet - I'm not able to access the server currently. It was still exhibiting the problem on 20 Apr when I last had access to it, using the then-current test release. I hope to be back mid-July and will be in a position to try stuff then. Cheers Andy

phillxnet commented 8 years ago

@ajkajk Thanks for the update. I am hoping the pending pull request successfully addresses this issue, so it would be helpful to have your feedback on whether it works for your particular setup. Thanks. This issue will be updated once the indicated changes are released.

phillxnet commented 8 years ago

@ajkajk Just a heads up that the code changes in pull request #1357 have now been merged and released via Rockstor version 3.8-14.02 testing channel updates. Hope it works for you when you get back. Feedback either way on whether these changes have sorted your problem would be appreciated. I know you don't have access to this machine until mid-July, but just noting that what is hopefully a fix for at least one of the causes of this issue has now been released.

phillxnet commented 8 years ago

This issue is awaiting confirmation of what is hopefully a fix released as of Rockstor version 3.8-14.02 testing channel updates. The issue will remain open for the time being until the released fix has been confirmed to address the problem.

ajkWare commented 7 years ago

@phillxnet Finally back and able to access my Rockstor NAS. Performed 15 reloads using the software on it (which doesn't include your changes), mainly by power-cycling. 4 of these reloads resulted in the shares not being visible. Updated to release 3.8-13.12. Haven't seen the problem since, after 15 further loads. So I think we can close this issue.

Cheers Andy

phillxnet commented 7 years ago

@ajkajk thanks for the testing feedback: much appreciated and invaluable. I'm glad it's working better for you. Will probably move to closing the issue shortly.

phillxnet commented 7 years ago

@ajkajk Just to confirm your current updated version number: in your last post in this issue you state 3.8-13.12 (just noticed this); I'm hoping you meant 3.8-14.13 or possibly 3.8-14.11, which are our last two testing release versions. Anything greater than 3.8-14.02 testing channel updates has the proposed fix in. Thanks, and sorry I missed this earlier.

ajkWare commented 7 years ago

@phillxnet Yes - sorry - Don't panic - I updated from 3.8-13.12 to 3.8.14, which is the latest testing version at the moment according to the system update page. I misread my notes :-( (Incidentally, the upgrade failed because my thumb-drive was full of old versions and I had to reinstall to get to the later version - I don't know if the old kernels lying around should be housekept by me or by the update process?)

phillxnet commented 7 years ago

@ajkajk OK, that's a relief, thanks.

Re the update, I'm not sure what happened there. There is a mechanism to remove the old kernels (see issue #1068 and pr #1088 by @schakrava) so that /boot doesn't fill up, so it's probably not that partition, but I think there might still be an issue with / filling up in some instances. A matter for another issue, I suspect.

Cheers and thanks for checking / reporting the version numbers.

phillxnet commented 7 years ago

Closing as fixed by pr #1357 for issue #1320, which removed the use of non-boot-safe names from within the db, thereby also removing the intermittent initial boot sequence race conditions that were to blame for this issue. Note however that for devices with no serial number there is still a fail-over to non-boot-safe names within the db: this is due to udev's reliance on serial numbers for its by-id symlink creation. The user is clearly informed of this situation within the Web-UI, and any further refinement can be more specifically addressed in future issues.

Thanks to @ajkajk for final confirmation of fix.