nasa / osal

The Core Flight System (cFS) Operating System Abstraction Layer (OSAL)
Apache License 2.0
546 stars 214 forks source link

cFS fails to start if executed off of a large NFS home directory #762

Open jhnphm opened 3 years ago

jhnphm commented 3 years ago

Describe the bug If cFS is executed off of an NFS home directory, an output similar to the following is created:

OS_BSP_Initialize():Maximum user msg queue depth = 1024
CFE_PSP: Default Reset SubType = 1
CFE_PSP: Default CPU ID = 8
CFE_PSP: Default Spacecraft ID = 42
CFE_PSP: Default CPU Name: ciu
OS_Posix_GetSchedulerParams():190:Policy 1: available, min-max: 1-99
OS_Posix_GetSchedulerParams():190:Policy 2: available, min-max: 1-99
OS_Posix_TaskAPI_Impl_Init():376:Selected policy 2 for RT tasks, root task = 99
OS_Posix_TaskAPI_Impl_Init():392:Could not setschedparam in main thread: Operation not permitted (1)
CFE_PSP: OS_FileSysAddFixedMap() failure: -106
CFE_PSP: Starting the cFE with a POWER ON reset.
CFE_PSP: Clearing out CFE CDS Shared memory segment.
CFE_PSP: Clearing out CFE Reset Shared memory segment.
CFE_PSP: Clearing out CFE User Reserved Shared memory segment.
1980-079-14:44:45.09866 POWER ON RESET due to Power Cycle (Power Cycle).
1980-079-14:44:45.09868 ES Startup: CFE_ES_Main in EARLY_INIT state
CFE_PSP: CFE_PSP_AttachExceptions Called
1980-079-14:44:45.09887 ES Startup: CFE_ES_Main entering CORE_STARTUP state
1980-079-14:44:45.09888 ES Startup: Starting Object Creation calls.
1980-079-14:44:45.09888 ES Startup: Calling CFE_ES_CDSEarlyInit
1980-079-14:44:45.09899 ES Startup: Calling CFE_EVS_EarlyInit
1980-079-14:44:45.09901 Event Log cleared following power-on reset
1980-079-14:44:45.09902 ES Startup: Calling CFE_SB_EarlyInit
1980-079-14:44:45.09903 ES Startup: Calling CFE_TIME_EarlyInit
1980-012-14:03:20.00000 ES Startup: Calling CFE_TBL_EarlyInit
1980-012-14:03:20.00005 ES Startup: Calling CFE_FS_EarlyInit
EVS Port1 42/8/CFE_EVS 1: cFE EVS Initialized. cFE DEVELOPMENT BUILD v6.8.0-rc1+dev236 (Codename: Bootes), Last Official Release: cfe v6.7.0
1980-012-14:03:20.15048 SB:Registered 4 events for filtering
EVS Port1 42/8/CFE_SB 1: cFE SB Initialized
EVS Port1 42/8/CFE_SB 14: No subscribers for MsgId 0xa1f,sender CFE_SB
EVS Port1 42/8/CFE_ES 1: cFE ES Initialized
EVS Port1 42/8/CFE_SB 14: No subscribers for MsgId 0xa1f,sender CFE_ES
EVS Port1 42/8/CFE_ES 2: cFS Versions: cfe v6.8.0-rc1+dev236, osal v5.1.0-rc1+dev184, psp v1.5.0-rc142. cFE chksm 6317
EVS Port1 42/8/CFE_SB 14: No subscribers for MsgId 0xa1f,sender CFE_ES
EVS Port1 42/8/CFE_ES 91: Mission demo4-128-g78d3e72-dirty.HALO, CFE git version: demo3-210-g1ede295, OSAL git version: demo3-182-g957e598-dirty
EVS Port1 42/8/CFE_SB 14: No subscribers for MsgId 0xa1f,sender CFE_ES
EVS Port1 42/8/CFE_ES 92: Build 202101150523 phamj@va31spax062h.atk.com
EVS Port1 42/8/CFE_TIME 1: cFE TIME Initialized
EVS Port1 42/8/CFE_TBL 1: cFE TBL Initialized. cFE DEVELOPMENT BUILD v6.8.0-rc1+dev236 (Codename: Bootes), Last Official Release: cfe v6.7.0
1980-012-14:03:20.50107 ES Startup: Finished ES CreateObject table entries.
1980-012-14:03:20.50108 ES Startup: CFE_ES_Main entering CORE_READY state
1980-012-14:03:20.50108 ES Startup: Error, Can't Open ES App Startup file: /cf/cfe_es_startup.scr EC = 0xFFFFFF94
1980-012-14:03:20.50108 ES Startup: CFE_ES_Main entering APPS_INIT state
1980-012-14:03:20.50109 ES Startup: CFE_ES_Main entering OPERATIONAL state
EVS Port1 42/8/CFE_TIME 21: Stop FLYWHEEL
^C1980-012-14:05:59.69445 ExceptionID 0x1110000 in TaskID 0: Caught SIGINT
1980-012-14:05:59.69446 Processor Reset count not reached (0/2)
CFE_PSP: Exiting cFE with PROCESSOR Reset status.

CFE_PSP: Shutdown initiated - Exiting cFE
OS_FileSysMountVolume_Impl():251:ERROR: Cannot create mount point ./cf: File exists

To Reproduce Execute Linux cFS binary off of an NFS home directory

Expected behavior cFS reads cfe_es_startup.scr successfully

Code snips https://github.com/nasa/osal/blob/bfca5b21f5fc2032505256fc67ae974f53a2a319/src/os/posix/src/os-impl-filesys.c#L247-L254

System observed on: Linux (RHEL7), OSAL 5.1.0-rc1+dev184

Reporter Info John N Pham, Northrop Grumman

skliper commented 3 years ago

Hi @jhnphm - Are you changing anything in the PSP, or is this the default Status = OS_FileSysAddFixedMap(&fs_id, "./cf", "/cf");? Seems strange it would say trying to create mount point ./cf since it's mapping to /cf (at least for the default setup).

Note - we use NSF mounted home for our development VMs (Ubuntu 18.04) without an issue.

jphickey commented 3 years ago

@jhnphm -- just to confirm - you are starting from the CFE core from the installation location, right? That is, the current working directory is the directory containing the core executable - and you are starting it as ./core-cpu1 (or whatever your CPU name is). Runtime paths become relative to the working directory.

This is because the "pc-linux" PSP (at least by default) uses the current working dir as its reference point, which simplifies running it on development hosts - to avoid requiring any absolute paths to be configured. Hence why it is trying to use ./cf as the system path (relative) as opposed to an absolute path.

Normally the system should not be trying to create the /cf mount point - this should already exist because files should have been installed here (cfe_es_startup.scr, all your app files, tables, etc). The fact that mkdir is failing I think is secondary here - it should not be trying to make this dir on the fly in the first place (that is really just for RAM disks and such).

Finally - although I do think its a secondary issue - can you possibly capture what the errno value is after stat() but before the mkdir() call? i.e. add another OS_DEBUG macro just like the other one - but before it calls mkdir. Based on the evidence here it looks like the stat() call returned an error, but it was an error for a reason other than the directory not existing. The code (perhaps incorrectly?) assumes if stat doesn't work, it is because the directory doesn't exist. There could be something else we need to consider at that point.

jhnphm commented 3 years ago

I'm using the default PSP for Linux w/o changes. W/ a OS_DEBUG after stat:

OS_BSP_Initialize():Maximum user msg queue depth = 1024
CFE_PSP: Default Reset SubType = 1
CFE_PSP: Default CPU ID = 8
CFE_PSP: Default Spacecraft ID = 42
CFE_PSP: Default CPU Name: ciu
OS_Posix_GetSchedulerParams():190:Policy 1: available, min-max: 1-99
OS_Posix_GetSchedulerParams():190:Policy 2: available, min-max: 1-99
OS_Posix_TaskAPI_Impl_Init():376:Selected policy 2 for RT tasks, root task = 99
OS_Posix_TaskAPI_Impl_Init():392:Could not setschedparam in main thread: Operation not permitted (1)
OS_FileSysMountVolume_Impl():249:statdir: ./cf, errno: Value too large for defined data type    #<-- This is new
CFE_PSP: OS_FileSysAddFixedMap() failure: -106
CFE_PSP: Normal exit from previous cFE instance
CFE_PSP: Starting the cFE with a PROCESSOR reset.
1980-080-06:58:40.38351 ES Startup: CFE_ES_Main in EARLY_INIT state
1980-080-06:58:40.38355 Volatile Disk has 99 Percent free space.
CFE_PSP: CFE_PSP_AttachExceptions Called
1980-080-06:58:40.38373 ES Startup: CFE_ES_Main entering CORE_STARTUP state
1980-080-06:58:40.38373 ES Startup: Starting Object Creation calls.
1980-080-06:58:40.38374 ES Startup: Calling CFE_ES_CDSEarlyInit
1980-080-06:58:40.38376 ES Startup: Calling CFE_EVS_EarlyInit
1980-080-06:58:40.38377 Event Log restored, n=0, c=20, f=1, m=1, o=6
1980-080-06:58:40.38378 ES Startup: Calling CFE_SB_EarlyInit
1980-080-06:58:40.38380 ES Startup: Calling CFE_TIME_EarlyInit
1980-012-14:03:20.00000 ES Startup: Calling CFE_TBL_EarlyInit
1980-012-14:03:20.00005 ES Startup: Calling CFE_FS_EarlyInit
EVS Port1 42/8/CFE_EVS 1: cFE EVS Initialized. cFE DEVELOPMENT BUILD v6.8.0-rc1+dev236 (Codename: Bootes), Last Official Release: cfe v6.7.0
1980-012-14:03:20.15041 SB:Registered 4 events for filtering
EVS Port1 42/8/CFE_SB 1: cFE SB Initialized
EVS Port1 42/8/CFE_SB 14: No subscribers for MsgId 0xa1f,sender CFE_SB
EVS Port1 42/8/CFE_ES 1: cFE ES Initialized
EVS Port1 42/8/CFE_SB 14: No subscribers for MsgId 0xa1f,sender CFE_ES
EVS Port1 42/8/CFE_ES 2: cFS Versions: cfe v6.8.0-rc1+dev236, osal v5.1.0-rc1+dev184, psp v1.5.0-rc142. cFE chksm 48366
EVS Port1 42/8/CFE_SB 14: No subscribers for MsgId 0xa1f,sender CFE_ES
EVS Port1 42/8/CFE_ES 91: Mission demo4-128-g78d3e72-dirty.HALO, CFE git version: demo3-210-g1ede295, OSAL git version: demo3-182-g957e598-dirty
EVS Port1 42/8/CFE_SB 14: No subscribers for MsgId 0xa1f,sender CFE_ES
EVS Port1 42/8/CFE_ES 92: Build 202101151408 phamj@va31spax062h.atk.com
EVS Port1 42/8/CFE_TIME 1: cFE TIME Initialized
EVS Port1 42/8/CFE_TBL 1: cFE TBL Initialized. cFE DEVELOPMENT BUILD v6.8.0-rc1+dev236 (Codename: Bootes), Last Official Release: cfe v6.7.0
OS_FileSysMountVolume_Impl():252:ERROR: Cannot create mount point ./cf: File existsOS_FileOpen_Impl():114:open(/dev/shm/osal:RAM/cfe_es_startup.scr): No such file or directory
1980-012-14:03:20.50089 ES Startup: Finished ES CreateObject table entries.
1980-012-14:03:20.50090 ES Startup: CFE_ES_Main entering CORE_READY state
1980-012-14:03:20.50093 ES Startup: Cannot Open Volatile Startup file, Trying Nonvolatile.
1980-012-14:03:20.50094 ES Startup: Error, Can't Open ES App Startup file: /cf/cfe_es_startup.scr EC = 0xFFFFFF94
1980-012-14:03:20.50094 ES Startup: CFE_ES_Main entering APPS_INIT state
1980-012-14:03:20.50094 ES Startup: CFE_ES_Main entering OPERATIONAL state
EVS Port1 42/8/CFE_TIME 21: Stop FLYWHEEL
jphickey commented 3 years ago

Suggestion - Can you try adding this:

add_definitions(-D_LARGEFILE_SOURCE)

to your arch_build_custom.cmake file? (This is under your "defs" directory). Then I recommend doing a full rebuild. This might be due to the large size of your NFS file system.

jhnphm commented 3 years ago

Hm, that didn't fix it. add_definitions(-D_FILE_OFFSET_BITS=64) did though. Thanks!

the-other-james commented 2 years ago

I just ran into this bug as well. Fixed by adding -D_FILE_OFFSET_BITS=64, but error from stat() was different ("Value too large for defined data type").

As the code is now, I think the assumption that stat failed because the directory doesn't exist is leading to confusing debug messages. Ideally the code would report the errno from the stat call before attempting mkdir.