nasa / CF

The Core Flight System (cFS) CFDP application.
Apache License 2.0
79 stars 45 forks source link

Likely issues in processing of deferred/reordered "md" packets #131

Closed jphickey closed 2 years ago

jphickey commented 2 years ago

There appears to be logic within CF to support a case where the MetaData PDU arrives late - that is, after file data has already started. Basically, the logic is as follows:

The problem that is immediately visible here is that this close+reopen will reset the file position, but does not reset the cached position within the state object (t->state_data.r.cached_pos) that is used during writes to determine if an "lseek" is necessary. So now the file pointer is back to the beginning, but the cached_pos likely reflects the end of the file, and as a result new data will clobber existing content.

But the bigger issue is that this code is essentially duplicate/specialized logic that only handles a corner case/exception so it is unlikely to be adequately tested.

skliper commented 2 years ago

This is a pretty significant bug... really should fix this round and add a functional test with at least one out-of-order scenario. Possible/brainstorming scenario that would mimic com available mid transfer - send the last half of a file in at least 2 PDU's and FIN, confirm reception of metadata PDU NAK and file data NAK, send metadata PDU and EOF, confirm reception of file data NAK, send first half of file data (at least 2 PDU's) and EOF, confirm FIN and checksum passes.

@semaldona Might be a good ctf test to add this next round (depending on guidance from whoever is setting priorities)

EDIT - see below.. I probably don't quite have the protocol correct.

skliper commented 2 years ago

Above scenario wouldn't cause the issue... instead add another file data PDU destined for the end of the file after metadata such that cached location won't match current file location (but cached location will match the FD PDU offset). The lseek should be skipped and that data should get written to the wrong location if the bug exists.

semaldona commented 2 years ago

A CTF based test was developed that verifies the issue exists. The test shows that the bug prevents a file, sent via out-of-sequence data and metadata PDUs, from completing the CRC verification check. As currently implemented, CF will issue error event 86 with output similar to the following: "CF R1(23:2): failed to read file expected 560, got 480". Because the cached position within the CF state object (t->state_data.r.cached_pos) is not reset after the temporary file rename/reopen, the out of order data PDUs are not properly written to the file. Another processing issue is that the call to OS_rename will not work when the path to the table configured directory that stores the temporary file is different from the file destination path. An overall fix would involve:

  1. changing OS_rename to OS_mv
  2. setting the cached position to zero just after reopening the file created in the OS_mv operation
jphickey commented 2 years ago

A few thoughts on this ... ideally one would want to configure the temporary holding directory to be in the same filesystem as the final destination file, such that OS_rename does work as intended, because this is much faster (and arguably safer) than copying and removing the file.

I would prefer to see this pattern followed for all transfers, not just those for which the MD arrives out of order. That is, even for the normal case where the MD PDU does arrive first, consider opening the file with a temporary name anyway, and renaming it to the final name only after everything is received correctly and the CRC matches.

Doing it this way has several advantages:

In other words, lots of advantages to always doing the temp name followed by rename, not just relegating it to a corner case.

skliper commented 2 years ago

@jphickey - OS_mv does the rename where possible (faster) and supports the use case of tmp in RAM and final file in CF (or whatever persistent storage you want) to keep table uploads and similar use cases simple. I don't see anything wrong with supporting this pattern. Hopefully the user understands the performance hit, and if not we can add a documentation note. Making it such that you can only transfer files to one file system (the one where tmp is) seems really restrictive... and since the tmp_dir is an engine level config element it really does limit functionality.

The rest of your points I agree w/ completely, but as a separate issue. This PR is narrowly focused on fixing the out of order MD. I'd like to address changing to use a temp file for all receives independently.

jphickey commented 2 years ago

See issue #280 as the follow on change to do this for all transfers.

I concur that using OS_mv will make it more flexible/generic, but it is probably worth a documentation mentioning that there will likely be a (possibly very significant) performance penalty if the temporary location is not in the same underlying filesystem as the destination file.