Merge p file functionality

jessecusack commented 1 year ago

@mousebrains, I understand the need for the merge p file functionality. I would like to understand how this works given that it is not always required. How much overhead does it add to the code to run every time? Does it cache the results for future speed ups? It could be made optional if there is significant overhead.

https://github.com/jessecusack/perturb/blob/main/Code/merge_all_p_files_in_directory.m

mousebrains commented 1 year ago

For my test case, ~6GB and 280 P files, the check, without merging is ~2 seconds on my fully local MacBook file system. I have not tested it on a networked drive.

As of now, there is no caching, but it is on my list to look at. Given the ~2 second overhead I have not done anything yet.

jessecusack commented 1 year ago

Ok, that is great. Reading through the code I am getting a vague understanding of what it does, but much is still unclear. Could you clarify:

Does it rely on a specific layout of the p file binary data? If the p files are changed in the future, might this function become obsolete?
Does it delete/change/create new p files? (What are 'bad' files?)

I am concerned about the integrity of original raw data and possible dependence on Rockland file specs. I like to leave original data that comes off the instrument completely untouched (no renaming of files or moving data around).

mousebrains commented 1 year ago

It relies on the specific layout of P file in terms of the overall well defined structure, 128 byte header and a variable length payload, with the first payload being the configuration ASCII string. This structure has been stable for years and is defined in Rockland's Technical Note 0051.

Rockland will add some more content into the unused header record fields, but it will still be 128 bytes. They are looking at simplifying the rollover file detection method in the next iteration of the software, which will enable even faster detection. That will be a different check, based on the version of the P file specified in the header version field. This function will not become obsolete, but rather it will need enhancement with the updated detection method.

I'm insanely careful about the integrity of the data. The steps are:

The P files to be merged are merged into a temporary file.
If successful: A. The original P files are renamed to the same filename with a .orig suffix, as suggested by Rockland. B. If this is successful: a. The temporary file is rename to the first file in the list. i) If successful:
- move onto the next files to be merged. ii) If not successful:
- Restore all the .orig files to not having .orig
- Remove the temporary file
- Move onto the next files to be merged C. If (2.A) is not successful: a. Any .files moved to .orig are restored b. Remove the temporary file c. Move onto the next files to be merged
If not successful: A. Remove the temporary file B. Move onto the next files to be merged

Yes, I'm sure there is a thread for data corruption, but it is very tiny. That being said, this is the basic method Rockland is already using when you call odas_p2mat and there are bad blocks in a file, the file is moved to _original.p and an updated .p file is written in place.

I typically do the following with my data: a) I create a read-only copy of what came off the instrument. b) I create a working copy where I run my script on.

mousebrains commented 1 year ago

@jessecusack Can we close this?

jessecusack commented 1 year ago

I still have a bad feeling about this workflow. When I think of data integrity, I am thinking both of the corruption aspect and also the file attributes and metadata. To me the raw data directory should be a 1:1 copy of exactly what comes off the instrument, nothing more or less, and any software working with that should not modify it in any way. The current workflow breaks this integrity because it renames raw files and creates new ones that were not originally on the instrument. I understand that your workaround for this is to have a read-only repository of original data that is not touched by the code. This is not a bad idea, but it is prone to human error (I understand that all workflows are to some extend). Someone at sea may forget to follow this procedure.

In my ideal world, the merging processing would create a new directory to store the merged files and would not rename the the original files. The code would then search the merged directory and have a preference for the merged data over the original, if it exists. Would this be feasible to create?

mousebrains commented 1 year ago

My philosophy is to never work on the master copy directly! Yes, many people don't follow this! I've been focused on code simplicity and transparency.

Here are three methods for what you're describing:

We make a complete copy of the raw data, then only modify the copy. *This essentially automates my method of working on a copy of the master data. This entails a significant amount of file writing and a corresponding time overhead. The overhead can be minimized by using a mtime/size check for doing the copy or not. On inx systems, symbolic links are an option, but on older NTFS filesystems, all FAT, and many SMB mount points, symbolic links don't work.**
We make a windowed file directory structure, where the raw data is the back pane, and the modified data is in front of that. This minimizes excess writes, but adds to code complexity.
We add another database that tells the P file -> mat conversion to merge the required P files prior to generating the mat file, and to skip some P file conversions. This adds to code complexity, and is less transparent to me.

jessecusack commented 1 year ago

What do you mean by a windowed file directory structure?

mousebrains commented 1 year ago

The concept comes from the 1980s. I first learned it in GNU Make in the 1980s.

Think of a stack of directories to look in, shallowest to deepest. The deepest would be the raw directory. Shallower directories would be more processed than the deeper directories. It is like looking through a series of window panes.

Functionally the software knows the directory stack. When looking for a file it descends the directory stack from shallowest to deepest.

jessecusack commented 1 year ago

Hmmm ok. I think that I prefer your 3rd option or some variation on it and I thought of another reason why. If I accidentally run the code with "p_file_merge", true (which I just did) and then want to run it with "p_file_merge", false, do I still end up using the merged file? Is there an undo functionality? If there is a database, it can be ignored in the event that parameters change. If merged p files are created but kept in a separate directory, that directory can be ignore if processing parameters change.

mousebrains commented 1 year ago

Yes, you're correct, if you run with p_file_merge=true then change it to false, you'll currently be using the merged p_files, unless you manually rename the .p.orig files to .p.

I'll look at the database option.

mousebrains commented 11 months ago

Cleaned up in refactorization

jessecusack / perturb

Merge p file functionality #7