Closed jessecusack closed 11 months ago
For my test case (~6 GB and 280 P files), the check without merging takes ~2 seconds on my fully local MacBook file system. I have not tested it on a networked drive.
As of now there is no caching, but it is on my list to look at. Given the ~2 second overhead, I have not done anything yet.
Ok, that is great. Reading through the code I am getting a vague understanding of what it does, but much is still unclear. Could you clarify:
I am concerned about the integrity of the original raw data and a possible dependence on Rockland file specs. I like to leave original data that comes off the instrument completely untouched (no renaming of files or moving data around).
It relies on the specific layout of a P file: a well-defined structure with a 128-byte header and a variable-length payload, the first payload being the configuration ASCII string. This structure has been stable for years and is defined in Rockland's Technical Note 0051.
Rockland will add some more content to the currently unused header record fields, but the header will still be 128 bytes. They are looking at simplifying the rollover-file detection method in the next iteration of the software, which will enable even faster detection. That will be a different check, based on the P-file version specified in the header's version field. This function will not become obsolete; rather, it will need enhancement with the updated detection method.
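As a rough illustration of the fixed-header/variable-payload split described above (this is not Rockland's or perturb's code, and the field offsets within the header are defined in TN 0051, not reproduced here):

```python
# Hypothetical sketch: read the fixed-size P-file header.
# The 128-byte size comes from the discussion above; the layout of fields
# inside the header (e.g. the version field) is specified in Rockland's
# Technical Note 0051 and is intentionally not modelled here.
HEADER_SIZE = 128

def read_p_header(path):
    """Return the raw 128-byte header of a P file, or raise if truncated."""
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE:
        raise ValueError(f"{path}: truncated header ({len(header)} bytes)")
    return header
```

Because the header is fixed-size, a sanity check like this is cheap, which is consistent with the ~2 second scan over hundreds of files reported above.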
I'm insanely careful about the integrity of the data. The steps are:
Yes, I'm sure there is a threat of data corruption, but it is very tiny. That said, this is the basic method Rockland already uses: when you call odas_p2mat and there are bad blocks in a file, the file is moved to _original.p and an updated .p file is written in its place.
I typically do the following with my data: (a) I create a read-only copy of what came off the instrument; (b) I create a working copy that I run my scripts on.
@jessecusack Can we close this?
I still have a bad feeling about this workflow. When I think of data integrity, I am thinking both of the corruption aspect and of the file attributes and metadata. To me, the raw data directory should be a 1:1 copy of exactly what comes off the instrument, nothing more and nothing less, and any software working with it should not modify it in any way. The current workflow breaks this integrity because it renames raw files and creates new ones that were not originally on the instrument. I understand that your workaround is to keep a read-only repository of original data that is not touched by the code. This is not a bad idea, but it is prone to human error (I understand that all workflows are, to some extent). Someone at sea may forget to follow this procedure.
In my ideal world, the merging process would create a new directory to store the merged files and would not rename the original files. The code would then search the merged directory and prefer the merged data over the original, if it exists. Would this be feasible to create?
My philosophy is to never work on the master copy directly! Yes, many people don't follow this! I've been focused on code simplicity and transparency.
Here are three methods for what you're describing:

- Copy the raw data, then only modify the copy. *This essentially automates my method of working on a copy of the master data. It entails a significant amount of file writing and a corresponding time overhead, which can be minimized by using an mtime/size check to decide whether to copy or not.*
- Use symbolic links instead of copies. *On \*nix systems symbolic links are an option, but on older NTFS filesystems, all FAT, and many SMB mount points, symbolic links don't work.*
- A windowed file directory structure: the raw data is the back pane, and the modified data is in front of that. *This minimizes excess writes, but adds to code complexity.*

What do you mean by a windowed file directory structure?
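The mtime/size short-circuit for the copy-based method could look something like this sketch (`copy_if_changed` is a hypothetical helper in Python for illustration, not part of the repository):

```python
import shutil
from pathlib import Path

def copy_if_changed(src, dst):
    """Copy src to dst only when dst is missing or out of date.

    Sketch of the mtime/size check described above: if the destination
    already exists with the same size and an mtime at least as new as the
    source, the copy is skipped. A real implementation might also compare
    checksums for extra safety.
    """
    src, dst = Path(src), Path(dst)
    if dst.exists():
        s, d = src.stat(), dst.stat()
        if s.st_size == d.st_size and s.st_mtime <= d.st_mtime:
            return False  # up to date, skip the write
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)  # copy2 preserves mtime and other metadata
    return True
```

Because `shutil.copy2` preserves the source mtime, a second pass over an unchanged tree does no writing at all, which is where the overhead saving comes from.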
The concept comes from the 1980s; I first learned it from GNU Make.
Think of a stack of directories to look in, shallowest to deepest. The deepest would be the raw directory; shallower directories are more processed than deeper ones. It is like looking through a series of window panes. Functionally, the software knows the directory stack; when looking for a file, it descends the stack from shallowest to deepest.
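The stack lookup can be sketched as follows (a hypothetical `find_file` helper in Python, not the package's actual implementation):

```python
from pathlib import Path

def find_file(name, dir_stack):
    """Return the first hit for `name`, searching shallowest to deepest.

    dir_stack is ordered shallowest (most processed) to deepest (raw data),
    so a processed copy "in front" shadows the raw original behind it.
    Returns None if no directory in the stack contains the file.
    """
    for d in dir_stack:
        candidate = Path(d) / name
        if candidate.is_file():
            return candidate
    return None
```

With a merged-files directory stacked in front of the raw directory, this gives exactly the "prefer merged over original" behaviour requested above, without ever touching the raw files.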
Hmmm, ok. I think I prefer your third option, or some variation on it, and I thought of another reason why. If I accidentally run the code with `"p_file_merge", true` (which I just did) and then want to run it with `"p_file_merge", false`, do I still end up using the merged file? Is there undo functionality? If there is a database, it can be ignored in the event that parameters change. If merged P files are created but kept in a separate directory, that directory can be ignored if processing parameters change.
Yes, you're correct: if you run with `p_file_merge=true` and then change it to false, you'll currently still be using the merged P files, unless you manually rename the `.p.orig` files back to `.p`.
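The manual undo described here could be scripted along these lines (a hypothetical sketch, not part of the perturb codebase; note that it overwrites the merged `.p` files, so use with care):

```python
from pathlib import Path

def undo_merge(directory):
    """Rename every *.p.orig back to *.p, overwriting the merged file."""
    restored = []
    for orig in sorted(Path(directory).glob("*.p.orig")):
        target = orig.with_suffix("")  # e.g. "cast0001.p.orig" -> "cast0001.p"
        orig.replace(target)  # atomic on the same filesystem; clobbers the merged .p
        restored.append(target)
    return restored
```

This kind of clean-up is exactly what a separate merged-files directory would make unnecessary: reverting would just mean ignoring that directory.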
I'll look at the database option.
Cleaned up in the refactorization.
@mousebrains, I understand the need for the P-file merge functionality. I would like to understand how it works, given that it is not always required. How much overhead does it add to run the check every time? Does it cache the results for future speed-ups? It could be made optional if there is significant overhead.
https://github.com/jessecusack/perturb/blob/main/Code/merge_all_p_files_in_directory.m