Open berntpopp opened 2 months ago
MD5 Checksum Mechanism:
MD5 checksums are stored in a `md5sum.txt` file within the relevant folders to enable checking for existing results before recalculating.
Reuse of Intermediate Files:
Intermediate files are stored in an `intermediate` subfolder within the plasmid input directory. This ensures they can be reused across different runs.
File Naming and Uniqueness:
Avoid Redundant MD5 Calculations:
Copying Files to Output Folder:
Pipeline Functionality:
The `run_pipeline` function now supports the reuse of intermediate files based on MD5 checksums. However, the checksums must be stored and retrieved correctly on every run, or the pipeline will recompute results it already has.
The `plasmid_intermediate_folder` is correctly used to store intermediate files, but care should be taken to avoid overwriting files unless explicitly requested.
Utils Functionality:
The functions in `utils.py` (such as `calculate_md5`, `write_md5sum`, `load_md5sum`, and `check_md5sum`) provide the foundation for the MD5-based reuse mechanism. Ensure these functions are well-tested and handle edge cases, such as missing or corrupted files.
The `copy_file_to_folder` function ensures that intermediate files are copied to the output directory. This is essential for debugging and downstream analyses.
File Naming:
`run_pipeline.py` uses a combination of the human reference, plasmid, and sequencing file base names to generate unique file names. This approach should work well to avoid file conflicts.
Testing:
Performance Optimization:
Documentation:
Feedback and Iteration:
Description: Extend the existing mechanism for reusing indices based on md5sums to also reuse spliced alignment information. By calculating and storing the md5sums of the plasmid and reference files, the pipeline can determine if the spliced alignment has already been performed for a given combination and reuse the existing output. This will save computational resources and time, especially for large datasets.
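As a rough illustration of the reuse check described above, here is a minimal sketch. Everything in it is an assumption made for illustration: the helper names (`file_md5`, `read_checksums`, `record_inputs`, `alignment_is_reusable`) are hypothetical and are not the project's `utils.py` functions, and the `md5sum.txt` layout is assumed to follow the `<digest>  <file name>` convention of the `md5sum` command-line tool, which may differ from what the pipeline actually writes.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_checksums(md5_file):
    """Parse '<digest>  <name>' lines; a missing file yields an empty dict."""
    md5_path = Path(md5_file)
    checksums = {}
    if not md5_path.exists():
        return checksums
    for line in md5_path.read_text().splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:  # skip blank or malformed lines
            digest, name = parts
            checksums[name.strip()] = digest
    return checksums


def record_inputs(md5_file, inputs):
    """Store the current digest of every input, keyed by base name (assumed layout)."""
    checksums = read_checksums(md5_file)
    for path in inputs:
        checksums[Path(path).name] = file_md5(path)
    lines = [f"{digest}  {name}" for name, digest in sorted(checksums.items())]
    Path(md5_file).write_text("\n".join(lines) + "\n")


def alignment_is_reusable(alignment_file, md5_file, inputs):
    """True only if the alignment exists and every input digest still matches."""
    inputs = list(inputs)
    if not inputs or not Path(alignment_file).exists():
        return False
    stored = read_checksums(md5_file)
    return all(stored.get(Path(p).name) == file_md5(p) for p in inputs)
```

A caller would run `alignment_is_reusable(alignment, md5_file, [plasmid, reference])` before aligning, and call `record_inputs` only after a fresh alignment has finished. Keying by base name is a simplification: if the plasmid and the reference could share a file name, a fuller key such as the resolved path would be needed.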
Tasks:
Benefits:
Related Issues:
#39, #12