marbl / SALSA

SALSA: A tool to scaffold long read assemblies with Hi-C data
MIT License

added -O for original coordinates agp #161

Closed pickettbd closed 2 years ago

pickettbd commented 2 years ago

Added the -O flag to write an AGP file in original coordinates, which required adding the -b and -G options to get_seq.py. Refactored part of get_seq.py to simplify string building and use consistent whitespace. Removed the trailing tab in the output AGP. Also updated the README.

The differences in get_seq.py probably look larger than they are because I used with statements when opening some of the files, which changed the indentation levels. I tested my refactoring independently of the changes that add the original-coordinates functionality.
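For readers unfamiliar with why a switch to with statements inflates a diff, here is a minimal sketch. It is not code from get_seq.py; the function names are hypothetical and only illustrate the pattern.

```python
def read_lines_manual(path):
    # Manual open/close: the handle is only closed if no exception is raised
    # before close() is reached.
    handle = open(path)
    lines = [line.rstrip("\n") for line in handle]
    handle.close()
    return lines


def read_lines_with(path):
    # The with statement closes the file automatically, even on error.
    # The body moves one indentation level deeper, so every line in it
    # shows up as changed in a diff although the logic is the same.
    with open(path) as handle:
        return [line.rstrip("\n") for line in handle]
```

Both functions return the same list of lines, so viewing such a diff with whitespace changes ignored usually shows how small the real change is.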

Summary of functional changes: Supplying the -O flag to run_pipeline.py causes get_seq.py to be called with its new -b and -G options. -b names the input (the input_breaks file), and -G tells get_seq.py where to write the extra AGP file in original coordinates (as opposed to the coordinates from assembly.cleaned.fasta). Because the extra output filename is set by run_pipeline.py and not exposed directly to the user, the output names follow the pattern <name>.original-coordinates.agp, where <name> is scaffolds_FINAL at the end of the pipeline and scaffolds_ITERATION_# for intermediate steps (if -p was supplied to run_pipeline.py). The regular output (e.g., scaffolds_FINAL.agp) is still written and is identical whether or not the user supplies the new -O flag.
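The naming rule and the flag forwarding described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in run_pipeline.py: the helper names, the out_dir argument, and the location of the input_breaks file are hypothetical. Only the -b and -G options and the .original-coordinates.agp suffix come from the description.

```python
import os

# Suffix taken from the pattern <name>.original-coordinates.agp.
ORIGINAL_SUFFIX = ".original-coordinates.agp"


def original_agp_path(out_dir, name):
    # name is e.g. "scaffolds_FINAL" or "scaffolds_ITERATION_3" (hypothetical values).
    return os.path.join(out_dir, name + ORIGINAL_SUFFIX)


def extra_get_seq_args(use_original, breaks_path, out_dir, name):
    # Without -O, nothing extra is passed, so the regular output is unchanged.
    if not use_original:
        return []
    # With -O, forward the breaks file (-b) and the extra AGP destination (-G).
    return ["-b", breaks_path, "-G", original_agp_path(out_dir, name)]
```

Returning an empty list when -O is absent keeps the regular get_seq.py invocation exactly as before, which matches the statement that the regular output is identical either way.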

Run times are not significantly affected by this change despite the extra work, presumably because the extra computation is trivial, the extra files are small, and the refactoring prevents the repeated copying of strings.
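As an illustration of the string-building point, here is a minimal sketch of writing AGP rows by joining fields instead of concatenating repeatedly. It is not the code from get_seq.py; the function names and the example row are hypothetical. Joining on tabs also leaves no trailing tab at the end of a row.

```python
def agp_row(fields):
    # Join the columns with tabs; the separator only appears between
    # fields, so the row cannot end in a stray tab.
    return "\t".join(str(field) for field in fields)


def build_agp(rows):
    # Collect the rows in a list and join once at the end, rather than
    # growing one string with += inside a loop, which may copy the
    # accumulated text on each iteration.
    lines = [agp_row(row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical single-component scaffold in the nine-column AGP layout.
example = build_agp([("scaffold_1", 1, 1000, 1, "W", "contig_1", 1, 1000, "+")])
```

Here example holds one tab-separated line followed by a newline.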