hillerlab / make_lastz_chains

Portable solution to generate genome alignment chains using lastz
MIT License
44 stars 8 forks source link

setup_genome_sequences fails if rerunning from the "cat" step #54

Open ning-y opened 6 months ago

ning-y commented 6 months ago

This is a low importance issue, because the use case is rare and the workaround is easy.

If make_lastz_chains is called with --continue_from_step cat, but the "cat" step has been run before, make_lastz_chains will fail with e.g.

### Trying to continue from step: cat
Making chains for /data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/200-genomes/hg38.2bit and /data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/210-repeatmasked/GCA_004027375.1_MacSob_v1_BIUU_genomic.2bit files, saving results to /data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/300-chain/hg38/GCA_004027375.1_MacSob_v1_BIUU_genomic/out
Pipeline started at 2024-02-28 12:56:25.148054
 * Setting up genome sequences for target
genomeID: hg38
input sequence file: /data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/200-genomes/hg38.2bit
is 2bit: True
planned genome dir location: /data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/300-chain/hg38/GCA_004027375.1_MacSob_v1_BIUU_genomic/out/target.2bit
Traceback (most recent call last):
  File "/data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/100-install/make_lastz_chains/./make_chains.py", line 261, in <module>
    main()
  File "/data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/100-install/make_lastz_chains/./make_chains.py", line 257, in main
    run_pipeline(args)
  File "/data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/100-install/make_lastz_chains/./make_chains.py", line 233, in run_pipeline
    setup_genome_sequences(args.target_genome,
  File "/data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/100-install/make_lastz_chains/modules/project_setup_procedures.py", line 172, in setup_genome_sequences
    os.symlink(arg_input_2bit, seq_dir)
FileExistsError: [Errno 17] File exists: '/data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/200-genomes/hg38.2bit' -> '/data/wanglf/home/e0175719/runs/toga-pipeline-1.2/inter/300-chain/hg38/GCA_004027375.1_MacSob_v1_BIUU_genomic/out/target.2bit'

Because the "target.2bit" file was symlinked from the earlier run, and the os.symlink fails if the destination already exists.

The user workaround is to delete "target.2bit" which fixes this easily.

The developer fix is to check and remove the destination if it exists, or use a symbolic linking function which tolerates already existing destinations. Checking briefly, the os.symlink function does not have this option.

MichaelHiller commented 6 months ago

Thanks for reporting this. The correct behavior should be to NOT overwrite results that already exists (to prevent that somebody executes the pipe accidentally again). If the cat step was already run (either successfully, or not), then the user should clean it up and then continue with cat.

However, I agree that the script should output a proper warning + what the user needs to do. And the symlinks of the 2bit is likely not a stable solution. Bogdan, can you pls have a look at some point?

Thx a lot