hillerlab / make_lastz_chains

Portable solution to generate genome alignment chains using lastz
MIT License
46 stars 8 forks source link

A minor discrepancy between v1.0.0 and v2.0.8 at 'fill_chain' step #67

Open ohdongha opened 2 days ago

ohdongha commented 2 days ago

Hi again,

Recently, we have been testing make_lastz_chains v2.0.8 (MLC v2) to run on our SGE grid. After adding some clusterOptions directives to the Nextflow template (execute_joblist.nf), the MLC v2 pipeline runs well.

However, the alignment results have always been slightly different from those with make_lastz_chains v1.0.0 (MLC v1), so we compared the temporary job scripts from each step.

One difference is that MLC v1 uses --chainMinScore 25000 when running chain_gap_filler.py during the "fill_chain" step, while MLC v2 uses --chainMinScore 1000 or any other value given with --min_chain_score when setting up the pipeline.

MLC v2 accepts --fill_chain_min_score separately, with a default value 25000.

But in the fill_chain_step.py code, it uses param.chain_min_score instead of param.fill_chain_min_score (#24) when building job scripts that run chain_gap_fillter.py:

$ grep -n -B8 -A6 chainMinScore ./make_lastz_chains-2.0.8/steps_implementations/fill_chain_step.py
16-def create_repeat_filler_joblist(params: PipelineParameters,
17-                                 project_paths: ProjectPaths,
18-                                 executables: StepExecutables):
19-    to_log("Creating repeat filler jobs list")
20-    infill_chain_filenames = os.listdir(project_paths.fill_chain_jobs_dir)
21-    to_log(f"fGot {len(infill_chain_filenames)} chain files to fill")
22-    lastz_parameters = f"\"K={params.fill_lastz_k} L={params.fill_lastz_l}\""
23-    repeat_filler_params = [
24:        f"--chainMinScore {params.chain_min_score}",
25-        f"--gapMaxSizeT {params.fill_gap_max_size_t}",
26-        f"--gapMaxSizeQ {params.fill_gap_max_size_q}",
27-        f"--scoreThreshold {params.fill_insert_chain_min_score}",
28-        f"--gapMinSizeT {params.fill_gap_min_size_t}",
29-        f"--gapMinSizeQ {params.fill_gap_min_size_q}",
30-    ]

I wonder if this is intended.

Replacing param.chain_min_score with param.fill_chain_min_score appears to reduce the number of final alignments slightly (after post-processing) without affecting the alignment coverage of CDS.

...

There are also differences in how the target and query sequences were chunked and how sequences smaller than the chunk size were treated during the lastz step, but for this, I think what v2 does makes more sense than v1. :)

Another difference is handling the lastz_q (or BLASTZ_Q) parameter during the "chain_run" step. I will write about this in another issue.

MichaelHiller commented 1 day ago

Thx for reporting this. I just checked and we use a minScore of 25000 for all alis that we generate these days. But we have used 5000 before. I can't recall why we changed this and whether it mattered much.

Chains with a score <25000 are typically small and for most well-assembled genomes chains score 1+ million.

If you want to maximize sensitivity, pls set the threshold to 5000 or lower.

ohdongha commented 1 day ago

Thx for reporting this. I just checked and we use a minScore of 25000 for all alis that we generate these days. But we have used 5000 before.

Thanks for sharing this. For "chain_run" step, we have been using the default minScore 1000 (both MLC v1 and v2 have 1000 as the default) or 5000 when the scoring matrix is HoxD55. If increasing the default to 25000 during "chain_run" step does not lose much of the alignment CDS coverage while removing spurious and small alignments, that may work for our purpose as well. I will give it a try.

The score I mentioned in the post above is the default values for "fill_chain" step. MLC v1 used chain_gap_filler.py --chainMinScore 25000 while MLC v2 had chain_gap_filler.py --chainMinScore 1000 by default, reusing the same parameter used in "chain_run" step. If we are going to use the same value for the "chain_run" and "fill_chain" steps (e.g., 25000 for both steps), then the current v2.0.8 code is correct.

Still, in v2.0.8, make_chains.py accepts --fill_chain_min_score when setting up the pipeline, but the value does not seem to be delivered to chain_gap_filler.py or anywhere else. If you intend to use the same chain_min_score for both "chain_run" and "fill_chain" steps, it may be better to remove the --fill_chain_min_score argument from make_chains.py.

MichaelHiller commented 1 day ago

Yes, I understood. Pls keep the minscore of 1000 or 5000 for the chaining step.

For repeatFiller, this would only mean that gaps in the chains below the score threshold would not be filled (but such short chains often have pretty much no substantial gaps)

ohdongha commented 1 day ago

Yes, I understood. Pls keep the minscore of 1000 or 5000 for the chaining step.

For repeatFiller, this would only mean that gaps in the chains below the score threshold would not be filled (but such short chains often have pretty much no substantial gaps)

Thanks for the clarification. We will then continue to use 1000 and 5000 for the chaining ("chain_run") step, and 25000 for the repeatFiller ("fill_chain") step.