hillerlab / make_lastz_chains

Portable solution to generate genome alignment chains using lastz
MIT License
47 stars 8 forks source link

Using an alternative scoring matrix in make_lastz_chains v.2.0.8 #68

Open ohdongha opened 2 weeks ago

ohdongha commented 2 weeks ago

We have been adding BLASTZ_Q parameter when running make_lastz_chains v1.0.0 (MLC v1) to use an alternative scoring matrix (HoxD55 instead of the default HoxD70) for increased sensitivity when aligning more distant species (see issue #12).

Especially outside the mammalian or vertebrate clades, people quite often want to align genome pairs as far as human-chicken, human-frog, or human-zebrafish. We cannot say that HoxD55 is optimal for these pairs, but it has been better than HoxD70 and often retrieves substantially more alignment of CDS and synteny. I think the BLAST_Q or lastz_q in make_lastz_chains v2.0.8 (MLC v2) that allows an alternative scoring matrix is a needed feature.

MLC v2 uses the same run_lastz.py script as MLC v1 to set up lastz job scripts. Hence, we could deliver the lastz_q command similarly to issue #12. We created a one-line JSON file as follows and added it to the parameter set using --params_from_file ${one_line_JSON}:

{ "lastz_q": "${path_to_HoxD55.q_file}" }

With this, lastz was run with the HoxD55 matrix as expected. However, the MLC v2 "chain_run" step was not working the same way as MLC v1. This was because the chain_run_step.py script does not add lastz_q parameter to the axtChain job scripts as -scoreScheme=${matrix} parameter.

We tried to modify the chain_run_step.py as below (lines 53, 79, and 80 with "# add lastz_q if it exists" are newly added):

$ grep -n -A36 "def make_chains_joblist" make_lastz_chains-2.0.8/steps_implementations/chain_run_step.py
47:def make_chains_joblist(project_paths: ProjectPaths,
48-                        params: PipelineParameters,
49-                        executables: StepExecutables):
50-    # Prepare parameters
51-    seq1_dir = params.seq_1_dir
52-    seq2_dir = params.seq_2_dir
53-    matrix = params.lastz_q if hasattr(params, 'lastz_q') else "" # add lastz_q if it exists 
54-    # matrix = ""
55-    min_score = params.chain_min_score
56-    linear_gap = params.chain_linear_gap
57-    bundle_filenames = os.listdir(project_paths.split_psl_dir)
58-    to_log(f"Building axtChain joblist for {len(bundle_filenames)} bundled psl files")
59-
60-    cluster_jobs = []
61-    for bundle_filename in bundle_filenames:
62-        in_path = os.path.join(project_paths.split_psl_dir, bundle_filename)
63-        out_path = os.path.join(project_paths.chain_output_dir, f"{bundle_filename}.chain")
64-        cmd = [executables.axt_chain,
65-               "-psl",
66-               "-verbose=0",
67-               f"-minScore={min_score}",
68-               f"-linearGap={linear_gap}",
69-               in_path,
70-               seq1_dir,
71-               seq2_dir,
72-               "stdout",
73-               "|",
74-               executables.chain_anti_repeat,
75-               seq1_dir,
76-               seq2_dir,
77-               "stdin",
78-               out_path]
79-        if matrix != "" :                           # add lastz_q if it exists
80-            cmd.insert(3, f"-scoreScheme={matrix}") # add lastz_q if it exists
81-        cluster_jobs.append(" ".join(cmd))
82-    return cluster_jobs
83-

This modification will add the matrix parameter ONLY if the lastz_q parameter was ever declared. With this modification, MLC v2 alignments for distant pairs with HoxD55 matrix scored as high alignment CDS coverage as those we did with MLC v1. Without it, the alignment CDS coverages were 4~10% lower than what we had with MLC v1 in multiple tests.

...

This was a bit of a "band-aid" fix, and we had to deliver the lastz_q matrix as a JSON parameter file. This seemed to work for the lastz and "chain_run" steps, printing the job scripts equivalent to those we saw in MLC v1. I am not sure, though, if these two steps are the only place where the matrix parameter needs to be added. I wonder if the "fill_chain" or "clean_chain" step also needs the matrix parameter if we are not using the default. Anyways, delivering the HoxD55 matrix to lastz and "chain_run" steps recovered the alignment stats (CDS coverage) very close to what we had with MLC v1.

You could more formally add the --lastz_q parameter to make_chains.py. One tricky thing is that, unlike other alphabet parameters (lastz_h, lastz_y, etc.), lastz_q cannot be an empty string if it is declared at all. You may have to include the default HoxD70 matrix as a file and use its location as the default value of lastz_q.

MichaelHiller commented 2 weeks ago

Thx a bunch for reporting this together with a fix? @kirilenkobm This is likely a bug, as we don't use the lastz substitution matrix in the chaining step. Can you pls incorporate the change and think about how we can simplify the parameter input?

ohdongha commented 2 weeks ago

This is likely a bug, as we don't use the lastz substitution matrix in the chaining step.

I think make_lastz_chains v1.0.0 (MLC v1) does add the BLAST_Q parameter during the "chain_run" step in doLastzChain.pl in Line #534. The "fix" I posted above was trying to mimic what doLastzChain.pl does there.

And when we used the alternative matrix (HoxD55) for a distant species pair, the alignment stat was better when -scoreScheme=HoxD55.q was added to the axtChain job scripts during the "chain_run" step (i.e., MLC v1) than when it was not added (i.e., MLC v2, before the "fix").