lastz / lastz

Program for aligning DNA sequences, a pairwise aligner.
MIT License

LastZ_32 is taking a long time to do alignments and is not retrieving alignments #47

Open LuisGFdez opened 2 years ago

LuisGFdez commented 2 years ago

I've been working with the fasta files downloaded from here https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/.

What I've been trying to do is find repeat sequences across the different chromosomes.

These are the commands that I have been running.

The second one is giving me problems: it looks like it is never going to finish, whereas the first command runs relatively fast. The only difference between them is the --chain parameter.

lastz_32 ../chroms/chrX.fa[unmask][127680000..156040895] --self --nomirror --strand=minus --identity=80 --seed=match15 --gfextend --matchcount=200 --chain --allocate:traceback=1.99G --format=general:name1,start1,end1,strand1,length1,name2,start2+,end2+,strand2,length2,number,identity,score > chrX_chain.txt

lastz_32 ../chroms/chrX.fa[unmask][127680000..156040895] --self --nomirror --strand=minus --identity=80 --seed=match15 --gfextend --matchcount=200 --allocate:traceback=1.99G --format=general:name1,start1,end1,strand1,length1,name2,start2+,end2+,strand2,length2,number,identity,score > chrX_nochain.txt

rsharris commented 2 years ago

Howdy, @LuisGFdez

I'm not surprised that the --chain option reduced your runtime considerably in this case. At the end of the ungapped stage, it will discard most of the tandem repeat HSPs, and these are likely to be the ones that are dragging down the unchained run.

(You may have already found this) There's a brief description of what --chain does, at https://lastz.github.io/lastz/#stage_chaining

Lastz's Achilles heel is aligning tandem repeats, especially if they have diverged. This is discussed in many of the other issues. The best solution (or workaround) depends on what you are trying to find. For example, for simple tandem repeats with no (or very little) divergence, you'd be better off running TRF or RepeatMasker first, then masking those out before your lastz run.

Your --chain run of lastz probably isn't revealing many (if any) repeats.

Anyway, if you can post a little info about what kind of things you're looking for, I may have a way to accomplish it. E.g. are you looking for any region of a chromosome that is similar to any other on that chromosome? Or in the genome? Or just exact matches?

Bob H

LuisGFdez commented 2 years ago

In fact, I am looking for tandem repeats and other repeat sequences that lastz may find, such as SINEs and LINEs. Also, I am looking for matches with at least 80 percent identity in each one of the chromosomes, and if I work with the softmasked version, most of the repeats will be discarded from the analysis; that is the reason I've been using the "unmask" option. I am trying to do this for each chromosome independently. In the example I showed you above, I specified the coordinates because working with the whole chromosome was taking a long time, and I thought the reason was the size, but the same thing happens when working with a short region. When I add the --chain option, everything is solved, but as you said I might be losing some repeats.

LuisGFdez commented 2 years ago

What I mean by "taking a long time" is that the lastz command never finishes running in the cases without the --chain option.

rsharris commented 2 years ago

I'll suggest some options shortly, in a separate post.

In fact, I am looking for tandem repeats and other repeat sequences that lastz may find, such as SINEs and LINEs. Also, I am looking for matches with at least 80 percent identity in each one of the chromosomes

So, pretty much any duplicated element, allowing for ≈20% divergence.

and if I work with the softmasked version, most of the repeats will be discarded from the analysis; that is the reason I've been using the "unmask" option.

Right. Softmasked bases are eliminated from the seeding process; they can be part of reported alignments only if an alignment extends into them from a nearby seeded match.

I am trying to do this for each chromosome independently. In the example I showed you above, I specified the coordinates because working with the whole chromosome was taking a long time, and I thought the reason was the size, but the same thing happens when working with a short region.

Size is definitely a contributing factor. The number of HSPs will be something like O(kL^2), where L is the length and k relates to self-similarity. If k is low, the gapped stage will be close to linear in the number of HSPs. But in regions where repeat density is high (like tandem repeats), runtime becomes superlinear.

For uses like yours, breaking things into pieces is usually part of the solution. Your piece size is reasonable (about 30Mbp).
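
For illustration only, here is an untested sketch of a piece-by-piece run, reusing the options from your first command; the piece size and chromosome length are taken from this thread, the output file names are placeholders, and note that repeats spanning a piece boundary would be missed unless the pieces are also aligned against each other:

    chrom=../chroms/chrX.fa
    chromLen=156040895            # length of hg38 chrX
    pieceSize=30000000

    start=1
    while [ ${start} -le ${chromLen} ]; do
        end=$((start+pieceSize-1))
        if [ ${end} -gt ${chromLen} ]; then end=${chromLen}; fi
        lastz_32 ${chrom}[unmask][${start}..${end}] \
            --self --nomirror --strand=minus \
            --seed=match15 --gfextend --matchcount=200 --chain \
            --allocate:traceback=1.99G \
            --format=general:name1,start1,end1,strand1,length1,name2,start2+,end2+,strand2,length2,number,identity,score \
          > chrX_${start}_${end}.txt
        start=$((end+1))
    done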

rsharris commented 2 years ago

I've had success recently by separating the gapped stage from the ungapped stage (i.e. running lastz separately for each). Conceptually this isn't different than what lastz does internally, except that this allows me to use different filtering options on the two stages. Here's an example:

    minHspMatch=200
    minGappedMatch=500      # this may be too long for your intended use
    hspThresh=$((91*minHspMatch))

    # create filtered HSPs (anchors) ...
    lastz \
        chrX.fa[unmask][120000000..150000000] \
        chrX.fa[unmask][150000000..180000000]  \
        --nogapped \
        --filter=identity:80 \
        --filter=nmatch:${minHspMatch} \
        --hspthresh=${hspThresh} \
        --format=general-:name1,start1,end1,name2,start2,end2,strand2,nmatch \
        --progress=1 \
      > X12_vs_X15.anchors

    # align from those anchors ...
    lastz \
        chrX.fa[unmask][120000000..150000000] \
        chrX.fa[unmask][150000000..180000000]  \
          --segments=X12_vs_X15.anchors \
          --allocate:traceback=800M \
          --filter=identity:75 \
          --filter=nmatch:${minGappedMatch} \
          --format=general:name1,start1,end1,strand1,length1,name2,start2+,end2+,strand2,length2,number,id%,score \
        --progress=1 \
      > X12_vs_X15.dat

I haven't actually tried that on hg38 chrX. But I used it on one of the newer T2T assemblies of chrY, in a region that contained a beta satellite repeat. Different copies of the repeat element had identity below ≈81%. From a dotplot it was clear that I hadn't captured all alignments, so presumably some must have been below the 75% cutoff I used in the gapped stage (or perhaps they contained no suitable HSP and got lost in the ungapped stage).

I'll try to run the commands above today and see how they do.

rsharris commented 2 years ago

I ran something like what I show above, aligning a 20M piece to a 16M piece (commands below). Run time was a little over five hours, 10 minutes of which was for the ungapped stage.

The result was ≈150,000 alignments. The longest of these was less than 10Kbp. The shortest were a little over 500 bp, but this was the result of requiring an alignment to contain 500 matched bases. Identity ranged from 80.1% to 99.6%. I was expecting to see some between 75% and 80% but didn't. I'm not entirely sure why not; it could be that I'd need to set --filter=identity to 75 in the ungapped stage as well.

I'll post a dotplot below too, but not sure that really tells us much.

The real question is how much sensitivity was lost by eliminating the short HSPs. The default parameters have hspthresh=3000. This can hypothetically be satisfied by an exact match of 32bp. In regions with i.i.d. 80% identity an HSP would have to be a little longer to reach that, probably in the 40 to 50 bp range. So we're throwing away a lot of HSPs.

Now, theoretically, if a homology exists we only need to find one HSP in it. In my 2007 thesis I explored the effect of some parameters on overall sensitivity, but not of this type of filtering. So I really can't say what gets missed here.

The commands used:

minHspMatch=200
minGappedMatch=500      # this may be too long for your intended use
hspThresh=$((91*minHspMatch))

# create filtered HSPs ...
time lastz \
    hg38.2bit/chrX[unmask][120000000..140000000] \
    hg38.2bit/chrX[unmask][140000000..156040895]  \
    --nogapped \
    --filter=identity:80 \
    --filter=nmatch:${minHspMatch} \
    --hspthresh=${hspThresh} \
    --format=general-:name1,start1,end1,name2,start2,end2,strand2,nmatch \
    --progress=500 \
  > X12_vs_X14.anchors

# align from those anchors ...
time lastz \
    hg38.2bit/chrX[unmask][120000000..140000000] \
    hg38.2bit/chrX[unmask][140000000..156040895]  \
      --segments=X12_vs_X14.anchors \
      --allocate:traceback=800M \
      --filter=identity:80 \
      --filter=nmatch:${minGappedMatch} \
      --format=general:name1,zstart1,end1,name2,strand2,zstart2+,end2+,nmatch,length1,id% \
      --rdotplot+score=X12_vs_X14.dots \
      --progress=500 \
  > X12_vs_X14.lz
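
In case it's useful, here is one (untested) way to pull that kind of summary out of the .lz file; it assumes id% is column 10 in the gapped-stage format above:

    awk '!/^#/ { n++
                 id = $10; sub(/%/,"",id); id += 0
                 if (n==1 || id < minId) minId = id
                 if (n==1 || id > maxId) maxId = id }
         END   { print n, "alignments, identity", minId"% to", maxId"%" }' X12_vs_X14.lz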

The dot plot: [dotplot image attached]

LuisGFdez commented 2 years ago

It looks really good. I'll try it on my data and then tell you about my results. I would like to add the "self" parameter, because the idea is to perform a self-alignment, and "strand=minus/plus", because the idea is to find direct and inverted repeats separately. Is it possible to add them to the example shown above?

rsharris commented 2 years ago

I would like to add the "self" parameter, because the idea is to perform a self-alignment, and "strand=minus/plus"

Those should both work, but let me run a quick experiment this morning regarding strand=minus. I need to check how strand is recorded in the anchors file when strand is specified on the command line.

rsharris commented 2 years ago

Unfortunately, I never implemented the combination of --segments with --self, so lastz rejects that when it parses the command, with the message "--segments can't be used with --self". I don't remember why I didn't implement that; it would have been around 2005/2006. At this point I'm not making changes like that to the code unless it's a bug, either a crash or something giving wrong answers.

I've also found some inconsistencies using --self --nomirror with --strand=minus. Specifically, I ran (a) --self --nomirror without --strand, then (b) --self --nomirror with --strand=plus, and (c) --self --nomirror with --strand=minus. Some of the minus-strand HSPs went unreported in (c). This looks like a bug to me, so I need to dig into that. EDIT: The bug appears to have been a mistaken report; see issue 49.

For you, I think the workaround would be to use --self --nomirror without --strand in the ungapped stage. Then for the gapped stage, give the query as the same file as the target (instead of using --self), and don't use --strand. Then postprocess the output to separate the forward and reverse alignments into two files. The postprocessing would also check for and discard the trivial self-alignment.

To clarify, this is what I mean (plus the post-processing):

minHspMatch=200
minGappedMatch=500
hspThresh=$((91*minHspMatch))

lastz \
    hg38.2bit/chrX[unmask][120000000#500K] \
    --self --nomirror \
    --nogapped \
    --filter=identity:80 \
    --filter=nmatch:${minHspMatch} \
    --hspthresh=${hspThresh} \
    --format=general-:name1,start1,end1,name2,start2,end2,strand2,nmatch \
    --progress=1 \
  > X12_vs_X12.self.both.anchors

lastz \
    hg38.2bit/chrX[unmask][120000000#500K] \
    hg38.2bit/chrX[unmask][120000000#500K]  \
    --segments=X12_vs_X12.self.both.anchors \
    --allocate:traceback=800M \
    --filter=identity:80 \
    --filter=nmatch:${minGappedMatch} \
    --format=general:name1,zstart1,end1,name2,strand2,zstart2+,end2+,nmatch,length1,id% \
    --progress=1 \
  > X12_vs_X12.self.both.lz
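
For the post-processing, a minimal untested sketch might be something like this (column positions assume the general format above: 2=zstart1, 3=end1, 5=strand2, 6=zstart2+, 7=end2+; the trivial self-alignment is the plus-strand record whose target and query intervals are identical; output file names are placeholders):

    awk '!/^#/ && !($5 == "+" && $2 == $6 && $3 == $7) {
           if ($5 == "+") print > "X12_vs_X12.self.plus.lz"
           else           print > "X12_vs_X12.self.minus.lz"
         }' X12_vs_X12.self.both.lz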

LuisGFdez commented 2 years ago

A quick question: I was not able to find the parameter "ungapped" in lastz's documentation; I think it should be "nogapped". Is it the same? And how do you know that the score threshold you set for the x-drop extension method is the most adequate? I mean, why did you multiply 91*minHspMatch?

rsharris commented 2 years ago

How do you know that the score threshold for the x-drop extension method that you set is the most adequate? I mean, why did you multiply 91*minHspMatch?

Right, I don't know that's the best threshold. The idea was that I wanted HSPs equivalent to an exact match of length minHspMatch. In the default scoring matrix, 91 is the lowest score given to matched bases.
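
(With minHspMatch=200 that works out to --hspthresh=18200, i.e. 91*200; since every matched base scores at least 91 under the default matrix, any 200 bp exact match reaches that threshold.)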

I was not able to find the parameter "ungapped" in lastz's documentation; I think it should be "nogapped". Is it the same? And how do you know that the score threshold for the x-drop extension method that you set is the most adequate? I mean, why did you multiply 91*minHspMatch?

Rats. Sorry about that. Yes, they are identical. I should use --nogapped, especially when I post examples.

Lastz's command line parser accepts several aliases. Most of these are to maintain backward compatibility. I should probably document those some place. Regardless, I shouldn't use undocumented things when I post.

I'm about to edit that in this issue. And I need to go through other posts and correct those too. Thanks for pointing that out.

LuisGFdez commented 2 years ago

I should have mentioned this before, but I require an identity >= 80% and a minimum length of 200 bp, which is already done, but I also need no gaps in my matches, so I was wondering whether this is accomplished in the second run of lastz.

rsharris commented 2 years ago

... also need no gaps in my matches, so I was wondering if this is accomplished in the second running of lastz

This is what the ungapped stage produces, so you should be able to eliminate the second run altogether. (I think) you'd just need to add --nogapped to that initial command you posted.

I require an Identity >= 80% and a Min. length = 200 bp

Looking back at the command you originally posted, I see you have --matchcount=200. (BTW, this is old syntax for what is now --filter=nmatch:200). If your goal is to have a minimum length of 200bp, this option isn't exactly right. For example, a 210bp alignment with 190 matches and 20 mismatches would be rejected.

I didn't implement a filter based on length (I'll describe why not in a moment). Filtering by length would be easy to implement with post-processing.

The reason I didn't implement a length filter is that it would lead to inconsistencies in what gets discarded. Suppose we are filtering for identity≥80% and length 200. We would keep alignment A with 160 matches and 40 mismatches but discard alignment B with 199 matches and no mismatches. B would be satisfactory if we were to arbitrarily extend it on either end by adding one mismatch. But we don't usually include mismatches or gaps at an alignment's end, and doing that for B would reduce its identity. I considered reporting shorter alignments, like B, without modifying them. But I felt it would be difficult to explain why alignments shorter than the specified length could be reported.

--filter=identity:80 --filter=nmatch:160 would produce all the alignments that are either 200 bp long or can be extended to 200 by adding mismatches, and still have identity≥80%.
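
(If you do decide you want a strict length filter, a minimal untested post-processing sketch would be something like the following, using the start1/end1 columns of your general format, columns 2 and 3, 1-based and inclusive, with a placeholder file name:)

    awk '/^#/ || ($3 - $2 + 1) >= 200' lastz_output.txt > lastz_output.min200.txt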

LuisGFdez commented 2 years ago

Thank you so much for the feedback. I will add the --nogapped parameter to the initial command posted here. I hope it improves the running time even without the --chain option, because the second command posted gives many more repeats than those I was getting with the initial command.

LuisGFdez commented 2 years ago

Hello, it's me again.

Lastz has been doing a really good job since I added the --nogapped parameter. I've found the repeats that I was looking for; they overlap with regions from other datasets.

This is the command that I've been running:

lastz_32 ../chroms/chrX.fa[unmask] --self --nomirror --strand=minus --seed=match15 --gfextend --nogapped --filter=identity:80 --filter=nmatch:160 --allocate:traceback=1.99G --format=general:name1,start1,end1,strand1,length1,name2,start2+,end2+,strand2,length2,number,identity,score > chrXinv_nogapped.txt

The trouble is that I am getting many repeats that overlap in the same region, as you can observe in the screenshots, which complicates the post-processing with the other algorithms that I am using. So I was wondering if there is a way to reduce the number of repeats that overlap in the same region. The blue ones are on the positive strand, while the green ones are on the negative strand. [two screenshots attached]

rsharris commented 2 years ago

So I was wondering if there is a way to reduce the number of repeats that overlap in the same region.

Lastz doesn't have anything that does that. For one thing, it would need to know the user's criteria for which alignment is more important than which other (score? length? identity?).

You should be able to implement some relatively simple post-processing to do this, though. Here's an awk script that uses a greedy algorithm: it scans along the chromosome, chooses the next non-overlapping alignment (preferring longer over shorter), and discards any other alignments that overlap it.

NOTE: I have not run this — it may contain errors.

# columns in the general format above: 1=name1, 2=start1, 3=end1, 4=strand1, 9=strand2
cat chrXinv_nogapped.txt \
  | awk '/^#/ { print $0 }
        !/^#/ { print $0 | "env LC_ALL=C sort -k 9,9d -k 4,4d -k 2,2n -k 3,3nr" }' \
  | awk '/^#/ { print $0; next }
         {
         name   = $1;
         start  = $2;
         end    = $3;
         strand = $9;
         nameWithStrand = name""strand;
         if ((nameWithStrand != prevNameWithStrand) || (start > prevEnd))
           {
           print $0;
           prevEnd = end;
           }
         prevNameWithStrand = nameWithStrand;
         }' \
  > chrXinv_nogapped.overlap_removed.txt

LuisGFdez commented 2 years ago

I think I figured out what the problem was: the alignments overlapping in the same region were due to how the seeds were being built. This is the command I am using to reduce the seed sensitivity:

lastz_32 ../chroms/chrX.fa[unmask] --self --nomirror --strand=minus --step=10 --maxwordcount=1 --seed=match11 --notransition --notwins --gfextend --nogapped --filter=identity:80 --filter=nmatch:160 --allocate:traceback=1.99G --format=general:name1,start1,end1,strand1,length1,name2,start2+,end2+,strand2,length2,number,identity,score > chrXinv_26.txt

Increasing the step size and the seed match length, using no transitions, and reducing the maxwordcount seem to be doing a good job of reducing the number of overlapping alignments I was getting, which is what I was looking for.

What I do not understand is why a bigger value (than the one in the command I posted here) for "step" and "seed=match" gives me more alignments. I would expect that the bigger the size, the fewer alignments I would get, but it is not working like that. Instead, the smaller the step and seed match sizes, the fewer alignments I am getting.

rsharris commented 2 years ago

I think I figured out what the problem was: the alignments overlapping in the same region ...

To be clear, this isn't a "problem" — the program was designed to report alignments without regard for whether they overlap other alignments.

This is the command that I am using to reduce the seed [sensitivity] ... doing a good job at reducing the number of overlapped alignments that I was getting, which is what I was looking for to reduce the number of alignments.

OK. Just realize that what you're doing is, probabilistically, reducing all alignments, not just those that overlap. It's possible that you are causing some alignment that you'd like to find, and which doesn't overlap any other alignment, to be missed. Just be aware that that is what you are doing.

What I do not understand is why a bigger value ... for "step" and "seed=match" is giving me more alignments. I would expect the bigger the [step] size the less alignments I would get, but it is not working like that.

My expectation matches yours: a shorter step should produce more (or at least as many) ungapped alignments. The relationship won't usually be linear, but I'd expect them to move in opposite directions. (I'm assuming you weren't changing other parameters between tests.) The only wild guess I can think of is that a low step creates so many seeds that the hash collision rate increases dramatically (which would result in the loss of HSPs). But I don't recall ever seeing that happen.

Are you still using hg38 chrX? I can try some experiments to see if I see the same strange trend you're reporting.

Thanks

LuisGFdez commented 2 years ago

Yes, I am still using hg38 chrX

LuisGFdez commented 2 years ago

Yes, I was not changing other parameters between tests. I just changed the step and seed=match values, and I'm trying to figure out what values are best to avoid overlapping alignments that are basically the same (differing by only 1 or 2 bases) while keeping the ones that are important.

rsharris commented 2 years ago

I'm trying to figure out what values are best to avoid overlapping alignments that are basically the same (differing by only 1 or 2 bases) while keeping the ones that are important.

I still say the best way to do that is by post-processing the alignments so that you can programmatically identify which are the important ones.

(quoting myself) ... shorter step should produce more (or at least as many) ungapped alignments ...

My plan is to run this on a 500K piece of chrX, varying step from 1 to 20 and counting the number of alignments reported. I will start with essentially the defaults, then add the other options you are using, one by one, to see where the number of alignments doesn't decrease with larger step.
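
As a rough sketch (not the exact script I used), the sweep looks something like this:

    for stepSize in $(seq 1 20); do
        n=$(lastz hg38.2bit/chrX[unmask][120000000#500K] \
              --self --strand=minus --nomirror --nogapped \
              --step=${stepSize} \
              --format=general-:name1,start1,end1 \
            | wc -l)
        echo "step=${stepSize} alignments=${n}"
    done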

Starting with --self --strand=minus --nomirror --nogapped --step=${stepSize}, the number of reported alignments consistently decreased as step increased, i.e. the expected behavior, from ≈19,000 down to 9,500. There was one exception: step=18 reported ≈2% more than step=16 and step=17.

If I don't get the unexpected behavior after adding all your options, I'll try again with a larger piece of chrX. The advantage of using a short but non-trivial piece is that things run quicker.

I'm squeezing this in between several other things I'm working on, so it may take me a few days.

LuisGFdez commented 2 years ago

I've been running the command with the entire X chromosome and it takes a couple of minutes.

rsharris commented 2 years ago

I'm going to call that unexpected trend, alignment count increasing as step increases, the anti-trend.

Based on my experiments, it looks like the anti-trend occurs when --maxwordcount=1 is combined with either --seed=match11 or --notransition.

I can't explain that, but I'll continue looking.

I used a 2Mbp interval of chrX in these tests — chrX[unmask][120000000#2M].

Trying to think through how this anti-trend could happen ...

--maxwordcount=1 means that only seed words that are unique in the target are used. Equivalently, every alignment reported will contain (in the target side of the alignment) at least one seed word that is unique in the target. It is possible that increasing the length of the target will reduce the number of unique seed words, which would be expected to reduce the number of alignments. However, that effect doesn't come into play in my tests, since I used the same target throughout.

The positional distribution of those unique seed words won't be quite uniform. Wherever the target contains tandem repeats (longer than twice the seed span), there will be no unique seed words.

The effect of --step=N is to discard seed words at positions that are not a multiple of N. Thus if the seed words are positionally distributed uniformly, this would be expected to filter out (N-1)/N of them. Now, consider an alignable region that contains M seed words. If we find any one of those M we'll report the alignment. Assuming positional independence of those M, we'd expect to have a ((N-1)/N)^M chance that all M were discarded, thus a 1-((N-1)/N)^M chance of reporting the alignment.
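
For example, under that idealized model with step N=10, an alignable region containing M=20 seed words would be reported with probability 1-(9/10)^20 ≈ 0.88, while a region with only M=5 seed words drops to 1-(9/10)^5 ≈ 0.41.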

Hmmm... I can't see how that would affect the trend. --maxwordcount=1 just reduces the number of seed words, i.e. reduces M (in a stochastic sense) in every alignment. Maybe the non-uniform nature of the positional distribution is what causes this?

I'm also surprised that I didn't see the anti-trend when I used --maxwordcount=1 by itself.

rsharris commented 2 years ago

I also considered whether the fact that target and query are the reverse complements of each other might be involved (spoiler: it isn't). For instance, the discussion in my previous post doesn't account for the fact that if a seed word is unique in the target it's also (with --seed=match11) unique in the reverse complement.

But repeating (some of) the experiments for two non-overlapping segments of chrX, I still see the anti-trend.

LuisGFdez commented 2 years ago

Hi, I was just reading your PhD thesis, and I found that one way to remove all the noisy alignments is to use the twin-hit seed strategy. From what I've learned, and as you also mentioned in your thesis, "for sequences with high similarity, BLASTZ seeding strategies are too sensitive, resulting in too many seed hits".

I was thinking that the twin-hit seed strategy reduces the sensitivity of the seeds but increases their specificity. So I believe that by running lastz with this new parameter I could avoid the "anti-trend" pattern.

lastz_32 ../chroms/chrX.fa[unmask] --self --nomirror --strand=minus --seed=match15 --notransition --twins=-10..10 --gfextend --nogapped --filter=identity:80 --filter=nmatch:160 --allocate:traceback=1.99G --format=general:name1,start1,end1,strand1,length1,name2,start2+,end2+,strand2,length2,number,identity,score > chrXinv_twinhits.txt

But I am getting this new message:

seed hit queue shortfall at 136264366/24562

Also, would you recommend running this new command with a specific step and seed size?

rsharris commented 2 years ago

seed hit queue shortfall ...

Well that is an interesting message. Searching the source code, I find it's essentially a sanity check for when a certain data structure (the seed hit queue) is full. I left a note for myself there as to how to handle that, but I never implemented it.

It'll take me some digging to figure out how to resolve this, and/or what parameterization you would need to avoid this condition.

LuisGFdez commented 2 years ago

Hey, I am curious about setting a threshold for terminating the gapped extension. I mean, when my alignments are extended, the identity of several of them decreases below 80% if we take the gaps into account. And I am not sure whether, by doing multiple tests, I can find a threshold for --ydrop, or whether this should be solved in a post-processing stage.

rsharris commented 2 years ago

There are two cases to consider. One is if you have a nice alignment X and gapped extension allows some lower-identity junk J to be included on each end. While J may be lower identity, if it has a positive score (under the scoring model), X+J will score higher than X alone, so X+J is what gets reported. (I call this a random tail.) You could try increasing the penalties for gaps and mismatches in the model.

The other case is if you have nice alignments X and Y with junk J between them. If J scores better than -ydrop, gapped extension can wander from X into Y. If X and Y each score more than the amount by which J's score is negative, then X+J+Y will score higher than either X or Y by itself, and X+J+Y is what gets reported. (I call this a random bridge.) You could try decreasing ydrop (i.e. making it closer to zero) and also increasing the penalties for gaps and mismatches in the model.

But your point is well taken: it would be difficult to know how to set those scores. On occasions where I needed to deal with this, I used post-processing, scanning the alignment to try to identify intervals that had a higher error density (gaps or mismatches) than the rest of the alignment.

LuisGFdez commented 2 years ago

Sounds good, but what are the default scores for matches and mismatches? I was planning to test it with --match=1,5 --gap=500,40 and --ydrop=200.

rsharris commented 2 years ago

That's the matrix shown at the bottom of this section: https://lastz.github.io/lastz/#options_scoring

Default gap scoring depends on whether you use --match or not. This is a backward-compatibility issue: --match didn't exist in blastz, so its presence on the command line allows lastz to scale the gap penalties with the mismatch penalty. If you don't use --match, the gap penalties are 400 for open and 30 for extend.

You can use --help=defaults to see what parameters it's going to use, at least most of them; it reports them and quits. These are affected by what else is on the command line. And (unfortunately) you have to include a name for the target (which doesn't have to be accurate).
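
For example (untested; the target name here is just a placeholder):

    lastz chrX.fa --help=defaults
    lastz chrX.fa --match=1,5 --help=defaults    # defaults shift when --match is on the command line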