multiz / multiz

DNA multiple sequence aligner, official version from Penn State's Miller Lab
MIT License
31 stars 6 forks source link

inconsistent row size #6

Open Ruttix opened 2 years ago

Ruttix commented 2 years ago

Hello, I'm trying to use roast in order to project alignments of 30ish plant species. I get the error 'inconsistent row size' and afterwards that the corresponding multiz command has failed. An example: multiz M=1 /tmp/_MZ_6588_left.maf22 /tmp/_MZ_6588_right.maf22 0 /tmp/_MZ_6588_U1 /tmp/_MZ_6588_U2 >> /tmp/_MZ_6588_MZ22

multiz.v11.2: line 5803179 of /tmp/_MZ_6588_right.maf22: inconsistent row size

roast.v3: command 'multiz M=1 /tmp/_MZ_6588_left.maf22 /tmp/_MZ_6588_right.maf22 0 /tmp/_MZ_6588_U1 /tmp/_MZ_6588_U2 >> 254 /tmp/_MZ_6588_MZ22' failed

This happens for different versions of alignments (containing different species). The weird thing is this error of inconsistency can show up but the roast continues. Would you be kind to lead me in any kind of direction as to how to troubleshoot this? For context, I made the alignment using a pipeline that uses the last aligner, and the roast is using toast2.maf files.

Thank you!

rsharris commented 2 years ago

(I'm pasting here a separate email where this was reported by someone else(!). The messages have some differences so for the sake of completeness I'm copying that email here.)

I was wondering if you are able to help me solve an error message I encountered using multiz.

I am aligning multiple avian genomes to a reference genome (L. tetrix) to create a .maf file. I have used LASTZ to align all query genomes to the reference genome, and am now using roast to generate the multiple species alignment file. However, my pipeline gets stuck at an error saying that there is an inconsistent row size. This is the end of the log file:

multiz M=1 /tmp/_MZ_2339053_left.maf8 /tmp/_MZ_2339053_right.maf8 0 /tmp/_MZ_2339053_U1 /tmp/_MZ_2339053_U2 >> /tmp/_MZ_2339053_MZ8 multiz.v11.2: line 117256 of /tmp/_MZ_2339053_right.maf8: inconsistent row size roast.v3: command 'multiz M=1 /tmp/_MZ_2339053_left.maf8 /tmp/_MZ_2339053_right.maf8 0 /tmp/_MZ_2339053_U1 /tmp/_MZ_2339053_U2 >> /tmp/_MZ_2339053_MZ8' failed

I am using the msa_pipeline to generate the roast.maf file, so see their documentation for the exact software used in the pipeline to get to the roast call. You can check out the /_MZ_2339053_right.maf8 file here if you like: https://wetransfer.com/downloads/08f544bf991576876fa85909bc8dedfb20221111084513/a7d63d8cf143bd636073d57ab9930cca20221111084529/376830

With the limited details I’ve provided you with I realize this is not a reproducible example, but at this stage I am wondering if you could explain to me what this error message of inconsistent row size means, so I can try to identify the problem in my data myself?

Many thanks in advance,

rsharris commented 2 years ago

As with any multiz issues, I have to start with this disclaimer:

Support for multiz is pretty slim. I created this repo a couple years ago for the reasons mentioned in README.md, but while I am the only person listed as a contributor (under two accounts), I'm not the original author.

Moreover, the lab in which multiz was developed hasn't technically existed since the P.I. retired several years ago.

The original author is Min Mei Hou. Last I knew, she was at Northern Illinois, but I don't know if she is still there.

That said, I can try to answer simple issues.

rsharris commented 2 years ago

ruttix wrote:

For context, I made the alignment using a pipeline that uses the last aligner, and the roast is using toast2.maf files.

To make sure I understand you, you used the pairwise aligner at gitlab.com/mcfrith/last ? multiz is more commonly used with lastz. I don't know yet whether this is relevant to the issue. But I don't want to try to figure out if last is the issue if you aren't in fact using it.

rsharris commented 2 years ago

The emailer whose name I have not revealed wrote:

You can check out the /_MZ_2339053_right.maf8 file here if you like: https://wetransfer.com/downloads/08...

Thanks. I followed the link and it wants me to sign in. Not sure yet if I can view that for free or not. Will check on that later.

rsharris commented 2 years ago

multiz.v11.2: line 5803179 of /tmp/_MZ_6588_right.maf22: inconsistent row size OR multiz.v11.2: line 117256 of /tmp/_MZ_2339053_right.maf8: inconsistent row size

The most likely cause is that the output file from one process was truncated. And then when it was used downstream a sanity check triggered.

What the error message specifically means is that two nucleotide strings (including letters and hyphens) in the same alignment block don't have the same length. Looking at "the simple example" of MAF at http://genome.ucsc.edu/FAQ/FAQformat#format5 , the nucleotide strings in a block all have the same length.

The line number given in the error message would be the number of the line where the error was detected. If it's a file on disk, you should be able to look at that line and the lines preceding it in the same alignment block, and see the problem. My best guess is that this will be the last line in the file, because the file was truncated.

If it isn't the last line, you should notice the difference in lengths. This would indicate some problem with whatever created the file.

That's my best guess. Feel free to post followups.

rsharris commented 2 years ago

The emailer whose name I have not revealed wrote:

I am using the msa_pipeline ...

Thanks (very much!) for including that. At this point I've only briefly glanced at it. But that will be helpful is the root of the problem is something other than file truncation.

rshuhuachen commented 2 years ago

Thanks a lot Bob for your help, I'm the person from the email and thought it would be easier to respond on here.

Your explanation of the error message makes sense in theory and that's also how I interpreted it. Maybe I understand it wrong, but I think that would mean that the alignments from the different species (both target/reference and query ones) do not have the same length. However, this is the case for most of my alignment blocks and those don't show any errors.

In case you cannot open the wetransfer file, I copied the important information below (i.e. the last few code blocks of the file that contain the error). Here are the last three alignment blocks from my .maf file, for which the error code happens at line 8 in this case. I just added these line numers for ease of referencing to them on here, they're not in the original file. I also shortened the sequences by replacing the rest with ***.

1 score=53327.0
2 Lyrurustetrix.ScEsiA3_19038;HRSCAF=23783 37999201 688 + 109349036 AAAAAAGTG***
3 Chrysolophuspictus.KZ861216.1             2082782 663 -   2160913 AAAAAAGTG***

4 a score=94939.0
5 s Lyrurustetrix.ScEsiA3_19038;HRSCAF=23783 37999966 1151 + 109349036 TTTTTTTTTTCCCCT***
6 s Chrysolophuspictus.JRFT01062866.1               2 1167 +      2082 TTTTTTTTTTCCCCTC***

7 a score=1575323.0
8 s Lyrurustetrix.ScEsiA3_19038;HRSCAF=23783 38001117 19335 + 109349036 AGGTG***
9 s Chrysolophuspictus.KZ861216.1             2084670 19381 -   2160913 AGGTATG***

As described on UCSC (http://genome.ucsc.edu/FAQ/FAQformat#format5), the third column refers to the length of the aligning region in the source sequence (which I think is the one that the error message refers to as being different in length, in block 1 688 and 663, in blcok 2 1151 and 1167 and in block 3 19335 and 19381). As you can see, these lengths are different for all three alignment blocks, not just the last one with the error.

So long story short, I'm still not sure what the error really refers to and therefore am not sure where to start with solving the problem. Do you have a better idea after seeing this example?

Some additional info: I tried leaving out species Chrysolophus pictus but when I do this, error messages from alignments of other species show up.

I'm currently trying to align with minimap2 rather than lastz to see if it makes a difference in the roast call. Will keep you updated!

Thanks again, Rebecca

Ruttix commented 2 years ago

Heya, Thanks for responding! I am indeed using the last aligner, and not lastz, mostly because it is included in the msa_pipeline which I was hoping would save me the time of making all the steps from genomes to conservation scores manually. I am currently running roast manually, as in without the pipeline, and it is currently ongoing. It had gone further in the roast than it did with the pipeline, I'll update with the results.

I'm not sure what do you mean by difference in length - the alignments are indeed not all the same in length. What does roast check triggering this? Also, for my situation, the roast was running on LSF, and I couldn't check the temp files this error was indicated on. I'm re-running it again with access to those files to take a look.

rsharris commented 2 years ago

rshuhuachen posted:

As described on UCSC (http://genome.ucsc.edu/FAQ/FAQformat#format5), the third column refers to the length of the aligning region in the source sequence ...

Right, but that's the length of the aligning region, not the length of the alignment text. The text (usually) includes some hyphens. While I don't see anything in the MAF spec that requires text to have the same length for all components in a block, the block doesn't make sense as a multiple alignment if some text has a different length.

I'll create a free wetransfer account this morning and see if I can access the example you linked.

rsharris commented 2 years ago

Ruttix posted:

I'm not sure what do you mean by difference in length - the alignments are indeed not all the same in length. What does roast check triggering this?

See the reply to rshuhuachen I posted about ten minutes ago.

I am indeed using the last aligner, and not lastz ...

That should be fine. I haven't used last much myself, other than a few small test cases several years back. Martin Frith is top notch and I expect last is as well.

rsharris commented 2 years ago

@rshuhuachen I've downloaded your example, _MZ_2339053_right.maf8. It does appear to have been truncated. Below is what I see at the end of the file. The final line abruptly stops at CATAAT and there isn't even a newline character after it.

Multiz reported the problem being at line 117256 which is one off from where I'm seeing the problem. I can't explain that. It's possible multiz starts counting from zero; it's hard to tell looking through the source code.

So now the question is what caused the file to be truncated. Usually this is some sort of system issue. Possibly a process moving the file from one disk to another was terminated, or such process wasn't finished (e.g. stalled) before a downstream process started reading the file. Or perhaps the process creating the file in the first place was terminated for some reason.

I don't know exactly what created _MZ_2339053_right.maf8. I think your best bet is to find that command line, rerun it, and see if the new result has more than 117257 lines.

117251 a score=94939.0
117252 s Lyrurustetrix.ScEsiA3_19038;HRSCAF=23783 37999966 1151 + 109349036 TTTTTT***
117253 s Chrysolophuspictus.JRFT01062866.1               2 1167 +      2082 TTTTTT***
117254
117255 a score=1575323.0
117256 s Lyrurustetrix.ScEsiA3_19038;HRSCAF=23783 38001117 19335 + 109349036 AGGTGC***CATAAT***(another ≈17K characters here)
117257 s Chrysolophuspictus.KZ861216.1             2084670 19381 -   2160913 AGGTAT***CATAAT
rshuhuachen commented 2 years ago

Thanks @rsharris, I was just about to post something similar. Indeed the lengths including the hyphens/mismatches are not the same for the alignment block that throws the error.

If your suspicion is that maybe it's a system issue rather than a problem with the sequences, potentially it has to do with the storage space required in the /tmp/ directory? I will play around with this and let you know! Thanks a lot for all your help

rsharris commented 2 years ago

@rshuhuachen

... potentially it has to do with the storage space required in the /tmp/ directory?

Yeah, that'd be a good thing to check. IIRC the default size of /tmp on most linux systems is pretty small by bioinformatics standards.