Obtaining second haploid assembly from assembly mode 2

tododge commented 1 year ago

Is it possible to obtain 2 partially phased haploid assemblies in mode 2?

In assembly mode 2, I see that one contiguous pseudo-haplotype is outputted (Assembly-Haploid.fasta) and is chosen based on read support. Is it possible to get the second (less supported) haplotype in this more contiguous form that also contains the unphased regions (for example, an Assembly-Haploid2.fasta)? I am currently using Assembly-Phased.fasta, but would find the Assembly-Haploid.fasta and Assembly-Haploid2.fasta output very useful, despite presence of switch errors.

I am using shasta-Linux-0.11.1 --config Nanopore-Phased-Jan2022.

Thank you!

paoloshasta commented 1 year ago

That is a valid request and is entirely feasible. If I understand it correctly, the contigs of Assembly-Haploid2.fasta would be obtained in the same way as what is done for Assembly-Haploid.fasta, that is by concatenating unphased regions with one haplotype of each phased region. But the haplotype chosen for phased regions would be the one not used for Assembly-Haploid.fasta, correct?

I will look into adding this, but I can't promise how soon it will happen, if at all. In the meantime, it might be possible to get what you want by concatenating contigs of Assembly-Detailed.fasta and with the help of the various csv files that describe the assembly. If you want to try this I can help.

tododge commented 1 year ago

Yes, what you described above is exactly what I want, thank you!

In the meantime, I'd like to try the workaround you mention using the Assembly-Detailed.fasta and the csv files, but am not sure which ones are relevant.

paoloshasta commented 1 year ago

The csv files I was referring to are Assembly-Phased.csv, Assembly-Haploid.csv, and Assembly-Detailed.csv. These are not well documented, but I will provide you with more information very soon. The contigs in Assembly-Phased.fasta and Assembly-Haploid.fasta are obtained by concatenating contigs in Assembly-Detailed.fasta, and the csv files might contain sufficient information for what you want to do. But again, might means that I am not 100% sure of that.

More on this soon, probably tomorrow.

paoloshasta commented 1 year ago

Also Assembly-Phased-Details.csv.

paoloshasta commented 1 year ago

Actually, after reviewing the code I concluded that the csv files don't contain enough information to do what you want. However adding this additional output to the code is relatively straightforward, and I will plan on implementing it this week. But I will only be able to do minimal testing, and so I would rely on you for more extensive testing to confirm that the additional output is correct - although it would include many assembly errors due to selecting the weaker haplotype, in addition to the switch errors also present in the current Assembly-Haploid.fasta.

I do not plan to create a new Shasta release immediately, but you will be able to download from GitHub a test build that includes the new functionality.

tododge commented 1 year ago

I see, thank you for implementing it! I look forward to testing once the test build becomes available.

I have a couple clarifying questions about the additional expected errors you mention. Would you expect them to arise in only the regions with phasing information (PR), or also the unphased regions (UR). In other words, will the regions corresponding to UR in the Assembly-Phased.fasta be exactly identical in Assembly-Haploid.fasta and Assembly-Haploid2.fasta? Relatedly, what information is contained in Assembly-Detailed.fasta versus Assembly-Phased.fasta that makes it a better source to generate the second haplotype?

paoloshasta commented 1 year ago

In the new "weak haploid output", like I decided to call it, the unphased regions (UR) will be identical to the ones in the current haploid output. The additional errors will happen in the phased regions (PR) only. Consider a false heterozygous locus generated by errors in the reads. The strongest branch is still correct and so the existing haploid output is not affected, but the weaker branch is entirely caused by errors in the reads, and that will make it into the "weak haploid" version of that PR.

Assembly-Detailed.fasta contains the details of all the small heterozygous bubbles which are phased during the assembly to generate the large heterozygous bubbles in Assembly-Phased.fasta. But the csv files don't contain enough information to know which side of each small heterozygous bubble was used when generating haploid output.

Depending on what you are trying to do, it might also make sense to get two haplotypes by just concatenating the UR and PR sections of each bubble chain, using Assembly-Phased.csv for guidance. For the first haplotype you just use the PR.*.*.*.0 and for the second haplotype you use the PR.*.*.*.1. This would have a much smaller number of switch errors (in most cases, only when transitioning from a PR to a different PR). However neither of the haplotypes obtained in this way would benefit from the lower error rate that would occur if one were to choose the strongest side of each small branch, as is done in the current haploid output.

If this solution is more appealing to you, let me know and I will not implement "weak haploid output". And let me know if any clarifications are necessary.

There is some information in the Shasta documentation that might help clarify some of the above. Take a look if you haven't already:

https://paoloshasta.github.io/shasta/ComputationalMethods.html#Mode2Output

https://paoloshasta.github.io/shasta/ComputationalMethods.html#Mode2Assembly

tododge commented 1 year ago

Thank you for the clarification! The suggestion of concatenating the UR/PR*1 and UR/PR*2 of each bubble chain is closer to what I'm looking for.

Is this there an option in shasta to generate this output or documented way of doing this? My main concerns are how to deal with when more complex tangled regions occur and numeric segments that are not part of a bubble (even though some look like they should be part of a bubble in the gfa files). I've included an example in the attached image.

Assembly-Phased gfa_numeric_segment_bubble_example

paoloshasta commented 1 year ago

There is no existing way to do that, but my hope is that you will implement one, hopefully in the form of a Python script, and contribute it so I can add it to shasta/scripts and document it for others to use.

This should be easy to do and does not even require parsing Assembly-Phased.csv - you can just use Assembly-Phased.fasta as input to generate your two output files. Each bubble chain appears contiguously and in order in Assembly-Phased.fasta, and in addition each contig appears on a single line of the file, so the process would go like this:

Read one header line from Assembly-Phased.fasta. If the sequence name is numeric (that is, does not begin with UR or PR), skip it and the following line, which contains the sequence. That way your output will not contain the messy regions.
If the name begins with UR, parse it as UR.bubbleChain.position, and if it begins with PR, parse it as PR.bubbleChain.position.component.haplotype.
If this is a new bubbleChain value start a new contig in each of your two output files.
Read the sequence on the next line.
If it is an UR, append it to both of your output files.
If it is a PR, append it to the output file corresponding to the haplotype value you parsed. Note the haplotypes are numbered 0 and 1, not 1 and 2.
Repeat until you reach the end of the file.

So for example here are the header lines (with the ">" removed) for the beginning of bubble chain 5 in a test assembly:

PR.5.1.6.0 1485085 PR.5.1.6.1 1484801 UR.5.2 683 PR.5.3.3911.0 2 PR.5.3.3911.1 0 UR.5.4 846 PR.5.5.6.0 271440 PR.5.5.6.1 271917 UR.5.6 472484 ...

Your first output haplotype would be obtained by concatenating the UR.5.* and the PR.5.*.*.0:

PR.5.1.6.0 UR.5.2 PR.5.3.3911.0 UR.5.4 PR.5.5.6.0 UR.5.6 And your second output haplotype would be obtained by concatenating the the UR.5.* and the PR.5.*.*.1:

PR.5.1.6.1 UR.5.2 PR.5.3.3911.1 UR.5.4 PR.5.5.6.1 UR.5.6

If you don't feel comfortable attempting this let me know and I will do it for you.

The main benefit of this approach will be the much lower rate of switch error. Also, the output generated in this way will agree in detail with the contents of Assembly-Phased.fasta.

paoloshasta commented 1 year ago

Regarding those puzzling segments in the last picture you posted: could you post the same picture, with names on all segments shown? You can just turn on the segment names in Bandage without having to manually label the segments. Also please turn on the arrows showing segment directions.

paoloshasta commented 1 year ago

Also please include more segments to the left of 58590 and 160529. One or two should be enough. If the segments don't reconverge, they should not be assigned to any bubble chain.

tododge commented 1 year ago

Thank you for the explanation -- I definitely understand the concept relating to the PR/UR but may have implementing in python, as I primarily script in bash and R. I can take a crack at it though, once I understand how to integrate the numeric segments.

Here is an image with the two kinds of numeric segments I see in the .gfa files, some (left) where it seems like it should be possible to assign randomly to one haplotype or the other and others (right tangle) where it does not seem possible.

Assembly-Phased gfa_numeric_segment_bubble_tangle

paoloshasta commented 1 year ago

So your concern is that segments 258331 and 2724234 are not assigned to bubble chain 126, together with the PR.126.* segments on their right? Can you show what is on the left of 258331 and 2724234? Also please display the arrows as occasionally these messy regions contain inversions that cannot be resolved.

For your purposes, I think you should just entirely ignore the numeric segments. They belong to regions that could not be resolved into bubble chains, so essentially unresolved repeats. I only include them for completeness, but I have not found any situation in which they were useful. I am working on new assembly methods that should significantly improve the handling of the repetitive regions that generate these tangles, but this is work in progress, and it is not yet clear if it will support R9 reads.

If you can create a bash script, I can certainly add it to shasta/scripts, because it would generate no additional dependencies for Shasta users. R is a different story, but I would be willing to add it if it fails gracefully if R is not installed, with a message describing what happened. So possibly a wrapper bash script that checks for R availability, then calls the R script, and issues a message otherwise.

tododge commented 1 year ago

Yes, exactly, these are not assigned to the bubble chain 126 despite possibly being haplotypes. It looks like from the .gfa that no bubble chains ever end in a PR, so couldn't these two be a phased region that isn't noted as such because it's at the end of a bubble chain.

In this case, 258331 and 2724234 have high homology at the start and end, and the end matches the canonical telomeric sequence. So, it seems to me that they're possibly haplotypes even though they're not denoted as part of the same bubble.

258331 TGGAACTGAGAGAAGTGATGCATCGATGTTTCTGGGTTT...58124...TTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTAAGGGG 2724234 TGGAACTGAGAGAAGTGATGCATCGATGTTTCTGGGTTT...88703...TTAGGGGTTAGGGTAGGGTTAGGGTAGGGTTAGGGG

Nothing is to the left of these nodes. I've looked through the bandage documentation and have only found "arrows" by drawing the graph in double style, is this what you were looking for?

Assembly-Phased gfa_numeric_segment_bubble_tangle_double

paoloshasta commented 1 year ago

No, I find that Bandage "double style" confusing. Please keep Style set at the default "Single", but go to Tools/Settings and under "Graph appearance" set "Arrowheads in single node style" to "On". That will no longer duplicate segments, but each segment will be drawn as before, but with an arrow according to its direction as specified in the gfa file.

In any event, it looks like those two numeric segments are dangling out of the two PR.126.1.314.* segments. I would only assign those two segments to a bubble chain if they reconverged together. If they don't, I means that we don't know what happens on the other side and therefore I conservatively don't add them to bubble chains. I think this is the right thing to do. Situations like these tend to happen on very short segments, so not much total sequence is lost this way.

paoloshasta commented 1 year ago

You are correct that for telomeric sequence this could be considered questionable. On the other hand, due to the repetitive nature of these regions, you don't really know if those two segments align according to the alignment you showed. It is possible that both of them have missing sequence on the telomere side, and that the correct alignment is different. So for lack of a reliable anchor on the left side, I prefer not to assign them to bubble chains.

You could consider parsing the GFA, looking for dangling numeric segments at the beginning/end of a bubble chain. But it is likely that this process would also find similar situations where identifying the two segments with haplotypes may be questionable. And that is certainly more complex than just parsing the fasta file like I described.

tododge commented 8 months ago

Hello again! I have tried for some time to implement a bash script following the detailed advice you posted above, but am afraid this problem is beyond my current abilities. I am wondering if you would still be willing to write a script that concatenates the PR and UR contigs in each bubble chain from the Assembly-Phased.fasta. Thank you for your consideration!

paoloshasta commented 8 months ago

I can give this a cut. So to be clear on expectations: each bubble chain would generate two haplotypes, each a concatenation of PR and UR sequences in the correct order. Because the UR regions are not phased relatively to each other, this means that there could a switch error at each UR region. On average, the number of switch errors should be about half the number of UR regions in the bubble chain.

Please confirm that this is what you want.

tododge commented 8 months ago

Yes, this is precisely what I want. Thank you!

paoloshasta commented 8 months ago

Ok, give me a few days.

paoloshasta commented 8 months ago

Would it be more convenient to output the two resulting haplotypes to a single fasta file or to two separate fasta files?

tododge commented 8 months ago

I think 2 files would be more convenient, if possible!

paoloshasta commented 8 months ago

I added shasta/scripts/GenerateRandomHaplotypes.py. You can download it from the repository, but for convenience I also attached it here (with a .txt extension appended, otherwise GitHub would not accept a .py file).

GenerateRandomHaplotypes.py.txt

After dowloading it and renaming it to remove the .txt extension, you can just use chmod to make it executable and no additional installation is necessary. It will run as long as python3 is available, which it should be on all current Linux systems.

You should invoke it from the assembly directory. Invoke it with --help for more information on its input and output.

The name of the script and of the output file includes "Random" to remind users that generating haplotypes in this way introduces switch errors.

I only did minimal testing, and error diagnostics are also minimal. If you bump into any problems, questions, or issues, please post here.

This file will also be available in the scripts and bin directories of future Shasta releases.

paoloshasta commented 6 months ago

I am closing this due to lack of discussion. Feel free to open a new issue if additional discussion topics emerge.

paoloshasta / shasta

Obtaining second haploid assembly from assembly mode 2 #17