marbl / SALSA

SALSA: A tool to scaffold long read assemblies with Hi-C data
MIT License
178 stars 47 forks source link

assembly error detection isn't much efficient on my data #34

Closed jeanlain closed 6 years ago

jeanlain commented 6 years ago

I'm using a salsa 2 version that I downloaded about a week ago and I'm trying to correct/assemble a genome that we know contains thousands of errors, in which different chromosomes are joined in the same scaffolds. (See https://academic.oup.com/gbe/article/10/2/507/4817508#110136596, second-to-last paragraph of the discussion.)

Using Hi-C (chicago) data mapped with the recommended Arima genomics pipeline on this genome, salsa 2 detected only 63 errors (as listed in the input_breaks file). Note that the draft assembly onto which I mapped the Hi-C reads is made of scaffolds (not unitigs) and doesn't come with a gfa file (it was assembled by somebody else a long time ago). Note also that HiRise corrected ~7700 errors on this genome (I haven't run HiRise myself, dovetail did. Salsa is the only program I used so far).

I looked more closely at the data, focusing on a scaffold that we know contains at least 4 assembly errors (the scaffold outlined in the supplementary material: https://academic.oup.com/gbe/article/10/2/507/4817508#110136605). I computed the physical coverage of Hi-C reads on this scaffold, based on the bed file that was given to salsa.

image

The plot shows the physical coverage of all read pairs whose mates are >500bp appart (excluding these pairs just makes a cleaner curve). The only clear drop in coverage is around 180 kb. It was actually among the very few to be detected by salsa. I then plotted an image like those made by Dudchenko et al. 2016 (http://science.sciencemag.org/highwire/filestream/691788/field_highwire_adjunct_files/0/Dudchenko_SM.pdf, figure S5.)

image

The window size is 5000bp. Five breaks appear starkly, 4 of which correspond to the 4 known assembly errors. It's also clear that, in this scaffold, parts of the X chromosome were somehow inserted into an autosome during assembly, as the number of links drops to zero in a large region, then increases again. This is why the physical coverage over an assembly errors can still be substantial. To show this, I plotted the physical coverage on this scaffold, but excluding pairs of mates that are distant from more than 200kb:

image

Unless I'm mistaken, the current method used by salsa to detect assembly errors could be improved, as it may be fooled by certain types of errors.

ghuryejay commented 6 years ago

Thank you for reporting this. We will add an option to support mate pairs upto certain separation to be used for misassembly correction. I will keep this issue open until we add this feature.

jeanlain commented 6 years ago

On my data, I get best results by only considering pairs of mates that are distant from 20 kb to 40-50kb. Removing pairs of reads that are too close reduces the noise.

jeanlain commented 6 years ago

By selecting read pairs according to the criteria mentioned above and establishing the topography of physical coverage, I found ~1500 errors in the assembly. Still much less than the 7700 breaks made by HiRise, but my detection is conservative. I chose a fixed minimal physical coverage of 10X to delineate a "valley" (= a break), and I merged two valleys if these were separated by a "peak" that isn't higher by more than 20X above both valleys. I'm not sure how salsa delineates breaks. Hopefully, it does something better than my custom script.

Still, I believe that the physical coverage, although straightforward, can only be so sensitive to detect errors. It seems that much more could be extracted from the coordinate of mapped read pairs, as the matrix image posted above clearly outlines breaks better than the physical coverage topography. Dudchenko et al.'s solution of using a sliding triangle along the matrix's diagonal doesn't look optimal either, since it would be handicapped by the off-diagonal clusters (which could in fact help to detect breaks rather than being an hindrance).

ghuryejay commented 6 years ago

Hi Jeanlain, I have modified the code to implement some the ideas you mentioned and it seems to work well on the genomes we have to test with. Please use the latest code on your genome and let us know if it works better now on the genome you were looking at.

ViriatoII commented 5 years ago

Hey, can I ask you for the code to do these physical coverage graphs? They are great!