sebhtml / ray

Ray -- Parallel genome assemblies for parallel DNA sequencing
http://denovoassembler.sf.net
Other
65 stars 12 forks source link

Scaffolding should not use weak peaks in the distribution of outer distances #41

Closed sebhtml closed 12 years ago

sebhtml commented 12 years ago

Submitted by : e-ozer AT fsm.northwestern DOTedu

I'm doing assemblies of 100bp paired-end Illumina reads in Ray 1.7. I've been allowing Ray to automatically determine insert sizes, but I'm ending up with two peaks in the LibraryStatistics.txt files where I'm only expecting one. The insert sizes per my sequencing provider were ~ 235 bp, and Peak 0 shows an AverageOuterDistance of 205 and StandardDeviation of 45, but I'm also getting a Peak 1 with an AverageOuterDistance of 5205 and StandardDeviation of 30. Looking at the Library0.txt file, the vast majority of the insert sizes fall into the range of the lower peak, with about 215 reads falling into the upper range. Of the 13 different libraries I've assembled so far, 11 of them have this second peak. Unfortunately, these "long inserts" are being used by Ray's scaffolder to give me huge regions of N's in my scaffolds that I don't think are real. I checked with my sequencing provider and they thought it would be impossible for me to have mate pair contamination of my paired-end libraries since there's no circularization step, so I'm worried this might be a glitch in the automatic insert size determination of Ray. When I do a reference-based alignment with the same library, there seem to be no inserts larger than 517.

mscook commented 12 years ago

We have also found this problem. It's not really a bug, perhaps an oversight. I assume you're calculating zero points/gradients on LibraryStatistics.txt? Wrong/shadow libraries of low frequency (counts of say 20) can cause a 2nd Library to be detected. I suggest setting a threshold to remove artifacts or use a running average type approach to remove the blips.

Scaffolds do have large N runs as a result of this. I think this is a bug.

mscook commented 12 years ago

Any update on this?

sebhtml commented 12 years ago

Hello,

We are not doing fancy things on these distributions. One or two peaks can be detected. I think each peak is required to have enough density, but not anything robust I guess.

Ray only writes LibraryStatistics.txt and LibraryX.txt, it does not read any of these files as it is a single executable.

I think we already do that -- using a threshold. But it is constant.

The code is indeed very simple:

https://github.com/sebhtml/ray/blob/master/code/pairs/LibraryPeakFinder.cpp

The class has two methods: findPeaks (plural form) and findPeak (when something is worth checking I think).

And I think there is a stand-alone program that link with this class in unit-tests/ for debugging purposes.

Looking at the code, the problem is that the code that choose to call 1 or 2 peaks does not have enough information, but findPeak (singular form) should abort if not enough data is observed.

sebhtml commented 12 years ago

There is an occupancy threshold of 20.

I will add also a minimum frequency threshold I guess.

sebhtml commented 12 years ago

Can you provide a file (example: Library.txt) with which you got this bug ?

I will add it to my tests and thus the bug will be fixed.

Sébastien.

On Mon, 2012-01-16 at 21:28 -0800, Mitchell J Stanton-Cook wrote:

Mitchell J Stanton-Cook

sebhtml commented 12 years ago

Fixed /7ba87bd79bf095c9d0ce1741fee4565f3fecabba

sebhtml commented 12 years ago

7ba87bd79bf095c9d0ce1741fee4565f3fecabba