sebhtml / ray

Ray -- Parallel genome assemblies for parallel DNA sequencing
http://denovoassembler.sf.net

Ray v1.7. Fails to produce scaffolds in hybrid Ill-PE 454-PE assembly. Potential bug where PE-454 data with large inserts is not detected. #43

Closed mscook closed 11 years ago

mscook commented 12 years ago

Background

I believe this relates to: http://sourceforge.net/mailarchive/forum.php?thread_name=4F0F07C3.4070105%40ulaval.ca&forum_name=denovoassembler-users

454 PE-library information via newbler

pairDistanceRangeUsed = 5108..15324
computedPairDistanceAvg = 10216.7
computedPairDistanceDev = 2554.2

Protocol

1) Run convert-sff.sh XXXX.sff (gives XXXX.sff.OUT.fastq.Forward.fastq, XXXX.sff.OUT.fastq.Reverse.fastq and XXXX.sff.OUT.fastq.Single.fastq).

2) Feed in the Illumina PE reads (previously interleaved) and the forward/reverse/single 454 reads. Ray command:

ill=XXXX_proc.fastq
left=XXXX.sff.OUT.fastq.Forward.fastq
right=XXXX.sff.OUT.fastq.Reverse.fastq
single=XXXX.sff.OUT.fastq.Single.fastq

mpirun -np $NP Ray -p $left $right -i $ill -s $single -k 21 -o 21 -show-distance-summary

Output

Contigs.fasta
CoverageDistributionAnalysis.txt
CoverageDistribution.txt
degreeDistribution.txt
Library0.txt
Library1.txt
LibraryStatistics.txt
NetworkTest.txt
NumberOfSequences.txt
Rank(0-15).RayContigPaths.txt
RayCommand.txt
RayVersion.txt
SeedLengthDistribution.txt
SequencePartition.txt

cat LibraryStatistics.txt:

LibraryNumber: 0
InputFormat: TwoFiles,Paired
DetectionType: Automatic
File: XXXX.sff.OUT.fastq.Forward.fastq
NumberOfSequences: 827693
File: XXXX.sff.OUT.fastq.Reverse.fastq
NumberOfSequences: 827693
Distribution: 17/Library0.txt

LibraryNumber: 1
InputFormat: Interleaved,Paired
DetectionType: Automatic
File: XXXX_proc.fastq
NumberOfSequences: 21689626
Distribution: 17/Library1.txt
Peak 0
AverageOuterDistance: 214
StandardDeviation: 59

Library 0 (the 454 pairs) has no Peak entry, so automatic peak detection appears to have failed for the 454 data.

cat Library0.txt:

253 1
273 1
423 1
425 1
433 1
449 1
1131 1
1409 1
2146 1
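To make the failure easier to see, here is a minimal sketch that summarises a LibraryN.txt file. It assumes each entry is a "distance count" pair, as the values above suggest; the temporary file and the variable names are only illustrative.

```shell
# Sample data: the nine (distance, count) pairs reported in Library0.txt.
lib=$(mktemp)
cat > "$lib" <<'EOF'
253 1
273 1
423 1
425 1
433 1
449 1
1131 1
1409 1
2146 1
EOF

# Sum the counts and track the smallest and largest observed distance.
summary=$(awk '
    NF >= 2 {
        d = $1 + 0            # force numeric comparison
        n += $2
        if (!seen || d < min) min = d
        if (!seen || d > max) max = d
        seen = 1
    }
    END { printf "observations=%d min=%d max=%d", n, min, max }
' "$lib")
echo "$summary"
rm -f "$lib"
```

That is only 9 observed distances for 827693 read pairs, and the largest of them (2146) is well below the lower bound of 5108 that newbler reported, so there is nothing near 10 kb for a peak to be detected in.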

More to come when jobs complete.

mscook commented 12 years ago

Hi Seb,

I deleted the Library.txt file from the run. I re-ran explicitly with the insert size and s.d. calculated from newbler (10 kb, 2 kb). I then hit the critical issue (just posted). I'll re-run these jobs on a single node and provide you with the required data.

Cheers

Mitch

mscook commented 12 years ago

Hi Seb,

Everything seems fine when:

1) I explicitly pass the 454 insert size metrics from newbler
2) Multiple nodes can access the input files
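For anyone hitting the same problem, the workaround might look like the sketch below. It assumes that Ray's -p option accepts an optional average outer distance and standard deviation after the two file names (check the help output of your Ray version). The values 10000 and 2000 are a reading of the "10 kb, 2 kb" mentioned above, and NP=16 is inferred from the Rank(0-15) output files; the exact values used in the re-run are not stated in this thread. The command is only printed here, not launched.

```shell
# Dry run: build the command and print it instead of starting MPI.
NP=16       # assumption: 16 ranks, matching the Rank(0-15) output files
ill=XXXX_proc.fastq
left=XXXX.sff.OUT.fastq.Forward.fastq
right=XXXX.sff.OUT.fastq.Reverse.fastq
single=XXXX.sff.OUT.fastq.Single.fastq
avg=10000   # assumption: average outer distance, "10 kb" (newbler: 10216.7)
sd=2000     # assumption: standard deviation, "2 kb" (newbler: 2554.2)

# Same command as in the original report, with the two metrics added after -p.
cmd="mpirun -np $NP Ray -p $left $right $avg $sd -i $ill -s $single -k 21 -o 21 -show-distance-summary"
echo "$cmd"
```

Once the printed command looks right, run it directly, making sure the input files are on storage that every node can read.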

sebhtml commented 12 years ago

What is the content of SeedLengthDistribution.txt?

Your Library0.txt indicates short seeds.

sebhtml commented 11 years ago

Ping

sebhtml commented 11 years ago

Will close as WONTFIX if the stakeholder does not report back. We need the LibraryX.txt file to fix the unit test. Otherwise, WONTFIX.