Closed (GoogleCodeExporter closed this issue 8 years ago)
Hi Xiaoping,
I'll take a look at the fragments error tomorrow.
Cheers,
// Michael
Original comment by snowneb...@gmail.com
on 18 Oct 2009 at 5:38
Dear Michael:
The bug was found! Specifically, your MosaikDupSnoop generated a SQLite3 database with four tables:
OrphanFragments, PairedFragments, ReadGroups, SingleFragments
However, MosaikSort was looking for a table named "Fragments" (it should look for "PairedFragments") during duplicate filtering, which is why the error message was "no such table: Fragments". When I manually created a "Fragments" table from "PairedFragments" in the database, MosaikSort was able to remove duplicate fragments.
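The manual fix can be sketched with Python's sqlite3 module. Only the two table names ("Fragments" and "PairedFragments") come from this thread; the column layout below is a placeholder, since the actual DupSnoop schema is not shown here:

```python
import sqlite3

# Pre-fix MosaikSort queries a table named "Fragments", while MosaikDupSnoop
# writes "PairedFragments". The columns here are illustrative placeholders.
con = sqlite3.connect(":memory:")  # in practice, open the .db file DupSnoop produced
con.execute("CREATE TABLE PairedFragments (read_name TEXT, ref_begin INT, ref_end INT)")
con.execute("INSERT INTO PairedFragments VALUES ('frag1', 100, 350)")

# The workaround: mirror PairedFragments under the name MosaikSort expects.
con.execute("CREATE TABLE Fragments AS SELECT * FROM PairedFragments")

print(con.execute("SELECT * FROM Fragments").fetchall())
# -> [('frag1', 100, 350)]
```

The same effect can be had by opening the generated database in the sqlite3 shell and running the single CREATE TABLE ... AS SELECT statement.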
However, I would still like to remind you that MosaikDupSnoop has been extremely and strangely slow at generating the database for 7 lanes of Illumina data (about 50M paired-end reads). In fact, I had to abort MosaikDupSnoop after more than 48 hours.
It would therefore be very helpful if you could look into the speed of MosaikDupSnoop, since duplicate removal is extremely important for SNP and indel detection.
I greatly appreciate your work and help!
Xiaoping
Original comment by xiaoping...@stjude.org
on 23 Oct 2009 at 2:52
Thanks Xiaoping,
I had forgotten to make the necessary changes after one of my 1000 Genomes project tests. Thanks for reminding me - I'll make the changes today.
The SQLite3 database is very disk-I/O intensive, so the only recommendation I can make is to run it on fast, local hard disks rather than network storage.
MosaikDupSnoop was designed this way so that it could take an entire directory of Mosaik alignment archives into consideration. I could make a version that handles one specific (perhaps merged) file much more quickly. I'll look into it over the weekend.
Thanks!
// Michael
Original comment by snowneb...@gmail.com
on 23 Oct 2009 at 3:00
Dear Michael:
It's great to know that the slowness was caused by the network storage. I will definitely try running it on fast local disks today.
However, it is very hard to avoid network storage once more samples are being sequenced, so I would really appreciate a version that handles one specific merged file much more quickly.
Again, my sincere gratitude for your help!
Xiaoping
Original comment by xiaoping...@stjude.org
on 23 Oct 2009 at 3:21
Michael and Xiaoping,
When I try to use DupSnoop to inspect a specific alignment archive, it does not generate a "sequencing library". After reviewing these posts, I am obviously doing something wrong (no surprise). Either I don't understand what is meant by "sequencing library" in this context, or something else is going on; specifically, when DupSnoop states "Databases for the following libraries will be created:", no databases are created.
I have pasted the command line I used for DupSnoop:
MosaikDupSnoop -in 33105n.bin.aligned -od fragData/
------------------------------------------------------------------------------
MosaikDupSnoop 1.0.1307 2009-10-14
Michael Stromberg Marth Lab, Boston College Biology Department
------------------------------------------------------------------------------
- resolving the following types of read pairs: [unique orphans] [unique vs
unique]
[unique vs multiple]
Scanning the following alignment archives:
- 33105n.bin.aligned
Databases for the following libraries will be created:
-
Creating databases... finished.
Parsing 33105n.bin.aligned:
- recording unique read lengths:
100%[=====================================] 53,401.1 reads/s in 01:29
Consolidating fragments from library:
- no paired-end fragments found. skipping library.
Thanks,
Jeff
Original comment by jstevens...@gmail.com
on 23 Oct 2009 at 4:34
Hi, Jeff:
Your problem was caused by how you used MosaikBuild. Specifically, you have to specify the library name (-ln) when you build the binary reads file with MosaikBuild. Without a library name, DupSnoop cannot create a database file name in the output folder (e.g. fragData).
Cheers!
Xiaoping
Original comment by xiaoping...@stjude.org
on 23 Oct 2009 at 5:19
Xiaoping,
Thanks!
I'll try that.
Jeff
Original comment by jstevens...@gmail.com
on 23 Oct 2009 at 5:27
Dear Michael:
I ran MosaikDupSnoop on a 64-bit Windows PC with 32GB of memory, a 500GB local hard disk, and 4 processors. I was able to successfully run DupSnoop on 7 lanes of transcriptome Illumina data (the alignment file is about 9GB, single-end reads, with lots of PCR artifacts from RNA-seq) within three hours, which is perfectly fine. The resulting SQLite3 database file is about 5GB.
However, when I ran DupSnoop on 14 lanes of genome Illumina data (the alignment file is about 20GB, single-end reads, alignment mode=unique) on the same 64-bit PC, DupSnoop was not able to finish. Specifically, it was stuck at 0.000 reads/s after it had generated a 14GB SQLite3 database file and analyzed 90% of the alignment file.
Thanks very much!
Xiaoping
Original comment by xiaoping...@stjude.org
on 28 Oct 2009 at 2:34
Hi Xiaoping,
I'll try to look into the performance issue of DupSnoop at a later date.
Since the original bug with the table names has been fixed, I'll close this bug report.
Cheers,
// Michael
Original comment by snowneb...@gmail.com
on 16 Jan 2010 at 2:09
Original issue reported on code.google.com by
xiaoping...@stjude.org
on 18 Oct 2009 at 5:22