interpretation of graph build log

roddypr commented 7 years ago

Hi,

I'm running mccortex using make-pipeline.pl.

I have an example where I have three batches of paired-end sequences for the same individual. I've noticed that the log file for the initial graph building says that all reads are read as single-ended reads. @yeban has seen the same issue in his runs of mccortex as well.

Is this correct?

Best wishes,

Roddy

[12 May 2017 12:06:53-JaC][seq] Loaded 5,082,951 reads and 0 reads pairs (file: input/batch1/run3/WTCHG_295747_705501_1.fastq.gz)
[12 May 2017 12:06:58-JaC][seq] Loaded 5,082,951 reads and 0 reads pairs (file: input/batch1/run3/WTCHG_295747_705501_2.fastq.gz)
[12 May 2017 12:07:37-JaC][seq] Loaded 5,680,996 reads and 0 reads pairs (file: input/batch1/run2/WTCHG_297318_705501_1.fastq.gz)
[12 May 2017 12:07:42-JaC][seq] Loaded 5,680,996 reads and 0 reads pairs (file: input/batch1/run2/WTCHG_297318_705501_2.fastq.gz)
[12 May 2017 12:07:44-JaC][seq] Loaded 5,806,822 reads and 0 reads pairs (file: input/batch1/run1/WTCHG_283923_705501_1.fastq.gz)
[12 May 2017 12:07:46-JaC][seq] Loaded 5,806,822 reads and 0 reads pairs (file: input/batch1/run1/WTCHG_283923_705501_2.fastq.gz)
[12 May 2017 12:07:46-JaC][hasht] buckets: 67,108,864 [2^26]; bucket size: 48;
[12 May 2017 12:07:46-JaC][hasht] memory: 24.1GB; filled: 615,427,529 / 3,221,225,472 (19.11%)
[12 May 2017 12:07:46-JaC][hasht]  collisions  0: 615427529
[12 May 2017 12:07:46-JaC][task] input: input/batch1/run1/WTCHG_283923_705501_1.fastq.gz colour: 0
[12 May 2017 12:07:46-JaC]  SE reads: 33,141,538  PE reads: 0
[12 May 2017 12:07:46-JaC]  good reads: 33,139,955  bad reads: 1,583
[12 May 2017 12:07:46-JaC]  dup SE reads: 0  dup PE pairs: 0
[12 May 2017 12:07:46-JaC]  bases read: 4,971,230,700  bases loaded: 4,730,385,634
[12 May 2017 12:07:46-JaC]  num contigs: 34,917,549  num kmers: 3,682,859,164 novel kmers: 615,427,529
[12 May 2017 12:07:46-JaC][task] input: input/batch1/run1/WTCHG_283923_705501_2.fastq.gz colour: 0
[12 May 2017 12:07:46-JaC]  SE reads: 0  PE reads: 0
[12 May 2017 12:07:46-JaC]  good reads: 0  bad reads: 0
[12 May 2017 12:07:46-JaC]  dup SE reads: 0  dup PE pairs: 0
[12 May 2017 12:07:46-JaC]  bases read: 0  bases loaded: 0
[12 May 2017 12:07:46-JaC]  num contigs: 0  num kmers: 0 novel kmers: 0

The entries for the remaining input files all say SE reads: 0 PE reads: 0.

roddypr commented 7 years ago

Hi, I wonder if you've had time to take a look at this issue! Best wishes, Roddy

benjeffery commented 7 years ago

To use the information from paired reads you need to use threading. See "Read Threading" on this page: https://github.com/mcveanlab/mccortex/wiki/Workflow-Examples#ReadThreading

yeban commented 7 years ago

Thanks @benjeffery. So, paired reads are treated as single end reads during graph construction, right? Then, why not build two graphs using --seq option: one for forward and one for reverse reads, instead of specifying both with --seq2 option? Is this information stored in the graph somehow and used in the read threading step?

noporpoise commented 6 years ago

Apologies for the late response. McCortex graph construction is a multi-step process. make-pipeline.pl is designed to run all these steps for you. The steps are:

building a traditional de Bruijn graph (i.e. kmer graph)
error cleaning the graph
threading reads through the graph (as single reads) to create graph links
error cleaning the graph links
threading reads through the graph (as pairs of reads and filling in the insert gap) to create longer graph links
error cleaning the graph links

The final output is a de bruijn graph (.ctx file) and a corresponding links file (.ctp file)

I'm unclear which step the output you posted is from. The graph construction has no notion of forward/reverse reads - I may have misunderstood what you meant there. Hopefully I've answered your questions - please re-open the issue if not.

yeban commented 6 years ago

Thanks, @noporpoise. Yes, that answers our questions. The output we posted was from mccortex build command.

mcveanlab / mccortex

interpretation of graph build log #58