mcveanlab / mccortex

De novo genome assembly and multisample variant calling
https://github.com/mcveanlab/mccortex/wiki
MIT License
113 stars 25 forks source link

interpretation of graph build log #58

Closed roddypr closed 6 years ago

roddypr commented 7 years ago

Hi,

I'm running mccortex using make-pipeline.pl.

I have an example where I have three batches of paired-end sequences for the same individual. I've noticed that the log file for the initial graph building says that all reads are read as single-ended reads. @yeban has seen the same issue in his runs of mccortex as well.

Is this correct?

Best wishes,

Roddy

[12 May 2017 12:06:53-JaC][seq] Loaded 5,082,951 reads and 0 reads pairs (file: input/batch1/run3/WTCHG_295747_705501_1.fastq.gz)
[12 May 2017 12:06:58-JaC][seq] Loaded 5,082,951 reads and 0 reads pairs (file: input/batch1/run3/WTCHG_295747_705501_2.fastq.gz)
[12 May 2017 12:07:37-JaC][seq] Loaded 5,680,996 reads and 0 reads pairs (file: input/batch1/run2/WTCHG_297318_705501_1.fastq.gz)
[12 May 2017 12:07:42-JaC][seq] Loaded 5,680,996 reads and 0 reads pairs (file: input/batch1/run2/WTCHG_297318_705501_2.fastq.gz)
[12 May 2017 12:07:44-JaC][seq] Loaded 5,806,822 reads and 0 reads pairs (file: input/batch1/run1/WTCHG_283923_705501_1.fastq.gz)
[12 May 2017 12:07:46-JaC][seq] Loaded 5,806,822 reads and 0 reads pairs (file: input/batch1/run1/WTCHG_283923_705501_2.fastq.gz)
[12 May 2017 12:07:46-JaC][hasht] buckets: 67,108,864 [2^26]; bucket size: 48;
[12 May 2017 12:07:46-JaC][hasht] memory: 24.1GB; filled: 615,427,529 / 3,221,225,472 (19.11%)
[12 May 2017 12:07:46-JaC][hasht]  collisions  0: 615427529
[12 May 2017 12:07:46-JaC][task] input: input/batch1/run1/WTCHG_283923_705501_1.fastq.gz colour: 0
[12 May 2017 12:07:46-JaC]  SE reads: 33,141,538  PE reads: 0
[12 May 2017 12:07:46-JaC]  good reads: 33,139,955  bad reads: 1,583
[12 May 2017 12:07:46-JaC]  dup SE reads: 0  dup PE pairs: 0
[12 May 2017 12:07:46-JaC]  bases read: 4,971,230,700  bases loaded: 4,730,385,634
[12 May 2017 12:07:46-JaC]  num contigs: 34,917,549  num kmers: 3,682,859,164 novel kmers: 615,427,529
[12 May 2017 12:07:46-JaC][task] input: input/batch1/run1/WTCHG_283923_705501_2.fastq.gz colour: 0
[12 May 2017 12:07:46-JaC]  SE reads: 0  PE reads: 0
[12 May 2017 12:07:46-JaC]  good reads: 0  bad reads: 0
[12 May 2017 12:07:46-JaC]  dup SE reads: 0  dup PE pairs: 0
[12 May 2017 12:07:46-JaC]  bases read: 0  bases loaded: 0
[12 May 2017 12:07:46-JaC]  num contigs: 0  num kmers: 0 novel kmers: 0

The entries for the remaining input files all say SE reads: 0 PE reads: 0.

roddypr commented 7 years ago

Hi, I wonder if you've had time to take a look at this issue! Best wishes, Roddy

benjeffery commented 7 years ago

To use the information from paired reads you need to use threading. See "Read Threading" on this page: https://github.com/mcveanlab/mccortex/wiki/Workflow-Examples#ReadThreading

yeban commented 7 years ago

Thanks @benjeffery. So, paired reads are treated as single end reads during graph construction, right? Then, why not build two graphs using --seq option: one for forward and one for reverse reads, instead of specifying both with --seq2 option? Is this information stored in the graph somehow and used in the read threading step?

noporpoise commented 6 years ago

Apologies for the late response. McCortex graph construction is a multi-step process. make-pipeline.pl is designed to run all these steps for you. The steps are:

The final output is a de bruijn graph (.ctx file) and a corresponding links file (.ctp file)

I'm unclear which step the output you posted is from. The graph construction has no notion of forward/reverse reads - I may have misunderstood what you meant there. Hopefully I've answered your questions - please re-open the issue if not.

yeban commented 6 years ago

Thanks, @noporpoise. Yes, that answers our questions. The output we posted was from mccortex build command.