mcveanlab / mccortex

De novo genome assembly and multisample variant calling
https://github.com/mcveanlab/mccortex/wiki
MIT License

Links files are really big #64

Closed by winni2k 6 years ago

winni2k commented 6 years ago

My links files are as big as my input fastq files.

I took a look at one of my links files and I saw this:

#   written by Isaac Turner <turner.isaac@gmail.com>
#   url: https://github.com/mcveanlab/mccortex
# 
# Comment lines begin with a # and are ignored, but must come after the header
# Format is:
#   [kmer] [num_paths] ...(ignored)
#   [FR] [num_juncs] [counts0,counts1,...] [juncs:ACAGT] [seq=... juncpos=... ...]
#
# Columns are separated by a single space.
# Columns 1-4 are required ([FR],..,[juncs]) everything after than is optional

CAAAGCAGCCTTTGCTGAACCTTCATATTGTAGCCCTATTCTTAAGC 87
R 1 4 A seq=GCTTAAGAATAGGGCTACAATATGAAGGTTCAGCAAAGGCTGCTTTGTATAAA juncpos=5
R 1 42 T seq=GCTTAAGAATAGGGCTACAATATGAAGGTTCAGCAAAGGCTGCTTTGTATAAT juncpos=5
R 2 30 TG seq=GCTTAAGAATAGGGCTACAATATGAAGGTTCAGCAAAGGCTGCTTTGTATAATGTATCTGAG juncpos=5,14
R 3 2 TGA seq=GCTTAAGAATAGGGCTACAATATGAAGGTTCAGCAAAGGCTGCTTTGTATAATGTATCTGAGA juncpos=5,14,15
R 4 14 TGAA seq=GCTTAAGAATAGGGCTACAATATGAAGGTTCAGCAAAGGCTGCTTTGTATAATGTATCTGAGAA juncpos=5,14,15,16
R 5 12 TGAAC seq=GCTTAAGAATAGGGCTACAATATGAAGGTTCAGCAAAGGCTGCTTTGTATAATGTATCTGAGAACAC juncpos=5,14,15,16,19
R 6 30 TGAACT seq=GCTTAAGAATAGGGCTACAATATGAAGGTTCAGCAAAGGCTGCTTTGTATAATGTATCTGAGAACACT juncpos=5,14,15,16,19,20
R 7 39 TGAACTT seq=GCTTAAGAATAGGGCTACAATATGAAGGTTCAGCAAAGGCTGCTTTGTATAATGTATCTGAGAACACTTCT juncpos=5,14,15,16,19,20,23
R 8 41 TGAACTTT seq=GCTTAAGAATAGGGCTACAATATGAAGGTTCAGCAAAGGCTGCTTTGTATAATGTATCTGAGAACACTTCTGGTGTT juncpos=5,14,15,16,19,20,23,29
R 9 22 TGAACTTTT seq=GCTTAAGAATAGGGCTACAATATGAAGGTTCAGCAAAGGCTGCTTTGTATAATGTATCTGAGAACACTTCTGGTGTTCTGTGT juncpos=5,14,15,16,19,20,23,29,35
R 10 8 TGAACTTTTT seq=GCTTAAGAATAGGGCTACAATATGAAGGTTCAGCAAAGGCTGCTTTGTATAATGTATCTGAGAACACTTCTGGTGTTCTGTGTTT juncpos=5,14,15,16,19,20,23,29,35,37
R 11 5 TGAACTTTTTA seq=GCTTAAGAATAGGGCTACAATATGAAGGTTCAGCAAAGGCTGCTTTGTATAATGTATCTGAGAACACTTCTGGTGTTCTGTGTTTAA juncpos=5,14,15,16,19,20,23,29,35,37,39
R 12 22 TGAACTTTTTAT seq=GCTTAAGAATAGGGCTACAATATGAAGGTTCAGCAAAGGCTGCTTTGTATAATGTATCTGAGAACACTTCTGGTGTTCTGTGTTTAAT juncpos=5,14,15,16,19,20,23,29,35,37,39,40
R 14 4 TGAACTTTTTATAG seq=GCTTAAGAATAGGGCTACAATATGAAGGTTCAGCAAAGGCTGCTTTGTATAATGTATCTGAGAACACTTCTGGTGTTCTGTGTTTAATAG juncpos=5,14,15,16,19,20,23,29,35,37,39,40,41,42

Kiran tells me the seq= and juncpos= annotations are not strictly necessary. Would it be possible to add an argument to the thread command to suppress the output of these annotations?
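For reference, the link-line layout described in the file's own comment lines can be read with a short parser, which also makes it clear which fields are required and which are optional. This is only a sketch based on the format comment quoted above; the `Link` class and `parse_link_line` function are hypothetical names for illustration and are not part of McCortex.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    direction: str   # "F" or "R"
    num_juncs: int   # number of junction choices
    counts: list     # comma-separated counts from column 3, as ints
    juncs: str       # one base per junction choice
    extras: dict = field(default_factory=dict)  # optional key=value columns


def parse_link_line(line):
    """Parse one link line: [FR] [num_juncs] [counts] [juncs] [key=value ...]."""
    cols = line.rstrip("\n").split(" ")
    if len(cols) < 4 or cols[0] not in ("F", "R"):
        raise ValueError(f"not a link line: {line!r}")
    direction, num_juncs, counts, juncs = cols[:4]
    link = Link(direction, int(num_juncs),
                [int(c) for c in counts.split(",")], juncs)
    if len(link.juncs) != link.num_juncs:
        raise ValueError(f"junction count mismatch: {line!r}")
    # Everything after column 4 is optional, e.g. seq=... and juncpos=...
    for col in cols[4:]:
        key, _, value = col.partition("=")
        link.extras[key] = value
    return link


link = parse_link_line("R 2 30 TG seq=GCTT juncpos=5,14")
print(link.juncs, sorted(link.extras))
```

Only `direction`, `num_juncs`, `counts` and `juncs` come from the required columns; `seq` and `juncpos` end up in `extras`.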

noporpoise commented 6 years ago

Are you threading through a cleaned de Bruijn graph? What species and sequencing platform are you using?

seq= is the sequence described by the link; it can be calculated from the kmer graph plus the junction choices in column 4. It is currently required by the links command when cleaning links. Cleaning links reduces the file size, and you can then strip out the optional columns to reduce the final file size further. If I were to rewrite several bits of McCortex we could avoid adding seq= to links files, but it would be a lot of work for a small payoff in disk space. I can't imagine many people are generating links that they are not then going to clean. Cleaning is important!
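Stripping the optional columns after cleaning could look something like the sketch below. It assumes the plain-text layout shown in the file's comment lines (link lines start with F or R and columns are separated by a single space) and passes every other line through unchanged. `strip_optional_columns` is a hypothetical helper, not a McCortex command, and it should only be run on links that have already been cleaned, since cleaning needs seq=.

```python
def strip_optional_columns(lines):
    """Yield links-file lines, keeping only columns 1-4 of each link line."""
    for line in lines:
        text = line.rstrip("\n")
        if text.startswith(("F ", "R ")):
            # Link line: [FR] [num_juncs] [counts] [juncs] [optional ...]
            text = " ".join(text.split(" ")[:4])
        # Kmer lines, comments and anything else are left untouched
        yield text + "\n"


example = [
    "# a comment line\n",
    "CAAAGCAGCC 2\n",
    "R 1 4 A seq=GCTTAAGA juncpos=5\n",
    "R 2 30 TG seq=GCTTAAGAAT juncpos=5,14\n",
]
for out_line in strip_optional_columns(example):
    print(out_line, end="")
```

If the links file is gzip-compressed, the same generator can be fed from `gzip.open(path, "rt")` and its output written with `gzip.open(out_path, "wt")`.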

Also worth noting: increasing the kmer size exponentially decreases the number of links.

winni2k commented 6 years ago

Right. Thanks.