medvedevgroup / TwoPaCo

A fast constructor of the compressed de Bruijn graph from many genomes
Other
40 stars 10 forks source link

Corrupt input when using graphdump on TwoPaCo output #9

Closed rob-p closed 7 years ago

rob-p commented 7 years ago

Hi guys,

First of all; awesome work on TwoPaCo --- the method and the software are both fantastically useful, and I'm really excited to start using it for some downstream applications we're working on. I'm running into the following issue. I used TwoPaCo to build the compacted dBG for the human transcriptome (the following command):

twopaco  -k 31 -t 8 -f 32 gencode.v25.pc_transcripts.fa

As the name suggests, the reference is protein coding human transcripts from gencode v25. This seems to work fine, and I get the following output from TwoPaCo:

Threads = 8
Vertex length = 31
Hash functions = 5
Filter size = 4294967296
Capacity = 2
Files:
/mnt/scratch6/avi/data/txptome/gencode.v25.pc_transcripts.fa
--------------------------------------------------------------------------------
Round 0, 0:4294967296
Pass    Filling Filtering
1       9       16
2       3       1
True junctions count = 358144
False junctions count = 56685
Hash table size = 414829
Candidate marks count = 2551662
--------------------------------------------------------------------------------
Reallocating bifurcations time: 0
True marks count: 2540661
Edges construction time: 6
--------------------------------------------------------------------------------
Distinct junctions = 358144

Now, I want to convert this output to a GFA format (I tried both GFA1 and 2 and get the same error in each case). I used the following command:

graphdump -k 31 -s gencode.v25.pc_transcripts.fa -f gfa1 de_bruijn.bin > gencode.twopaco.gfa1

This results in the following error message:

error: The input is corrupted

At this point, some output has been generated, but I presume it's not complete because, despite the fact that there are ~96k input transcripts, I only get 35,451 output paths (i.e., P) entries in the resulting GFA file. Any idea what might be causing this issue or how to fix it?

Thanks! Rob

iminkin commented 7 years ago

Hi Rob, glad that you find TwoPaCo useful. I will take a look shortly and let you know.

iminkin commented 7 years ago

Are there any sequences shorther than 31 in the input file?

rob-p commented 7 years ago

Hi Ilyan,

There are 11 such sequences in this file. I understand that, clearly, they cannot be represented in the 31-mer dBG. Do I need to remove them before feeding the file to TwoPaCo? I'll check to see if this resolves the graphdump issue.

Thanks, Rob

rob-p commented 7 years ago

@ilyaminkin,

Indeed, this resolves the issue, and graphdump runs to completion now without error. Thanks for pointing this out. I should have checked for such sequences before processing the file. Might I make the suggestion that TwoPaCo emit a warning when it encounters such records, to notify the user that they might cause issues?

Thanks! Rob

iminkin commented 7 years ago

@rob-p Not handling those sequences properly is clearly a bug on TwoPaCo's side. Thank you for reporting it. I am not sure about the proper program's behaviour in that case. Probably, ignoring those and showing a warning to the user, as you suggested.