mahulchak / quickmerge

A simple and fast metassembler and assembly gap filler designed for long molecule based assemblies.
GNU General Public License v3.0

some issues #1

Closed ctxchris closed 8 years ago

ctxchris commented 9 years ago

Hi,

some things I noticed while trying quickmerge:

make_merger.sh has the wrong compilation instructions. It should be "g++ -Wall -o quickmerge quickmerge.cpp qmergelib.cpp -I." instead of "g++ -Wall work_in_prog_temp.cpp exp_testlib.cpp -o merger".

MUMmer compilation might fail because the folder aux_bin isn't created.

Running the quickmerge wrapper just prints all the scaffolds and contigs to stdout. The headers are printed twice, then the sequence itself.

Chris

jgbaldwinbrown commented 9 years ago

Hi Chris,

Thanks very much for the bug reports. I think they should all be taken care of now.

make_merger.sh was outdated. It has now been updated to simply call 'make' in the 'merger' directory.

The MUMmer directory now contains the 'aux_bin' directory, so compilation should proceed smoothly.

merge_wrapper.py has also been updated. I removed two lines of debug code that printed the names of all input scaffolds, so there should be less unnecessary output to stdout. Apart from that, the quickmerge wrapper functions fine in my hands.

I was not able to replicate the error that you mentioned wherein the input fasta sequence is printed after the headers. Please try the new version, and send me your code if you still have this problem.

Did you mean to imply that the final output ("merged.fasta") was not created? If so, please send me your code so I can attempt to replicate it. The program seems to work fine on my end.

-Jim

ctxchris commented 9 years ago

Hi Jim,

thanks for the quick fix. Everything's working fine now. I was referring to the two debug lines that print the "oneline" version of the FASTAs, not the final output.

Chris

Update: nucmer and delta-filter ran successfully and the files "aln_summary.tsv", "summaryOut.txt" and "anchor_summary.txt" were created. The file "merged.fasta", however, is empty, and quickmerge crashed with: segfault at 0 ip 000000000040d203 sp 00007ffe50902880 error 4 in quickmerge[400000+24000]

mahulchak commented 9 years ago

Hi Chris, Thanks for reporting the problem. Could you please check the following things for me:

  1. Is there any whitespace in your fasta sequence names?
  2. Does either of the two fasta files have line breaks in the sequences, e.g. are the sequences in lines of 60 or 80 characters? The python wrapper is supposed to take care of this, but I wanted to rule it out before trying other things.

If the answer to both of the above is no, then try to run quickmerge manually:

quickmerge -d out.rq.delta -q query-file.fasta -r ref.fasta -hco 5.0 -c 1.5 -l any-number

The reference file is the pb only fasta (or the fasta that you used as reference in the mummer run) and the query file is the hybrid (or the fasta that was used as query in the mummer run). Let me know how it goes.

Mahul

mahulchak commented 9 years ago

Also, could you please check whether any of your fasta files has a sequence named ">ctg7180000002162"? (Typically pbcr/celera assembly fasta files have such sequence names.) Basically, you can run

grep 2162 foo.fasta

and see if anything shows up. I also forgot to ask: does quickmerge print anything (like a chain of seq names) to stdout before it crashes?
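The three checks requested above (whitespace in sequence names, wrapped sequence lines, and a name containing a given string) can be done in one pass over a FASTA file. The following is only a minimal sketch: the function name `check_fasta` and its return keys are hypothetical and are not part of quickmerge or its wrapper.

```python
def check_fasta(path, name_fragment=None):
    """Scan a FASTA file and report the conditions asked about in this thread.

    Hypothetical helper, not part of quickmerge.
    """
    headers_with_whitespace = []  # names containing a space or tab
    wrapped_records = []          # records whose sequence spans more than one line
    matching_names = []           # names containing name_fragment (like grep)
    current = None                # header of the record being read
    seq_lines = 0                 # non-empty sequence lines seen for that record

    with open(path) as handle:
        for raw in handle:
            line = raw.rstrip("\r\n")
            if line.startswith(">"):
                current = line[1:]
                seq_lines = 0
                if any(ch.isspace() for ch in current):
                    headers_with_whitespace.append(current)
                if name_fragment and name_fragment in current:
                    matching_names.append(current)
            elif line.strip() and current is not None:
                seq_lines += 1
                if seq_lines == 2:  # a second sequence line means the record is wrapped
                    wrapped_records.append(current)

    return {
        "headers_with_whitespace": headers_with_whitespace,
        "wrapped_records": wrapped_records,
        "matching_names": matching_names,
    }
```

For example, `check_fasta("foo.fasta", "2162")` would list every sequence name containing "2162", much like the grep command above, along with any names or records that need cleaning up.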

ctxchris commented 9 years ago

One file had line breaks in the sequence, the other had whitespace in the headers. I fixed both and ran nucmer, delta-filter and quickmerge again, but I still get a segfault. The contigs of both files are named ctg_X and contig_X, with X being the contig count. That's why "...2162..." appears several times in the headers. The respective contigs also appear in the summary files. I noticed that the last entry of "aln_summary.tsv", "summaryOut.txt" and "anchor_summary.txt" is always contig_999999. quickmerge seems to crash while writing anchor_summary.txt. When I count the unique number of contigs for the reference and the query, aln_summary.tsv and summaryOut.txt contain more entries than anchor_summary.txt. "merged.fasta" is not being written. I don't see anything written to stdout.
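The thread does not show how the two files were fixed. One possible way to do it is sketched below, under the assumption that keeping only the first whitespace-delimited word of each header is acceptable (the aligner identifies sequences by that word). The function name `normalize_fasta` is hypothetical.

```python
def normalize_fasta(src, dst):
    """Write src to dst with one-word headers and each sequence on a single line.

    Hypothetical helper; returns the number of records written.
    """
    records = 0
    with open(src) as fin, open(dst, "w") as fout:
        name, chunks = None, []
        for raw in fin:
            line = raw.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                if name is not None:
                    # flush the previous record as header + one sequence line
                    fout.write(f">{name}\n{''.join(chunks)}\n")
                    records += 1
                fields = line[1:].split()
                name = fields[0] if fields else ""  # assumption: first word is the ID
                chunks = []
            elif name is not None:
                chunks.append(line)
        if name is not None:
            fout.write(f">{name}\n{''.join(chunks)}\n")
            records += 1
    return records
```

Truncating headers can produce duplicate names if two headers share the same first word, so it is worth checking the output for duplicates before rerunning nucmer, delta-filter and quickmerge on the cleaned files.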

mahulchak commented 9 years ago

That's interesting. Would you be able to share your fasta files? I can try to find the source of the issue. Thank you. Mahul

Sent by email in reply to Christian Dreischer's comment of Sun, Oct 25, 2015, 10:33 AM.

ctxchris commented 9 years ago

Unfortunately I can't share the fasta files. I ran quickmerge on a subset of the data and got the following printed to stdout before the coredump:

ctg1049 ctg1049 1 ctg7580 -1
ctg106 ctg106 1 ctg7954 -1
ctg1067 ctg7102 1 ctg1067 -1
ctg1112 ctg468 1 ctg1112 1
ctg1114 ctg3334 1 ctg4 -1 ctg1114 1 ctg377 1 ctg13 1
ctg1116 ctg1669 1 ctg1116 -1
ctg1120 ctg81 1 ctg1120 -1 ctg2550 -1
ctg1123 ctg1506 1 ctg1123 1
ctg1126 ctg1126 1 ctg1984 1
ctg1132 ctg1132 1 ctg2047 -1 ctg5089 1
ctg1135 ctg513 1 ctg1135 -1
ctg1136 ctg1136 1 ctg3296 -1
ctg1137 ctg6991 1 ctg1137 1
ctg1138 ctg2751 1 ctg2236 1 ctg1138 1 ctg1820 -1
ctg1145 ctg1145 1 ctg663 1
ctg1147 ctg1147 1 ctg224 1
ctg1150 ctg6935 1 ctg1150 -1

These are the last couple of lines and the error message:

ctg97 ctg1653 1 ctg97 -1
ctg99 ctg4536 1 ctg99 -1
ctg992 ctg992 1 ctg7535 1
terminate called after throwing an instance of 'std::out_of_range'
what(): basic_string::substr

Chris

mahulchak commented 9 years ago

Hi Chris,

Without the fasta files, I can only suggest a couple of things.

CURRENT ERROR

  1. The error you got with your subset fasta files is different from the earlier error. Here are things that could have caused the coredump with the subset fasta files:

     i) Your fasta files used for running nucmer and delta-filter are different from those used for quickmerge, e.g. if you use your original fasta files to run nucmer and then run quickmerge with the subset fasta files, you'll get a coredump. Please check that you are using the same fasta files for everything.

     ii) Please check that your quickmerge command has the query and reference fasta files listed in the correct order, e.g. if you ran nucmer like

         nucmer foo1.fasta foo2.fasta

     you should run quickmerge as

         quickmerge -d out.rq.delta -q foo2.fasta -r foo1.fasta -hco 5.0 -c 1.5 -l 100000

     and not

         quickmerge -d out.rq.delta -r foo1.fasta -q foo2.fasta -hco 5.0 -c 1.5 -l 100000

     iii) I would have suggested removing whitespace from fasta names and line breaks from fasta sequences, but I guess you have already taken care of that.
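Points i) and ii) can be checked mechanically, because each alignment record in a nucmer delta file starts with a header line of the form `>ref_name query_name ref_length query_length`. The sketch below compares those names and lengths with the FASTA files that are about to be passed to quickmerge. It is an illustration only: the helper names are hypothetical, and it says nothing about what quickmerge itself does internally.

```python
def fasta_lengths(path):
    """Map each sequence name (first word of the header) to its length."""
    lengths = {}
    name = None
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if line.startswith(">"):
                fields = line[1:].split()
                name = fields[0] if fields else ""
                lengths[name] = 0
            elif line and name is not None:
                lengths[name] += len(line)
    return lengths


def delta_pairs(path):
    """Yield (ref_name, query_name, ref_len, query_len) from delta record headers."""
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if len(fields) == 4:
                    yield fields[0], fields[1], int(fields[2]), int(fields[3])


def check_inputs(delta_path, ref_fasta, query_fasta):
    """Return a list of mismatches between the delta file and the FASTA files."""
    ref = fasta_lengths(ref_fasta)
    qry = fasta_lengths(query_fasta)
    problems = []
    for r_name, q_name, r_len, q_len in delta_pairs(delta_path):
        for role, name, expected, table in (
            ("reference", r_name, r_len, ref),
            ("query", q_name, q_len, qry),
        ):
            found = table.get(name)  # None when the name is missing from the FASTA
            if found != expected:
                msg = f"{role} {name}: delta says {expected}, FASTA has {found}"
                if msg not in problems:
                    problems.append(msg)
    return problems
```

An empty list from `check_inputs("out.rq.delta", "foo1.fasta", "foo2.fasta")` suggests the FASTA files match the alignment. Missing names or differing lengths point to a different file set (such as the subset versus the original data) or to swapped reference and query files.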

REGARDING THE PREVIOUS ERROR:

i) How big is your genome and how much memory does your machine have? It is possible that your genome is too big for your memory (ideally you'll need memory > 2 * genome size). If you are on a Linux machine, you can use /usr/bin/time -v to find the peak memory usage of quickmerge.

ii) Would you be able to recompile quickmerge with the -g flag so that you can run gdb? Once you do that, you can run quickmerge with the original dataset in gdb, and gdb will then generate the debug info.
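The rule of thumb in i) can be turned into a rough estimate before running anything. The sketch below makes two assumptions that the thread does not confirm: "genome size" is taken as the total number of bases in the FASTA file(s) given, and one base is counted as one byte. The function names are hypothetical.

```python
import os


def total_bases(path):
    """Count sequence characters in a FASTA file, ignoring header lines."""
    total = 0
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if line and not line.startswith(">"):
                total += len(line)
    return total


def memory_check(fasta_paths, factor=2):
    """Return (estimated_bytes_needed, physical_bytes_or_None).

    Assumption: 1 byte per base and memory > factor * genome size.
    """
    needed = factor * sum(total_bases(p) for p in fasta_paths)
    try:
        # available on Linux and most Unix systems
        physical = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        physical = None  # platform does not expose these values
    return needed, physical
```

If the first value is close to or above the second, memory is a plausible cause of the crash. The peak usage reported by /usr/bin/time -v remains the more reliable measurement.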