mahulchak / quickmerge

A simple and fast metassembler and assembly gap filler designed for long molecule based assemblies.
GNU General Public License v3.0
192 stars 31 forks source link

merged fasta is empty - where to look to troubleshoot? #50

Closed devonorourke closed 4 years ago

devonorourke commented 4 years ago

Hi Mahul,

Thanks for writing and maintaining a terrific program. I've used your software successfully on a previous genome by merging a pair of 10x and Nanopore assemblies. I thought I'd give it a shot on a different genome (from a different organism), but this time the output of quickmerge is an empty fasta file. It appears that the program has run without any errors, so I wasn't sure where to start troubleshooting (or how to interpret an empty fasta file).

I ran the following code using the wrapper:

QMERGEPY=/path/to/merge_wrapper.py
$QMERGEPY -pre mrgd nanopore.fasta hic.scaffolds.fasta

The following files were present in the directory where quickmerge was executed (after the job had completed):

-rw-r--r-- 1 dro49 cluster 705K Dec 16 22:51 anchor_summary_mrgd.txt
-rw-r--r-- 1 dro49 cluster 2.8M Dec 16 22:50 param_summary_mrgd.txt
-rw-r--r-- 1 dro49 cluster  15M Dec 16 22:50 aln_summary_mrgd.tsv
-rw-r--r-- 1 dro49 cluster    0 Dec 16 22:50 merged_mrgd.fasta
-rw-r--r-- 1 dro49 cluster  60M Dec 16 22:50 hic.rq.delta
-rw-r--r-- 1 dro49 cluster 1.1K Dec 16 22:50 qmerge.log
-rw-r--r-- 1 dro49 cluster 159M Dec 16 22:49 hic.delta
-rw-r--r-- 1 dro49 cluster 1.9G Dec 16 18:10 self_oneline.fa
-rw-r--r-- 1 dro49 cluster 1.9G Dec 16 18:08 nanopore.fasta
-rw-r--r-- 1 dro49 cluster 1.9G Dec 16 17:44 hic.scaffolds.fasta

Happy to share any further into that is useful for troubleshooting. Appreciate your help and insights,

Devon

mahulchak commented 4 years ago

@jgbaldwinbrown would you like to take a look at this? @devonorourke did you try running quickmerge manually? does that work?

devonorourke commented 4 years ago

Thanks for the quick reply @mahulchak. In looking through the log file of the initial run, the default parameters in the wrapper set the length cutoff for contig anchors to zero - is that expected? The documentation on the home page suggests choosing a value that reflects something close to the N50 of the self assembly (which, I believe, in my case reflects the N50 of the hic.scaffolds.fasta file, correct?).

This was the output from that initial .log file:

0       quickmerge
1       -d
2       mylu_HiCref.rq.delta
3       -q
4       LUr3p2.fasta
5       -r
6       self_oneline.fa
7       -hco
8       5.0
9       -c
10      1.5
11      -l
12      0
13      -ml
14      5000
15      -p
16      mylu_HiCref

I was curious if that N50 estimation, and therefore -l parameter value, should reflect the scaffold or contig N50 value of the assembly in question?

This is a very large difference in the case of the hic.scaffolds.fasta file: the scaffold N50 is over 95,000,000; the contig N50, however, is just over 64,000. For what it's worth, the N50 of the query fasta (LUr3p2.fasta) is just under 300,000. Of note, the query fasta are contigs, not scaffolds, while the reference fasta is a set of scaffolds, not contigs.

Thanks again for the help - I'm going to run quickmerge manually now with the original .delta file and see what happens using an -l value of 100000. I'd imagine something different than the previous value (0) should be produced 😕 ?

devonorourke commented 4 years ago

Minor update: The run no longer finishes, and instead produces a small error message: Segmentation fault. The summary .txt files are produced, but no merged fasta file is generated. Could there be something to the naming of the fasta headers that is causing this issue?

jgbaldwinbrown commented 4 years ago

Hi all, We set -l to 0 by default under the assumption that people would follow the documentation's instructions and set -l to a value particular to the input assemblies. I'm making -l into a required option in the wrapper to avoid this problem in the future. Should be fixed now. Mahul, let me know if you think there's a reason to have it the other way.

mahulchak commented 4 years ago

Thanks Jim. I think your changes are appropriate.

@devonorourke yes please look at the headers. if it still doesn't work (and if it is not an out of memory issue) then please let us know.

On Tue, Dec 17, 2019 at 10:17 PM James Baldwin-Brown < notifications@github.com> wrote:

Hi all, We set -l to 0 by default under the assumption that people would follow the documentation's instructions and set -l to a value particular to the input assemblies. I'm making -l into a required option in the wrapper to avoid this problem in the future. Should be fixed now. Mahul, let me know if you think there's a reason to have it the other way.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mahulchak/quickmerge/issues/50?email_source=notifications&email_token=ABZQH2AKBMJHTKXJNOHH7VLQZG535A5CNFSM4J3ZZCY2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEHE7RZA#issuecomment-566884580, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZQH2ECCAXFMPFIZNRJSS3QZG535ANCNFSM4J3ZZCYQ .

-- Mahul Chakraborty Department of Ecology and Evolutionary Biology University of California-Irvine Phone: 949 824 9559 Fax: 949 824 9559 Website: https://mahulchakraborty.wordpress.com/ Github: https://github.com/mahulchak

devonorourke commented 4 years ago

Hi all, I made a single change: removing instances of a second underscore in the headers of the fasta files. Note that the two input files for merging did not share any similar names, yet one of the two fasta files did contain a header name like so:

>contig_HiC_1
aaaaaaaaaa
>contig_HiC_2
cccccccccc

While the documentation appears to suggest that you can't have any whitespace in the header names (and these didn't so far as I could tell), it looks like by eliminating that second underscore in the header name the program ran just fine. The new headers look like this:

>contigHiC_1
aaaaaaaaaa
>contigHiC_2
cccccccccc

many thanks