lh3 / fermi

A WGS de novo assembler based on the FMD-index for large genomes

fermi hangs on a very small dataset #4

Open ctSkennerton opened 11 years ago

ctSkennerton commented 11 years ago

I ran fermi on a very small dataset containing 22 FASTA records, using the following command:

run-fermi.pl -k 200 -p cdhitout_0.85 <reads.fa>  | make -f -

However, fermi hangs indefinitely. In top, I can see that fermi ropebwt is constantly in the sleep state:

45288 uqcskenn  20   0 24188  740  584 S    3  0.0   1:08.84 fermi ropebwt -a bcr -v3 -btf cdhitout_0.85.ec.tmp -                                                                                         
45447 uqcskenn  20   0 24188  740  584 S    2  0.0   1:08.00 fermi ropebwt -a bcr -v3 -btf cdhitout_0.90.ec.tmp - 

I've tried both the git HEAD and release 1.1.

<reads.fa> contains:

>M00920:10:000000000-A292A:1:1101:2305:13136:1
CTTCTGGTGAAACCCACTCCCATGGTGTGACGGGCGGTGTGTACAAGACCCGGGAACGTATTCACCGCGACATGCTGATCCGCGATTACTAGCGATTCCGACTTCACGCAGTCGAGTTGCAGACTGCGATCCGGACTACGATCGGCTTTGTGAGATTCGCTCCGCCTCGCGGCTTGGCAACCCTCTGTACCGACCATTGTATGACGTGTGAAGCCCTACCCATAAGGGCCATGAGGACTTGACGTCATCCCCACCTTCCTCCGGTTTGTCACCGGCAGTCTCGTTAAAGTGCCCAACCAAATGATGGCAATTAACGACAAGGGTTGCGCTCGTTGCGGGACTTAACCCAACAT
>M00920:10:000000000-A292A:1:1101:24216:16298:1
CCCTTATCCTTAGTTACCAGCACCTCGGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGAGGAGGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCAGGGCTACACACGTGCTACAATGGTCGGTACAAAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCATAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGTGAATCAGAATGTCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCATCACACCATGGGAGTGGGTTGCTCCAGAAGTAGCTAGTCTAACCGCAAGGGGGACGGTTACCA
>M00920:10:000000000-A292A:1:1110:4340:7240:1
CAGATTGAACGCTGGCGGCATGCTTTACACATGCAAGTCGAACGGCAGCGGGGGCTTCGGCCCGCCGGCGAGTGGCGAACGGGTGAGTAATGCATCGGAACGTACCCATGTTGTGGGGGATAACGTAGCGAAAGCTACGCTAATACCGCATAAGCCCTGAGGGGGAAAGCGGGGGATTCTTCGGAACCTCGCGCAATTGGAGCGGCCGATGTCAGATTAGCTAGTTGGTAGGGTAAAGGCCTACCAAGGCGACGATCTGTAGCGGGTCTGAGAGGATGATCCGCCACACTGGGACTGAGACACGGCCCGGACTCCTCCGGGAGGCAGCAGTGGGGAATTTTGGACAATGGGCGCAAGGGTGATC
>M00920:10:000000000-A292A:1:1110:21042:16009:1
ACCCAGGGGGCTGCCTTCGCCATCGGTGTTCCTCCACATCTCTACGCATTTCACTGCTACACGTGGAATTCCACCCCCCTCTGCCACACTCGAGCCTTGCAGTCACAAACGCATTTCCCAGGTTAAGCCCGGGGATTTCACATCTGTCTTACAAAGCCGCCTGCGCACGCTTTACGCCCAGTAATTCCGATTAACGCTCGCACCCTACGTATTACCGCGGCTGCTGGCACGTAGTTAGCCGGTGCTTGTTCTTCAGTTCCCGTCATTGACAGTCTATGTTAGACCCCGCCGTTTCGTTCCTGCCGAAAGAGCTTTACAACCCGAAGGCCTTCTTCACTCACGCGGAATGGCTGGATCAGGGT
>M00920:10:000000000-A292A:1:1101:19922:4365:1
ATCTAATCCTGTTTGCTCCCCACGCTTTCGTGCATGAGCGACAGACCAGGTCCAGGGGGCTGCCTTCGCCTTCGATGTTCCTCCTGATATCTACGTATTTCACTGCTACACCCGGATTTCCACCCCCCTCTACCGCACTCTAGGCACACAGTCACAAACGCATTTCCCAGGTTAAGCCCGGGGGTTTCAAATCTGAATTATTTAACCGCCTGCGCACGCTTTACGCCCAGTAATTCCGATTAACGCTCGCACCCTCGGTATGACCGCGACTGCCAGCGGGTAGGAAGGCGGTACTTTTTATTCCGGTGCCGACATCCTCCCCGGATATTCACCGCGGCTATTTCTTTCCGTCCGACAGAGGTGTAAAACCCGAAGGCGAGCTTG
>M00920:10:000000000-A292A:1:1101:18095:13295:1
GGAGGCAGCAGTGGGGAATTTTGGACAATGGGCGGAAGCCTGATCCAGCCATGCCGCGTGAGTGAAGAAGGCCTTCGGGTTGTAAAGCTCTTTCGGTGGGGAAGAAATTGCACGGGTTAATACCCTGTGTAGATGACGGTACCCGACTAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTGGTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGAGACTGCCAAGCTGGAGTGTGGCAGAGGGGGGTGGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATCAGGAG
>M00920:10:000000000-A292A:1:2102:3086:14182:1
GTAGTGACCCAGGGGGCTGCCTTCGCCATCGGTGTTCCTCCACATCTCTACGCATTTCACTGCTACACGTGGAATTCCACCCCCCTCTGCCACACTCCAGCCTGGCAGTCTCAAATGCAGTTCCCAGGTTGAGCCCGGGGCTTTCACATCTGACTTACCAAACCGCCTGCGCACGCTTTACGCCCAGTAATTCCGATTAACGCTCGCACCCTACGTATTAACGCGGCTGCTGGCACGTAGTTCGCCGGTGCTTCTTAGTCGGGTACCGTCATCTACACAGGATATTAGCCCGTGCAATTTCTTCCCCACCGAAAGAGCTTTACAACCCGAAGGCCTTCTTCACTCACGCGGCATGGCTGGATCAGGCTTCCGCCC
>M00920:10:000000000-A292A:1:2108:13711:22806:1
GATTAAACGCTGGCGGCATGCCTTACACATGCAAGTCGAACGGCAGCACGGGGGCAACCCTGGTGGCGAGTGGTGGACGGGTGAGTAAAGCATCGGAACGTATCCTGAAGTGGAGTATAACGTAGCGAAAGTTACGCTAATACCGCATAGTCTGTGAGCAGGAAAGCAGGGGATCGCAAGACCTTGCGCTCTGGGAGCGGCCGATGTCGGATTAGCTAGTTGGGGGGGTAAAGGCCTACCAAGGCGCGGCTCCGTAGCGGGGATTGGAGTATGAAACGCCACACTGTGACTGAGAAACGGCCCGGACTCCTACGTGAGGAAGCAGCGGTGAATTTTTTCCAATGGGTTCAAGCC
>M00920:10:000000000-A292A:1:2110:11377:9313:1
GCATCGGAACGTGCCCTGGAATGGGGGATAACGTAGCGAAAGTTACGCTAATACCGCATATTCTGTGAGCAGGAAAGCAGGGGATCGCAAGACCTTGCGTTCTGGGATCGGCCGATGTCGTATGAGCTAGTTGGTGGGGAAAAGGCCTACCACGGCGACGATCCGTAGCGGGTCTGAGAGGATGATCCGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCCGTGGGGAATTTTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGAGTGAAGAAGGCCTTCGGGTTGTAAAGCTCTTTCGGTGGGGAAGAAATTGCATGGGTTAATTCCC
>M00920:10:000000000-A292A:1:1105:17264:25408:1
GAATTACTGGGCGTAAAGCGTGCGCAGGCGGCGCCATAAGACAGACGTGAAATCCCCGGGCTTAACCTGGGAACTGCGTTTGTGACTGTGGTGCTCGAGTGTGGCAGAGGGGGGTGGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATGTGGAGGAACACCGATGGCGAAGGCAGCCCCCTGGGTCAACACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCCTAAACGATGCGAACTAGGTGTTGGGGAAGGAGACGTTCTTAGTACCGCAGCTAACGCGTGAAGTTCGCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATGGACA
>M00920:10:000000000-A292A:1:2105:19316:26848:1
ATCCGTAGCTGGTCTGAGAGGACGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATTTTGGACAATGGGGGCAACCCTGATCCAGCCATTCCGCGTGAGTGAAGAAGGCCTTCGGGTTGTAAAGCTCTTTCAGCAGGAACGAAACGGCTCTCTCTAACATAGGGAGTTAATGACGGTACCTGAAGAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCACAGGCGGCGCCATAAGACAGATGTGAAATCCCCGGGCTTAACCTGGGAAC
>M00920:10:000000000-A292A:1:1111:13173:15398:1
TGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACAGAACTTGCCAGAGATGGCTTGGTGCCTTCGGGAACTGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCACCGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTTCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGCTGAAGTCAAGTCATCATGGCCCTTATGGGTAGGGCGTCACACGTCATACAATGGTCGGAACAGAGGGTTGCCAAGCCGCGAGGTGGAGCCAATCCCAGAAAACCGATCGTAGTCCGGATCGC
>M00920:10:000000000-A292A:1:1102:8010:26367:1
GCCTTACACATGCAAGTCGAACGGCAGCGGAACTTCGGGTGCCGGCGAGTGGCGAACGGGTGAGTAATGCATCGGAACGTGCCATTGAGTGGGGGATAACGTAGCGAAAGTTGCGCTAATACCGCATATTCTGTGAGCAGGAAAGCAGGGGACCGCAAGGCCTTGCGCTCTTTGAGCGGCCGATGTCAGATTAGCTAGTTGGTGAGGTAAAGGCTTACCAAGGCGACGATCTGTAGCGGGTCTGAGAGGATGATCCGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATTTTGGACAATGGGGGCAACCCTGATCCAGCCATGCCGCGTGAGTGAAGAAGGCCTTCGGGT
>M00920:10:000000000-A292A:1:1106:8344:21464:1
GTTCCTACCATTGTAGCACGTGTGTAGCCCTGGGCATAAAGGCCATGATGACTTGACATCATCCCCTCCTTCCTCGCGTCTTACGACGGCAGTTTCTTTAGAGTTCCCAGCTTAACCTGTTGGCAACTAAAGATAGGGGTTGCGCTCGTTGCGGGACTTAACCCAACACCTCACGGCACGAGCTGACGACAGCCATGCAGCACCTGTGTGACGGCTCCCTTTCGGGCACCCTCAACTCTCATCGAGGTTCCGTCCATGTCAAGGGTAGGTAAGGTTTTTCGCGTTGCATCGAATTAATCCACATCATCCACCGCTTGTGCGGGTCCCCGTCAATTCCTTTGAGTTTTAATC
>M00920:10:000000000-A292A:1:1109:11262:3539:1
TTTACCCACCCAACACCTAGTTGACATAGTTTAGGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTACCCACGCTTTCGTGCATGAGCGTCAGTATCGGCCCAGGGGGCTGCCTTCGCCATAGGTGTTCCTCCCCATCTCTACGCTTTTCACTGCTACACGTGGAATTCCACCCCCCTCTGCCGTACTCTAGTGAGGCAGTCACAAACGCAGTTCCCAGGTTACGCCCGGGGATTTCACGCCTGTCTTACCAATCCGCCTGCGCACGCTTTACGCCCAGTAATTCCGATTAACGCTCGCACCCTACGTATTACCGCGGCTGCTGGCACGTAGTTAGCCGGTGCTTCTTATGCCGGTACCG
>M00920:10:000000000-A292A:1:1113:21063:11515:1
ACACAGGGTATTAACCCATGCGATTTCTTCCCGGCCGAAAGAGCTTTACAACCCGAAGGCCTTCTTCACTCACGCGGCATGGCTGGATCAGGGTTGCCCCCATTGTCCAAAATTCCCCACTGCTGCCTCCCGGAGGAGTCTGGCCCGTGTCTCAGTTCCAGTGTGGCGGATCATCCTCTCAGACCCGCTCCAGATCGTCGCCTTGGTAAGCCGTTACCTCACCAACTAGCTAATCTGACATAGGCCGCTCAAAGAGCGCAAGGCCTTGCGGTCCCCTGCTTTCCTGCTCACAGAATATGCGGTATTAGCGCAACTTTCGCTACGTTATCCCCCACTCAATGGCACGTTCCGATGCATTACTCACC
>M00920:10:000000000-A292A:1:2109:18065:11577:1
CCTTTGTATTGTCCATTGTAGCACGTGTGTAGCCCAAATCATAAGGGGCATGATGATTTGACGTCATCCCCACCTTCCTCCGGTTTGTCACCGGCAGTCAACTTAGAGTGCCCAACTTAATGATGGCAACTAAGCTTAAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAGCACCAGTGTGACGGCTCCCTTTCGGGCACCCTCAACTCTCATCGAGGTTCCGTCCATGTCAAGGGTAGGTAAGGTTTTTCGCGTTGCATCGAATTAATCCACATCATCCACCGCTTGTGCGGGTCCCCGTCAATTCCTTTGAGTTTTAATC
>M00920:10:000000000-A292A:1:2113:10809:18271:1
GTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGGACCCGCACAAGCGGTGGATGATGTGGATTAATTCGATGCAACGCGAAAAACCTCACCTACCCTTGACATGGACGGAACCTCGATGAGAGTTGAGGGTGCCCGAAAGGGAGCCGTCACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTAAGCTTAGTTGCCATCATTAAGTTGGGCACTCTAAGTTGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGATTTGGGCTACACACGTGCTACAA
>M00920:10:000000000-A292A:1:2101:18998:6292:1
GTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACAGAACTTAGCAGAGATGCTTTGGTGCCTTCGGGAACTGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAAGGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTATGGGTAGGGCTTCACACGTCATACAATGGTCGGAACAGAGGGTTGCCAAGCCGCGAGGTGGAGCCAATCCCAGAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGAC
>M00920:10:000000000-A292A:1:2108:17778:22051:1
ATCCACAGAACTTAGCAGAGATGCTTTGGTGCCTTCGGGAACTGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGTAACGAGCGCAACCCTTGTCCTTAGTTACCAGCACCTCGGGTGGGCACTCTAAGGAGACTGCCGGTGACAAACCGGGGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCAGGGCTACACACGTGCTACAATGGTCGGTACAAAGGGTTGCCAAGCCGCGAGGTGGAGCTAATCCCATAAAACCGATCGTAGTCCGGATCGCAGTCTGCAACTCGACTGCGTGAAGTCGGAATCGCTAGTAATCGTGAATC
>M00920:10:000000000-A292A:1:1104:5131:15907:1
GTACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCCTAAACGATGTCGACTAGTCGTTCGGAGCAGCAATGCACTGAGTGACGCAGCTAACGCGTGAAGTCGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAAGGAATTGACGGGGACCCGCACAAGCGGTGGATGATGTGGATTAATTCGATGCAACGCGAAAAACCTTACCTACCCTTGACATGTCTGGAGCCTTGGTGAGAGCCGAGGGTGCCTTCGGGAGCCAGAACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGT
>M00920:10:000000000-A292A:1:1113:7839:16644:1
CGTTTAGGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGTGCATGAGCGTCAGTACAGGCCCAGGGGGCTGCCTTCGCCATCGGTGTTCCTCCTGATCTCTACGCATTTCACTGCTACACCAGGAATTCCACACACTTCTGCCGTACTCTAGCCTTGCAGTCACAAACGCAGTTCCCAGGTTAAGCCCGGGGATTTCACATCTGTCTTACAAAAACGCCTCCGCACGCTTTACGCCCAGTAATTCCGATTAACGCTCGCACCCTACGTTTTACCGCGGCTGCTGGCACGTTTTTAGCCGGTGCTTCTTAGTCCGGTACCGTCATCCATGGCCTATGTTAGAGAC
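
As a quick sanity check on the input, the number of FASTA records can be counted from the header lines. This is a minimal sketch; count_fasta_records is a hypothetical helper name and not part of fermi, and the file name reads.fa is assumed.

```shell
# Hypothetical helper (not part of fermi): count FASTA records by counting
# header lines, i.e. lines that start with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Example (assumes the reads are saved as reads.fa):
#   count_fasta_records reads.fa
```

For the reads listed above, this should report the 22 records mentioned at the top of the post.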

lh3 commented 11 years ago

With your command line, fermi should not use ropebwt. Can you find the string ropebwt in your makefile?

ctSkennerton commented 11 years ago

Yes, I can. The full makefile is shown below:

FERMI=fermi
UNITIG_K=200
OVERLAP_K=240

all:cdhitout_0.85.p2.mag.gz

# Construct the FM-index for raw sequences
cdhitout_0.85.raw.fmd:../cdhitout_0.85.fa
    (cat ../cdhitout_0.85.fa) | $(FERMI) ropebwt -a bcr -v3 -btNf cdhitout_0.85.raw.tmp - > $@ 2> $@.log

# Error correction
cdhitout_0.85.ec.fq.gz:cdhitout_0.85.raw.fmd
    (cat ../cdhitout_0.85.fa) | $(FERMI) correct -t 2  $< - 2> $@.log | gzip -1 > $@

# Construct the FM-index for corrected sequences
cdhitout_0.85.ec.fmd:cdhitout_0.85.ec.fq.gz
    $(FERMI) fltuniq $< 2> cdhitout_0.85.fltuniq.log | $(FERMI) ropebwt -a bcr -v3 -btf cdhitout_0.85.ec.tmp - > $@ 2> $@.log

# Generate unitigs
cdhitout_0.85.p0.mag.gz:cdhitout_0.85.ec.fmd
    $(FERMI) unitig -t 2 -l $(UNITIG_K) $< 2> $@.log | gzip -1 > $@

cdhitout_0.85.p1.mag.gz:cdhitout_0.85.p0.mag.gz
    $(FERMI) clean $< 2> $@.log | gzip -1 > $@
cdhitout_0.85.p2.mag.gz:cdhitout_0.85.p1.mag.gz
    $(FERMI) clean -CAOFo $(OVERLAP_K) $< 2> $@.log | gzip -1 > $@

lh3 commented 11 years ago

I see. I was using an old version of run-fermi.pl; more recent versions use ropebwt by default. Anyway, I can see the problem now: fltuniq has filtered out all the reads, while ropebwt expects some input and thus hangs for some reason. For the time being, you can edit the makefile and change the line containing fltuniq to:

cat $< | $(FERMI) ropebwt -a bcr -v3 -btf cdhitout_0.85.ec.tmp - > $@ 2> $@.log

This skips fltuniq. I will look into the ropebwt issue later. In any case, you probably won't get a good assembly from these reads.
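
The makefile edit suggested above can also be scripted rather than done by hand. This is a minimal sketch, assuming the fltuniq recipe has exactly the form shown in the posted makefile; skip_fltuniq is a hypothetical helper name and fermi.mak a hypothetical file name, neither of which is part of fermi.

```shell
# Hypothetical helper (not part of fermi): read a makefile on stdin and
# replace the "$(FERMI) fltuniq $< 2> <log> |" prefix of a recipe with
# "cat $< |", so the input reaches ropebwt without passing through fltuniq.
# All other lines, and any leading indentation, are left untouched.
skip_fltuniq() {
    sed 's/\$(FERMI) fltuniq \$< 2> [^ ]* |/cat $< |/'
}

# Example (fermi.mak is an assumed file name):
#   run-fermi.pl -k 200 -p cdhitout_0.85 reads.fa > fermi.mak
#   skip_fltuniq < fermi.mak | make -f -
```

Saving the generated makefile first also makes it easy to inspect the recipes before running them.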

lh3 commented 11 years ago

For small files, we'd actually better not use fltuniq anyway. I should consider adding an option to skip fltuniq altogether.
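
Until such an option exists, one possible stopgap is a wrapper that falls back to the unfiltered input whenever the filter emits nothing. This is only a sketch of the idea, not how fermi or run-fermi.pl behaves; filter_or_passthrough is a hypothetical helper name.

```shell
# Hypothetical helper (not part of fermi): run a filter command on a file and
# print the filter's output, or the unfiltered file if the filter printed nothing.
filter_or_passthrough() {
    filter=$1   # command that takes the input file as its last argument
    input=$2    # file to filter
    tmp=$(mktemp) || return 2
    # The filter's exit status is ignored; only empty output matters here.
    sh -c "$filter \"\$1\"" sh "$input" > "$tmp" || true
    if [ -s "$tmp" ]; then
        cat "$tmp"
    else
        echo "filter produced no output; passing $input through unchanged" >&2
        cat "$input"
    fi
    rm -f "$tmp"
}

# Example, reusing the commands from the makefile above:
#   filter_or_passthrough 'fermi fltuniq' cdhitout_0.85.ec.fq.gz \
#     | fermi ropebwt -a bcr -v3 -btf cdhitout_0.85.ec.tmp - > cdhitout_0.85.ec.fmd
```

In the fallback case the file is passed through as-is, which matches the cat $< workaround suggested earlier in the thread.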

ctSkennerton commented 11 years ago

Thanks. Specifying -B in run-fermi.pl prevents the hang as well.