mikolmogorov / Ragout

Chromosome-level scaffolding using multiple references

AssertionError: seq_end > seq_start #11

Closed: kmhernan closed this issue 8 years ago

kmhernan commented 8 years ago

Running Ragout, I get this assertion error:

[18:03:17] INFO: Generating FASTA output
Traceback (most recent call last):
  File "/group/biocore-analysis/khernandez/CRI-BIO-338-Peds-deJong/2016_analysis/tools/ragout-1.2-linux-x86_64/ragout.py", line 32, in <module>
    sys.exit(main())
  File "/group/biocore-analysis/khernandez/CRI-BIO-338-Peds-deJong/2016_analysis/tools/ragout-1.2-linux-x86_64/ragout/main.py", line 268, in main
    return run_ragout(args)
  File "/group/biocore-analysis/khernandez/CRI-BIO-338-Peds-deJong/2016_analysis/tools/ragout-1.2-linux-x86_64/ragout/main.py", line 95, in run_ragout
    run_unsafe(args)
  File "/group/biocore-analysis/khernandez/CRI-BIO-338-Peds-deJong/2016_analysis/tools/ragout-1.2-linux-x86_64/ragout/main.py", line 232, in run_unsafe
    out_gen.make_output(args.out_dir, recipe["target"])
  File "/group/biocore-analysis/khernandez/CRI-BIO-338-Peds-deJong/2016_analysis/tools/ragout-1.2-linux-x86_64/ragout/scaffolder/output_generator.py", line 43, in make_output
    self._output_agp(out_agp, out_prefix)
  File "/group/biocore-analysis/khernandez/CRI-BIO-338-Peds-deJong/2016_analysis/tools/ragout-1.2-linux-x86_64/ragout/scaffolder/output_generator.py", line 106, in _output_agp
    chr_end = chr_pos + contig.length()
  File "/group/biocore-analysis/khernandez/CRI-BIO-338-Peds-deJong/2016_analysis/tools/ragout-1.2-linux-x86_64/ragout/shared/datatypes.py", line 146, in length
    return self.perm.length()
  File "/group/biocore-analysis/khernandez/CRI-BIO-338-Peds-deJong/2016_analysis/tools/ragout-1.2-linux-x86_64/ragout/shared/datatypes.py", line 49, in length
    assert self.seq_end > self.seq_start
AssertionError

I have run Ragout on two other assemblies with no problems using this same version of Ragout. I used HAL input from progressiveCactus for all of them. Let me know if any more information is needed.

mikolmogorov commented 8 years ago

Thanks for your feedback!

Could you also send me the ragout.log file from the working directory?

It would probably be hard to guess where this error comes from (most likely, something strange happened in a different part of the program). Do you think it would be possible to send me the intermediate data (it will include contig sequences) so I can reproduce the bug? Otherwise, I can prepare a special debug version to track it down.

kmhernan commented 8 years ago

Thanks for the quick reply. It seems like it "finishes": it writes out the FASTA and part of the AGP (it doesn't write the links file) before it throws the error. By intermediate files, do you mean everything in the hal-workdir? I spoke to my collaborators and they are OK with me giving you whatever data you need. Let me know which files you need and where I should email a Dropbox link (I don't want to share it here, as it's part of an upcoming manuscript).

ragout.log.txt

mikolmogorov commented 8 years ago

Thank you! I don't see anything suspicious in the graph, so it would be great if you could send me the files from "hal-workdir" (except "alignment.maf", it's not needed) and the recipe file. You can email me at fenderglass@gmail.com

kmhernan commented 8 years ago

Email sent. Let me know if anything else is needed.

mikolmogorov commented 8 years ago

Please check the newest update from master, it should be fixed now.

kmhernan commented 8 years ago

Ok great. Thank you. I'm running now 👍

kmhernan commented 8 years ago

Hmm, I got this error. Is it related?

[06:49:38] ERROR: An error occured while running Ragout:
[06:49:38] ERROR: Error building overlap graph: Non-zero return code when calling ragout-overlap module: -6

mikolmogorov commented 8 years ago

It should not be related. Is it the same dataset? Could it be an out-of-memory issue?

kmhernan commented 8 years ago

Same sample, but I re-ran progressiveCactus using all of the assembly sequences (including the small ones). It could be memory, but I do have ~1 TB on the machine. I'm trying the previous version to check.

mikolmogorov commented 8 years ago

Hm, 1 TB is definitely enough (it took about 32 GB for me). Then it might be some strange bug in the overlap module. Could you send me "target.fasta" from "hal-workdir" so I can check it?

kmhernan commented 8 years ago

Sure thing. Sending email shortly

mikolmogorov commented 8 years ago

Thanks again! There are about 800k sequences in the newer assembly (vs. 80k in the original one). It seems that the overlap module can't efficiently handle that many sequences (although the total length is almost the same). Moreover, even if it did finish, searching for paths in this graph would be extremely slow.

I would recommend sticking with the previous strategy: filter out sequences that are too short. This somewhat contradicts what we have in the manual, but that was written for bacterial assemblies, and the assembly graph for mammalian-scale assemblies can be too tangled.

Sorry about that - this part of Ragout (refinement with the assembly graph) is still unstable for large genomes. If you encounter any other strange issues, you may consider disabling it (the --no-refine option). Most likely the sequence loss will be minor (especially for protein-coding sequences).
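
The "filter out sequences that are too short" step suggested above can be scripted before the assembly is passed to progressiveCactus/Ragout. This is a minimal sketch, assuming Biopython is available; the file names and the 10 kb cutoff are hypothetical placeholders, not values recommended in this thread:

```python
# Sketch of a length filter for an assembly FASTA, assuming Biopython.
# File names and the cutoff below are illustrative only.
from Bio import SeqIO

MIN_LENGTH = 10000  # hypothetical cutoff; tune for your assembly


def filter_short_contigs(in_fasta, out_fasta, min_length=MIN_LENGTH):
    """Write only sequences of at least min_length bp to out_fasta."""
    kept = (rec for rec in SeqIO.parse(in_fasta, "fasta")
            if len(rec.seq) >= min_length)
    return SeqIO.write(kept, out_fasta, "fasta")  # number of records written


if __name__ == "__main__":
    n = filter_short_contigs("contigs.fasta", "contigs.filtered.fasta")
    print("kept {} sequences".format(n))
```

A generator expression is used so the whole assembly is streamed rather than loaded into memory at once, which matters for large genomes.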

kmhernan commented 8 years ago

Thanks for the clarification. That makes sense. I reran the same dataset I originally sent you with the patch, and it ran without the error. The results were completely different from the first run, but they made more biological sense. I believe I will go ahead and close this. I do suggest adding your last comment to the online manual/docs. I know that another fish assembly used Ragout, but I don't know if they only used a subset of the scaffolds.

kmhernan commented 8 years ago

@fenderglass I noticed that the synteny estimates were quite different between this patch version and the previous version I was using.

Previous version:

[11:22:15] root: DEBUG: "JD30" synteny blocks coverage: 94.18%
[11:22:16] root: DEBUG: "GRCz10" synteny blocks coverage: 91.58%

Patch version:

[10:35:58] root: DEBUG: "JD30" synteny blocks coverage: 95.22%
[10:35:58] root: DEBUG: "GRCz10" synteny blocks coverage: 59.3%

Is this to be expected?

mikolmogorov commented 8 years ago

I agree, the manual should be updated with respect to large genome assemblies.

Do I understand correctly: you are saying that the newest assembly of the dataset you sent me is very different from the assembly of the same dataset made with the previous Ragout version (which crashed at the end, but produced FASTA output)? This is a bit strange, but maybe it is related to your last concern...

Regarding synteny blocks: did you originally use a version from the releases page (i.e., not the newest from master)? We have recently made some changes in the synteny block detection module that made it more conservative (blocks are more reliable, but sometimes shorter). Are you saying that the newer version seems to be more correct? Could you send me the full log from the newest run?

kmhernan commented 8 years ago

Yes, the assemblies of the same dataset (the one from the run that crashed before) are very different between versions. I think it is related. Ah... my first run was using v1.2 and the patched one is version 2.0b; I didn't notice the version difference. That makes sense, so I don't think I need to send the log file, but I am attaching it anyway. Feel free to close this if that is the issue. ragout.log.txt

mikolmogorov commented 8 years ago

Ok, I don't see anything suspicious in the log. Let me know if you have any other questions!