kmhernan closed this issue 8 years ago
Thanks for your feedback!
Could you also send me the ragout.log file from the working directory?
It would probably be hard to guess where this error comes from (most likely, something weird happened in a different part of the program). Do you think it would be possible to send me the intermediate data (it will include contig sequences) so I can reproduce the bug? Otherwise, I can prepare a special debug version to track it down.
Thanks for the quick reply. It seems like it "finishes", as it writes out the fasta and parts of the agp (it doesn't write the links file) before it throws the error. By intermediate files, do you mean everything in the hal-workdir? I spoke to my collaborators and they are ok with me giving you whatever data you need. Let me know which files, and where I should email a dropbox link to (I don't want to share it here as it's part of an upcoming manuscript).
Thank you! I don't see anything suspicious in the graph, so it would be great if you could send me the files from "hal-workdir" (except "alignment.maf", it's not needed) and the recipe file. You can email me at fenderglass@gmail.com
Email sent. Let me know if anything else is needed.
Please check the newest update from master, it should be fixed now.
Ok great. Thank you. I'm running now 👍
Hmm I have this error, is it related?
[06:49:38] ERROR: An error occured while running Ragout:
[06:49:38] ERROR: Error building overlap graph: Non-zero return code when calling ragout-overlap module: -6
It should not be related. Is it the same dataset? Could it be an out-of-memory issue?
Same sample, but I re-ran progressiveCactus using all of the assembly (including the small sequences). It could be memory, but I do have ~1 TB on the machine. I'm trying with the previous version to check.
Hm, 1 TB is definitely enough (it took about 32 GB for me). Then it might be some strange bug in the overlap module. Could you send me "target.fasta" from "hal-workdir", so I can check it?
Sure thing. Sending email shortly
Thanks again! There are about 800k sequences in the newer assembly (vs 80k in the original one). It seems that the overlap module can't efficiently handle that many sequences (although the total length is almost the same). Moreover, even if it did finish, searching for paths in this graph would have been extremely slow.
I would recommend sticking with the previous strategy: filter out sequences that are too short. This somewhat contradicts what we have in the manual, but the manual was written for bacterial assemblies, and the assembly graph for mammalian-scale assemblies could be too tangled.
Sorry about that. This part of Ragout (refinement with the assembly graph) is still unstable for large genomes. If you encounter any other strange issues, you may consider disabling it (--no-refine option). Most likely the sequence loss will be minor (especially for protein-coding sequence).
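For anyone following the recommendation above to filter out sequences that are too short before running Ragout, here is a minimal sketch in plain Python. It is not part of Ragout: the function names and the file paths are hypothetical, and the thread does not state a length cutoff, so `min_length` is a value you would need to choose for your own assembly.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length):
    """Keep only records whose sequence has at least min_length bases."""
    for header, seq in records:
        if len(seq) >= min_length:
            yield header, seq


def write_fasta(records, handle, width=60):
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


def filter_fasta_file(in_path, out_path, min_length):
    """Copy in_path to out_path, dropping sequences shorter than min_length.

    Both paths and the cutoff are assumptions supplied by the caller.
    """
    with open(in_path) as fin, open(out_path, "w") as fout:
        write_fasta(filter_by_length(read_fasta(fin), min_length), fout)
```

A call such as `filter_fasta_file("contigs.fasta", "contigs.filtered.fasta", min_length)` would produce the reduced assembly; the filtering has to happen before the alignment step so that the HAL file and the target sequences stay consistent.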
Thanks for the clarification. That makes sense. I reran the same dataset I originally sent you with the patch and it ran without the error. The results were completely different from the first run, but made more biological sense. I believe I will go ahead and close this. I do suggest adding your last comment to your online manual/docs. I know that another fish assembly used Ragout, but I don't know if they only used a subset of the scaffolds.
@fenderglass I noticed that the synteny estimates were quite different between this patch version and the previous version I was using.
Previous version:
[11:22:15] root: DEBUG: "JD30" synteny blocks coverage: 94.18%
[11:22:16] root: DEBUG: "GRCz10" synteny blocks coverage: 91.58%
Patch version:
[10:35:58] root: DEBUG: "JD30" synteny blocks coverage: 95.22%
[10:35:58] root: DEBUG: "GRCz10" synteny blocks coverage: 59.3%
Is this to be expected?
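To compare coverage between two runs without reading the logs by eye, the DEBUG lines can be extracted with a short script. This is only a sketch: it assumes the line format shown in the excerpts above, the function names are hypothetical, and if a log reports coverage for the same genome more than once, the last value wins.

```python
import re

# Matches lines such as:
# [11:22:15] root: DEBUG: "JD30" synteny blocks coverage: 94.18%
COVERAGE_RE = re.compile(
    r'"(?P<genome>[^"]+)" synteny blocks coverage: (?P<pct>\d+(?:\.\d+)?)%'
)


def parse_coverage(log_text):
    """Return {genome: coverage percent} parsed from the text of a log."""
    coverage = {}
    for line in log_text.splitlines():
        match = COVERAGE_RE.search(line)
        if match:
            coverage[match.group("genome")] = float(match.group("pct"))
    return coverage


def coverage_change(old_log_text, new_log_text):
    """Return {genome: new - old} in percentage points for shared genomes."""
    old = parse_coverage(old_log_text)
    new = parse_coverage(new_log_text)
    return {g: round(new[g] - old[g], 2) for g in old if g in new}
```

Applied to the four lines above, this shows "GRCz10" dropping by about 32 percentage points (91.58% to 59.3%) while "JD30" rises by about 1 point.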
I agree, the manual should be updated with respect to large genome assemblies.
Do I understand correctly: you are saying that the newest assembly of the dataset you sent me is very different from the assembly of the same dataset made with the previous Ragout version (which crashed at the end, but produced fasta output)? This is a bit strange, but it may be related to your last concern.
Regarding synteny blocks: did you originally use a version from the releases page (i.e., not the newest from master)? We have recently made some changes in the synteny block detection module, which made it more conservative (blocks are more reliable but sometimes shorter). You are saying that the newer version seems to be more correct? Could you send me the full log from the newest run?
Yes, the results for the same dataset that crashed before are very different between versions. I think it is related. Ah... my first run was using v1.2 and the patched one is version 2.0b. I didn't notice the large difference. That makes sense, so I don't think I need to send the log file, but I am attaching it anyway. Feel free to close if this is the issue. ragout.log.txt
Ok, I don't see anything suspicious in the log. Let me know if you have any other questions!
Running Ragout I get this assertion error:
I have run Ragout on 2 other assemblies with no problem using this same version of ragout. I used HAL input from progressiveCactus for all. Let me know if there is more information that is needed.