Open Raina-M opened 4 years ago
It's less dependent on the genome size per se than the fragmentation of the genome itself. If your reference genome is very fragmented (N50 < 10kb etc.) the size of the matrix will be enormous no matter the resolution and it will indeed take up a lot of memory and time. You need to preprocess your initial genome somewhat to avoid this - for instance you could run aggressive filtering so that many smaller contigs are eliminated from the scaffolding and don't clog up your RAM.
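For illustration, here is a minimal Python sketch of such a length filter (the 10 kb cutoff and the file names are placeholders to adapt to your data, not recommended values):

# Minimal sketch: drop contigs shorter than a length cutoff before scaffolding.
# The cutoff and file names below are placeholders.
def filter_fasta(in_path, out_path, min_len=10_000):
    # Naive FASTA parsing: split the file on '>' record markers.
    with open(in_path) as fin:
        records = fin.read().split(">")[1:]
    with open(out_path, "w") as fout:
        for rec in records:
            header, *seq_lines = rec.splitlines()
            seq = "".join(seq_lines)
            if len(seq) >= min_len:  # keep only contigs above the cutoff
                fout.write(f">{header}\n{seq}\n")

filter_fasta("reference.fa", "reference.filtered.fa")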
So I decreased the size of the fragment contacts file, which was generated by hicstuff (I used several replicates of Hi-C read files to generate the contact information at first).
What size did you use and did you try generating a matrix using restriction fragments instead?
Hello,
It is difficult to give you much information on the speed of the program as it can vary depending on the GPU you have.
As baudrly said, the process can be long when the assembly is very fragmented. instaGRAAL was tested with a fragmented human assembly, where N50=300kb and #contigs=10481. As this genome is large, the parameter --level was set to 5, so the exact command line was:
instagraal --level 5 --cycle 15 hic_folder reference.fa output_folder
The Hi-C dataset used in this case had 2*442M reads (with 88% mapping). They were processed with hicstuff, and the matrix was generated using restriction fragments. On an RTX 2080 Ti, it took six days to run 15 cycles, and it indeed used ~30 GB on the host. At each cycle, a file called "genome.fasta" is created, so you can inspect it to see the improvements.
It is however surprising that it crashed on a 400 GB machine, so there must be some trouble somewhere. Could you please indicate the stats of your input assembly and the exact commands you used to run hicstuff and instagraal?
Thank you for the information, @baudrly and @nadegeguiglielmoni. After comparing with the human sample in your case, I do not think my input assembly is too fragmented, with N50=1.05Mb and #contigs=8148. The Hi-C data sets I used for testing had 2*111M reads (file size: 4.5G and 5.1G). Here is the command used to generate the contact matrix:
hicstuff pipeline --aligner=bowtie2 --enzyme=DpnII \
--filter --threads=32 --plot \
-g $REF $INPUT1 $INPUT2 --outdir=$WD
For instagraal, all parameters were left at their defaults, so here is the command I used for the test run:
instagraal hic_folder reference.fasta
It is indeed weird that it crashed on a 400G machine.
-What kind of crash was it? (Segfault, MemoryError in python) Do you have a stack trace by any chance?
-At what stage did it crash? (While generating the hdf5 files, computing the initial model, in the middle of scaffolding...)
-In the meantime, could you try to run the program with a lower resolution (advised for larger genomes) and possibly more filtering, and also with debugging enabled? Something like:
instagraal --level 5 --coverage-std 0 --debug hic_folder reference.fa output_folder
Thanks @baudrly! To be clear, the contact matrix generated for my 'real sample' is much larger than the one for the 'small sample' because more replicates of Hi-C reads went into it. Like I mentioned at the beginning, abs_fragments_contacts_weighted.txt for the 'small sample' is ~780M, while that for the 'real sample' is ~5.3G. So the 'real sample' run crashed on a 400G machine. Then I started a test run on the 'small sample' by decreasing the size of the contacts file (using only one Hi-C replicate).
-What kind of crash was it? (Segfault, MemoryError in python) Do you have a stack trace by any chance?
The 'real sample' run crashed after 5 hours with 'out of memory' on the server (400 GB RAM, 6 GB of memory on the GPU). The 'small sample' run went well, but I terminated it a few days ago because it seemed to run endlessly...
-At what stage did it crash? (While generating the hdf5 files, computing the initial model, in the middle of scaffolding...)
Sorry, I cannot remember or find logs for the 'real sample' because it crashed over one month ago, after which I turned to 'small sample' testing. But what I am sure about is that the program did not manage to build the pyramid.
-In the meantime, could you try to run the program with a lower resolution (advised for larger genomes) and possibly more filtering, and also with debugging enabled?
I will check it out and give you feedback. Also, what impact would it have if I decreased --cycle? Even for the 'small sample', one cycle took around one day, so 100 iterations look endless...
Like I mentioned at the beginning, abs_fragments_contacts_weighted.txt for the 'small sample' is ~780M, while that for the 'real sample' is ~5.3G. So the 'real sample' run crashed on a 400G machine. Then I started a test run on the 'small sample' by decreasing the size of the contacts file (using only one Hi-C replicate).
Did you concatenate the abs_fragments_contacts_weighted.txt files for each replicate on top of each other, so that duplicates could appear, or was the 5.3G file independently generated with hicstuff? If it was, did you use the same command as the one you provided above? Also, does the program crash when you run it with more filtering (-c 0)?
The 'small sample' run went well, but I terminated it a few days ago because it seemed to run endlessly...
To be clear, you don't have to wait for the program to run all the cycles you specified as input; that number is only an upper limit. The program creates a new genome after each cycle, so you can check it out and stop if it's satisfactory. If you ran the version of the program with the movie enabled, you can also get a clearer look at how well the scaffolding did.
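For example, a rough Python sketch to sanity-check an intermediate genome.fasta by its scaffold count and N50 (the path below is hypothetical and depends on your output folder):

def assembly_stats(fasta_path):
    # Collect scaffold lengths (naive FASTA parsing).
    lengths = []
    with open(fasta_path) as fin:
        for rec in fin.read().split(">")[1:]:
            lengths.append(len("".join(rec.splitlines()[1:])))
    lengths.sort(reverse=True)
    half, running = sum(lengths) / 2, 0
    for length in lengths:
        running += length
        if running >= half:  # first scaffold length covering half the assembly
            return len(lengths), length

count, n50 = assembly_stats("output_folder/genome.fasta")  # hypothetical path
print(f"{count} scaffolds, N50 = {n50} bp")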
I will check it out and give you feedback. Also, what impact would it have if I decreased --cycle? Even for the 'small sample', one cycle took around one day, so 100 iterations look endless...
Past a certain point, adding more cycles only has marginal benefit. In most cases, the scaffolding actually happens pretty quickly, in the first few cycles, as in the animation below (each of its twenty frames represents one cycle):
so it's unlikely you'd lose out on much by stopping the program after a while, especially if you can polish the scaffolds afterwards.
Also, unless you are really distrustful of the initial contigs' quality, binning more aggressively (level > 5) could speed up the scaffolding by several orders of magnitude at little extra cost, especially given your N50.
A 5.3 GB matrix seems normal; the matrix we had for human, with 442 million read pairs, was also a little more than 5 GB. But it should not crash a 400 GB machine (in our case, the machine was "only" 128 GB and it did not go over 30 GB to my knowledge). I would suggest that you run the latest version of hicstuff again with all your Hi-C reads, and then try instaGRAAL. I also noticed that you did not use the iterative mapping for hicstuff (parameter --iterative). It can impact the percentage of mapped reads, so I would recommend it. Last but not least, is your assembly pure ONT? We have noticed that with pure ONT assemblies there could be cross-mapping that impacted the scaffolding, but that is no longer the case when the assembly is polished with Illumina or PacBio.
Thank you both for the prompt replies! @baudrly
Did you concatenate the abs_fragments_contacts_weighted.txt files on top of each other, so that duplicates could appear, or was the 5.3G file independently generated with hicstuff?
abs_fragments_contacts_weighted.txt was generated by hicstuff rather than by concatenation.
Also, does the program crash when you run it with more filtering (-c 0)?
Sorry, I have not tried it yet, but I will update you after running with the tuned parameters.
If you ran the version of the program with the movie enabled, you can also get a clearer look at how well the scaffolding did.
I am curious how you achieved the visualization. Should I run HiC-box to have a look at the dynamics of the scaffolding results, or are there parameters to enable? FYI, I use instagraal 0.1.2.
@nadegeguiglielmoni
I would suggest that you run the latest version of hicstuff again with all your Hi-C reads, and then try instaGRAAL.
I used the latest version, hicstuff 2.0.8, and like I said the contact matrix was generated by hicstuff, not from concatenation. But I will try the --iterative option later on.
In addition, my assembly was generated by Canu using PacBio reads, and then polished with both PacBio and Illumina reads.
I am curious how you achieved the visualization. Should I run HiC-box to have a look at the dynamics of the scaffolding results, or are there parameters to enable? FYI, I use instagraal 0.1.2.
If you're talking about a visualization like this: you need to run the version with OpenGL support. You also need an X server for the movie to display, so it would probably not work on a cluster server.
If you're talking about the twenty-frame animation above, I used the --save-matrix option so that it would keep a copy of the matrix at each cycle in the output folder, which I then rendered and stitched into a gif. (I wouldn't recommend using the intermediate matrices for advanced data analyses, as they have been force-fitted into low-res 1000x1000 contact maps; they are mostly there for visualization purposes.)
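For reference, the stitching can be done with a few lines of Python, roughly like the sketch below; the matrix_cycle_*.txt file pattern and plain-text format here are only assumptions, so adapt the loading step to whatever --save-matrix actually writes in your output folder:

import glob
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no X server needed
import matplotlib.pyplot as plt
import imageio.v2 as imageio

frames = []
for i, path in enumerate(sorted(glob.glob("output_folder/matrix_cycle_*.txt"))):  # hypothetical naming
    mat = np.loadtxt(path)  # adapt to the actual on-disk format
    plt.imsave(f"frame_{i:03d}.png", np.log1p(mat), cmap="Reds")  # log scale for Hi-C counts
    frames.append(imageio.imread(f"frame_{i:03d}.png"))

imageio.mimsave("scaffolding.gif", frames, duration=0.5)  # one frame per cycle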
FYI, I use instagraal 0.1.2.
I recommend using a newer version (currently 0.1.6 on PyPI but your best bet is to simply install the very latest from github, possibly this one if you don't want to rely on OpenGL).
To improve speed, you should go for the version without OpenGL, because the display slows down the process. When you have run hicstuff, you should look at the plots to check that you have a good proportion of 3D contacts.
Hello there. As you suggested, I installed the version without OpenGL to improve speed, but an error occurred due to some GL references in instagraal.py and cuda_lib_gl_single.py.
INFO :: cycle = 0
INFO :: generate new fasta file...
Traceback (most recent call last):
File "/package/instagraal/bin/instagraal", line 11, in <module>
load_entry_point('instagraal==0.1.6', 'console_scripts', 'instagraal')()
File "/package/instagraal/lib/python3.7/site-packages/instagraal-0.1.6-py3.7.egg/instagraal/instagraal.py", line 1190, in main
save_matrix=save_matrix,
File "/package/instagraal/lib/python3.7/site-packages/instagraal-0.1.6-py3.7.egg/instagraal/instagraal.py", line 334, in full_emsampler
matrix = self.simulation.sampler.gpu_im_gl.get()
AttributeError: 'sampler' object has no attribute 'gpu_im_gl'
Do you have any clues about how to solve this problem?
Hello,
I am sorry for not answering earlier. We are investigating this issue. In the meantime, could you try installing this fork https://github.com/nowanda92/instaGRAAL ? You may get an error that the kernels are missing somewhere; if so, just copy them from the repo directly to where they are missing. I am sorry for the trouble; we are still working on improving instaGRAAL.
Hello,
I have the same issue with gpu_im_gl when using --save-matrix (Docker). Do you have any solution?
I tried it with the default parameters, and it runs to: INFO :: generate new fasta file... INFO :: cycle=14
Do you know how many cycles it is going to take? And if the job is killed by the system, is the generated genome.fasta good to use?
@anhong11 By default instaGRAAL runs for 100 cycles. If the job is killed before it reaches 100 cycles, you can use the output as long as the likelihood has converged (you can plot list_likelihood.txt to check that). Rather than using genome.fasta directly, you should run the command instagraal-polish using the output info_frags.txt and the initial fasta file to perform automatic curation of the assembly.
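As a minimal plotting sketch (assuming list_likelihood.txt contains one value per line; adjust the path to your output folder):

import matplotlib
matplotlib.use("Agg")  # works on a headless server
import matplotlib.pyplot as plt

# Assumed format: one likelihood value per line in the instaGRAAL output folder.
with open("output_folder/list_likelihood.txt") as fin:
    likelihoods = [float(x) for x in fin.read().split()]

plt.plot(likelihoods)
plt.xlabel("cycle")
plt.ylabel("likelihood")
plt.savefig("likelihood.png")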
Thank you. It looks like it has converged, right?
At first I ran a real-case sample with a genome size of ~2.6 Gb, and instagraal crashed on a 400 GB machine. So I decreased the size of the fragment contacts file, which was generated by hicstuff (I used several replicates of Hi-C read files to generate the contact information at first; after the machine crashed, I used only one replicate to generate smaller contact files). But still, the smaller sample needs ~30 GB of memory on the host and 2 GB on the GPU, with a ~780M abs_fragments_contacts_weighted.txt. In addition, the program ran super slowly: it had been running for over a month before being killed. Before it ended, it had generated a new fasta file at each cycle, taking approximately a day per cycle. But that's only a test sample, so I terminated it at cycle 28. Nevertheless, could you provide any information about the computational resource usage of the program? As I have described, the 'small' sample seems to run endlessly... I doubt it is feasible to apply it to my real sample.