Closed pkerpedjiev closed 2 weeks ago
It looks like there's a slide
parameter in the config that may be helpful?
Hi Peter,
This is a small window size for the default behavior, but I am assuming you are trying to get the other behavior, which is color-proportional to the number of reads mapped to each bin?
I have tested that very little since I haven't used that feature, but it is too bad you are finding this bug.
Maybe we could fix this by adding the --no-PG
flag to every samtools command?
Do you have any way to confirm that the numerous @PG lines are the issue?
Oh, I actually wasn't trying to do the color-proportional read density mode. I'll crank up the "window" parameter and give it a try.
If that doesn't work I'll try adding the --no-PG
flag.
Thanks for the suggestions!
Great, I usually go for a window size greater than or equal to 2000 for whole genome (and 500 bp is the minimum).
So I raised the window size to 5000 but I'm running into really large RAM usage The following command gets killed on a C7I.8xlarge instance (64gb RAM):
minimap2 -t 16 -f 10000 -s 1000 -K 800000 -ax ava-ont --dual=yes --eqx results/mPanTro3.pri.cur.20231122.identity.5000.fasta temp/mPanTro3.pri.cur.20231122.identity.5000.2.query.fasta -o temp/mPanTro3.pri.cur.20231122.identity.5000.10000.2.ref_0.bam
Do you happen to remember what params you used for running StainedGlass cooler on the T2T assembly and how long it took?
I'm just wondering if I'm doing something silly or if a long run time / high RAM usage is expected.
I used 10k windows for CHM13 (at least one of the runs). For whole genome I would try making many more alignment batches in the config yaml (n_batches: 1000) and try again. Assuming you haven't adjusted that already?
However, I will say when I ran this for CHM13 I was on a node with 2TB of ram and wasn't paying attention to memory use.
I think CHM13 took 2 days but as I said I was using a pretty massive node on our HPC (80 cores + hyper-threading I think).
I haven't adjusted the number of batches but the runtime is looking prohibitively long given the resources at my disposal (about 32 cores and 128 gigs of RAM). That minimap2 -f 10000
option really grinds progress to a halt. But with a lower value the plots don't look quite as filled out.
Maybe I'll get some cooler_density
style plots ready using bwa aln
and then come back to the long read ones. Thanks for the help and quick replies!
You may want to play with:
https://github.com/marbl/ModDotPlot
Though I am not sure they have the cooler outputs.
Thanks for the link! I didn't know that existed.
It's a bummer they didn't add that speedy ANI calculation to StainedGlass but that's alright. I managed to create the density plots for the apes:
https://resgen.io/paper-data/Yoo%202024%20-%20All/views/JwAI1gLuR-azCKiaYcbz6Q
It seems like plotting the density of short reads approximates sequence identity but I'm not sure I can prove it. I should probably read the ModDotPlot paper to see what they're doing.
In any case, I appreciate your help! For the tasks that didn't exceed my RAM StainedGlass worked perfectly.
I am glad to have a new tool on the block but will be sad to see the name stainedglass fade. I just spoke with Adam at the T2T about moddotplot and he said they would be willing to add in the cooler files as an output if that's something you wanted.
Cooler as an output would be great! I, for one, would be interested in comparing the resulting plots. And if there's cooler output, the identity plots can be viewed in HiGlass with gene annotations and other additional tracks.
Now that you mention it, from their documentation it sounds like they produce paired end bedfiles. If those are the same format that StainedGlass produces, then it should be nearly trivial to take that output and plug it in to the StainedGlass workflow. Or vice versa.
Hi Mitchell,
I'm trying to run StainedGlass on the new
mPanTro3
assembly. When I run the following:I get this error:
If I try to view the headers of the bam files I get:
If I'm not mistaken StainedGlass breaks the input fasta into lots of smaller chunks that it maps back to. Based on https://github.com/samtools/samtools/issues/1613 I'm assuming that that produces too many headers to fit into the bam file properly.
I'm trying with
window=64
right now but I'm wondering if there's a way to split those alignment steps per chromosome so each individual bam file has only a limited number of sequences in the header. Or to skip thesamtools sort
step.Or maybe you've encountered this before and have a different solution?