mrvollger / StainedGlass

Make colorful identity heatmaps of genomic sequence
https://mrvollger.github.io/StainedGlass/
MIT License
98 stars 10 forks source link

samtools sort: failed to read header from "-" #48

Closed pkerpedjiev closed 2 weeks ago

pkerpedjiev commented 1 month ago

Hi Mitchell,

I'm trying to run StainedGlass on the new mPanTro3 assembly. When I run the following:

snakemake --cores 8 --config sample=.mPanTro3 \
        fasta=mPanTro3.pri.cur.20231122.fasta.gz window=32,
        cooler_window=100 --use-conda cooler \
        -s  /opt/StainedGlass/workflow/Snakefile \
        --configfile /opt/StainedGlass/config/config.yaml

I get this error:

[E::sam_hdr_create] Invalid header line: must start with @HD/@SQ/@RG/@PG/@CO
samtools sort: failed to read header from "-"

If I try to view the headers of the bam files I get:

(base) toc3~/projects/StainedGlass/data(main|…) % samtools view -H temp/mPanTro3.pri.cur.20231122.32.10000.2.ref_0.unsorted.bam | less
[main_samview] failed to write the SAM header
samtools view: error closing standard output: -1

If I'm not mistaken StainedGlass breaks the input fasta into lots of smaller chunks that it maps back to. Based on https://github.com/samtools/samtools/issues/1613 I'm assuming that that produces too many headers to fit into the bam file properly.

I'm trying with window=64 right now but I'm wondering if there's a way to split those alignment steps per chromosome so each individual bam file has only a limited number of sequences in the header. Or to skip the samtools sort step.

Or maybe you've encountered this before and have a different solution?

pkerpedjiev commented 1 month ago

It looks like there's a slide parameter in the config that may be helpful?

mrvollger commented 1 month ago

Hi Peter,

This is a small window size for the default behavior, but I am assuming you are trying to get the other behavior, which is color-proportional to the number of reads mapped to each bin?

I have tested that very little since I haven't used that feature, but it is too bad you are finding this bug.

Maybe we could fix this by adding the --no-PG flag to every samtools command?

Do you have any way to confirm that the numerous @PG lines are the issue?

pkerpedjiev commented 1 month ago

Oh, I actually wasn't trying to do the color-proportional read density mode. I'll crank up the "window" parameter and give it a try.

If that doesn't work I'll try adding the --no-PG flag.

Thanks for the suggestions!

mrvollger commented 1 month ago

Great, I usually go for a window size greater than or equal to 2000 for whole genome (and 500 bp is the minimum).

pkerpedjiev commented 1 week ago

So I raised the window size to 5000 but I'm running into really large RAM usage The following command gets killed on a C7I.8xlarge instance (64gb RAM):

minimap2             -t 16             -f 10000 -s 1000             -K 800000             -ax ava-ont              --dual=yes --eqx             results/mPanTro3.pri.cur.20231122.identity.5000.fasta temp/mPanTro3.pri.cur.20231122.identity.5000.2.query.fasta -o temp/mPanTro3.pri.cur.20231122.identity.5000.10000.2.ref_0.bam

Do you happen to remember what params you used for running StainedGlass cooler on the T2T assembly and how long it took?

I'm just wondering if I'm doing something silly or if a long run time / high RAM usage is expected.

mrvollger commented 1 week ago

I used 10k windows for CHM13 (at least one of the runs). For whole genome I would try making many more alignment batches in the config yaml (n_batches: 1000) and try again. Assuming you haven't adjusted that already?

However, I will say when I ran this for CHM13 I was on a node with 2TB of ram and wasn't paying attention to memory use.

I think CHM13 took 2 days but as I said I was using a pretty massive node on our HPC (80 cores + hyper-threading I think).

pkerpedjiev commented 1 week ago

I haven't adjusted the number of batches but the runtime is looking prohibitively long given the resources at my disposal (about 32 cores and 128 gigs of RAM). That minimap2 -f 10000 option really grinds progress to a halt. But with a lower value the plots don't look quite as filled out.

Maybe I'll get some cooler_density style plots ready using bwa aln and then come back to the long read ones. Thanks for the help and quick replies!

mrvollger commented 6 days ago

You may want to play with:

https://github.com/marbl/ModDotPlot

Though I am not sure they have the cooler outputs.

pkerpedjiev commented 5 days ago

Thanks for the link! I didn't know that existed.

It's a bummer they didn't add that speedy ANI calculation to StainedGlass but that's alright. I managed to create the density plots for the apes:

https://resgen.io/paper-data/Yoo%202024%20-%20All/views/JwAI1gLuR-azCKiaYcbz6Q

It seems like plotting the density of short reads approximates sequence identity but I'm not sure I can prove it. I should probably read the ModDotPlot paper to see what they're doing.

In any case, I appreciate your help! For the tasks that didn't exceed my RAM StainedGlass worked perfectly.

mrvollger commented 5 days ago

I am glad to have a new tool on the block but will be sad to see the name stainedglass fade. I just spoke with Adam at the T2T about moddotplot and he said they would be willing to add in the cooler files as an output if that's something you wanted.

pkerpedjiev commented 5 days ago

Cooler as an output would be great! I, for one, would be interested in comparing the resulting plots. And if there's cooler output, the identity plots can be viewed in HiGlass with gene annotations and other additional tracks.

Now that you mention it, from their documentation it sounds like they produce paired end bedfiles. If those are the same format that StainedGlass produces, then it should be nearly trivial to take that output and plug it in to the StainedGlass workflow. Or vice versa.