Difference in 2 and 3 copy kmers between read data sets

SheepwormJM commented 3 years ago

Hi,

I was looking to see what else I could pull out of the data and decided to compare the 2-copy and 3-copy kmers in different genomes.

For this I used the output files: genome.genome.spectra-cn.hist genome.genome.only.hist

And then I used awk to extract 2 and 3 copy kmers from the spectra-cn.hist file and sum them ($3 in file). For the 2-copy kmers I then added the 2-copy number from the only.hist file.

However, using the same genome, but two different data sets I found different results:

Genome    | 2-copy     | 3 copy*readdata
Tc_171026 | 66,301,944 | 12,801,905*Tci2W&NW
Tc_171026 | 65,692,035 | 13,050,376*poolseq

I can't understand why this would be? I'm assuming that the 2 and 3 copy etc are related to the number of times they occur in the genome. So, for the 2-copy, although the output in only.hist will be different between different read sets, I don't understand why the sum of all the 2-copy kmers in both files would change.

What am I getting wrong?

Thanks! Jenni

arangrhie commented 3 years ago

Hi Jenni,

The spectra-cn.hist and only.hist reflects what was found in the assembly. Adding them does not necessarily reflect the expected 2- or 3-copy in the genome.

Hope this answer your question! Arang

SheepwormJM commented 3 years ago

Thanks Arang. I think maybe I wasn't clear. I'm looking to compare Genome A with Genome B, where genome B should be an improved, less haplotypic version of genome A.

So I thought I might be able to do that by comparing the number of kmers which occur twice in genome A with genome B etc.

As I understand it, the spectra-cn.hist output file contains the information required for the layered histogram plots - the first column relating to the colour (read-only/1/2/3), the next the multiplicity of the kmers in the read set and the third the total count of distinct kmers.

While the only.hist output file contains those kmers found in the assembly, but not the read data, and only for kmers found once or twice in the genome assembly.

So, I was looking to extract the total number of kmers which occured in the genome twice - adding the kmers occuring twice in the only.hist file with those occuring twice ('2' in column 2) in the spectra-cn.hist file.

And then compare the data for genome A and genome B, to see if there was a reduction in the number of 2-copy kmers found in the assembly.

But first I decided to look at genome A twice, using two different sets of read data. Which was when I found that the number of kmers occurring twice in genome A was different for each read set.

I don't understand why this is - apologies if I've totally missed the point! If so, is there a way to compare the genomes like this - to see for reduction of the falsely duplicated regions?

Thanks, Jenni

arangrhie commented 3 years ago

Hi Jenni,

Is the purpose to compare the falsely duplicated kmers in each assembly A and B? If that's the purpose, run

$MERQURY/eval/false_duplication.sh <name.asm.spectra-cn.hist>
Get the number of additional k-mers found in an assembly in the single and two copy k-mer peaks from the reads
<name.asm.spectra-cn.hist>: spectra-cn.hist generated with Merqury for a pseudo-haplotype or mixed haplotype assembly

I assume your "haplotypic version of genome A" is a pseudo-haplotype or a haploid representation of a diploid genome.

The asm_only.hist are considered as artificial kmers (consensus error in the assembly), that do not exist in a genome. Those aren't biologically relevant, so my script above is not including them.

Of note - The spectra-cn.hist columns are 1: multiplicity in the assembly 2: multiplicity in the reads 3: number of kmers

So if you want to extract the num. of kmers occuring twice in your assembly,

two_cp=`awk -v cutoff=$cutoff '$1==2 && $2<cutoff {sum+=$NF} END {print sum}' $hist`

is doing what you want. This is extracting lines where the 1st column = 2, and adds up all the kmers in the 3rd column where the 2nd column is smaller than the cutoff. In case your haploid peak is larger than the 2-copy peak, you may need to adjust the cutoff accordingly.

Might be easier to explain if you could share your spectra-cn line plot.

Thanks, Arang

SheepwormJM commented 3 years ago

Hi Arang,

Thanks for your reply and sorry for my delayed response - had several things on and needed to think about it.

Is the purpose to compare the falsely duplicated kmers in each assembly A and B?

Yes, that's exactly what I'm wanting to do.

Our worms are quite heterozygous (the plots also look quite odd - might be issues with sequencing, not sure, maybe just low sequencing and a peak is hidden in the 'error' peak...??).

However, we basically have an assembly which we know still contains a lot of falsely duplicated regions. It's been made from a pool of worms using PacBio, and we don't expect to find all the kmers in the Illumina data to be also present in the assembly.

So, I'm more interested in the overall duplicated kmers in the assembly, rather than the ones only found in the read data. I think I've been doing what you suggest above using the spectra-cn.hist file. (Do your false duplication.sh script and the awk command essentially do the same thing)? I used awk to first extract the lines in the spectra-cn.hist file with $1==2, and then summed $3 separately from the extracted lines.

I then added the 2-copy data from the only.hist file.

But I found that my total differed if I used different read data on the same genome, and I can't understand why that would be happening.

This is my plot, and my assumption....

Thanks again, Jenni

arangrhie commented 3 years ago

Hi Jenni,

Sorry for the delay! There was a lot going on..

See this thread to obtain better visibility on your spectra-cn plot. The error bar is too high, making it hard to visualize the actual spectrum plot of kmers found > 1. Can you post the two spectra-cn plots, with the refined x and y limits?

Anyways, it seems like your read dataset is containing a lot more kmers that aren't seen in the assembly, and the error-bar is very high - meaning there is a lot not found in the reads.

To collect false duplications, I suggested to use

false_duplications.sh : This collects num. of kmers seen >1 in the ~2 copy of the read kmer distribution. The awk commends I posted are explaining what this script is doing.

spectra-cn.hist file with $1==2

This is collecting all kmers seen twice in the assembly. Without limiting to the ~2copy region, it's likely you are counting kmers that should be found multiple times in an assembly.

only.hist file with $1==2

This is actually representing the number of distinct kmers found multiple times only in the assembly. (Sorry for the misleading color code, I couldn't find a better way to visualize this!)

However, my point was that without knowing what is expected to be found once in the assembly, it's impossible to know what is duplicated. We can't assume the correct copy number of the assembly-only kmers are always 1.

Thanks, Arang

SheepwormJM commented 3 years ago

Hi Arang,

Thanks! I realise the difference now between your false_duplications and spectra_cn file. That's a very good point about the 2-copy vs the read 2-copy.

For the plots - thanks for the link to the other thread. I'd actually found it after I'd posted the above. I don't know if there is an issue with the plotting script though. I'd already been using the -m, -n and -z to re-plot the plots before, but couldn't get the y-axis to scale as I requested, although the x-axis always did. After finding that thread I tried it without the asm file and the y-axis scaled beautifully. I'm not sure if it's picking it up to plot the plot, then over-riding it with my asm errors - the plots would scale as I requested if the error bar was less than the specified y-axis limit.

Thanks again for the tool! Jenni

arangrhie commented 3 years ago

Ah right, yes, I purposely made the Y scale to include the maximum from the error (asm-only) when it is given. You could comment this line and try again, or just exclude the asm error file as you did. I just thought it would be more natural to include the bar instead of making it jump (clipped) out of the plot.

Enjoy evaluating your assemblies! Arang

SheepwormJM commented 3 years ago

Thanks Arang! It's been really good. :D

marbl / merqury

Difference in 2 and 3 copy kmers between read data sets #40