marbl / merqury

k-mer based assembly evaluation
Other
272 stars 19 forks source link

How to get blob plots #77

Closed Overcraft90 closed 2 years ago

Overcraft90 commented 2 years ago

Hi @arangrhie,

I got Merqury to work and obtained fairly good results for my human individuals. However, other than the standard plots how can I get the blob plots showing the "correctness" for the phasing?

I'm working with diploid assemblies for which I have both HiFi and HiC data (assembled with hifiasm) but no parental information e.g. no tiro data.

arangrhie commented 2 years ago

Hello, Merqury requires parental Illumina data or kmers for evaluating phasing correctness. There is no easy way to get phasing statistics without the parents. You may have to go back and use alignment based phasing statistics, which you could do using SNP based phasing from HiFi and HiC with a caveat that alignment based phasing may be less sensitive for detecting switch errors.

Overcraft90 commented 2 years ago

Hi and thanks,

I do have HiC long range contact information for all my individuals, so I can simply use it in conjunction with the .meryl db for the HiFi reads simply plugging in the respective db in this command:

merqury.sh [ ] [asm2.fasta]

Say, I generate a .meryl mat db and a .meryl pat db based on the HiC reads sets of both parents, and then I use them in the command alongside the main reads db for the individual and Merqury will automatically output for me the blob plots?

Let me know, thanks.

arangrhie commented 2 years ago

You mean HiC from the parents? I would consider that as parental information and try use that for generating mat pat hapmers. The kmers might be slightly less sensitive due to coverage limitations in HiC, but still better than nothing. You could try generating meryl dbs of each parental HiC reads and the child's reads, then run trio/hapmers.sh. You could visually check how representative the hap-mers are present given your child's read set (particularly, check the 1-copy peak).

Using HiFi reads for the child's meryl db is generally ok with the caveat that HiFi sequencing bias would favor HiFi specific errors. Blob plots and switch error doesn't use childs' kmers, so that wouldn't be affected.

Overcraft90 commented 2 years ago

Hi @arangrhie,

Thanks a lot for the detailed response! I will do as you said.

By going back to the data I realised the HiC data is for the individuals and not for the parents; however, three of these individuals are children of a parent-child trio. I think I should be able to get parental data – even Illumina short reads are available –, in this way I should be able to follow-up with evaluating phasing correctness with blob plots.

From your explanation I got that Illumina due to higher coverage are preferable compared to HiC, if so I will proceed and download the Illumina data for parents. I will keep you updated, thanks again.

Overcraft90 commented 2 years ago

Hi again @arangrhie,

I have some updates. Everything seemed to be proceeding smoothly with the execution until I got an error message possibly related to R.

I follow-up with a log file (Merqury.log) so that you can understand. The real issue is that because of this error I cannot even see the plots generated up to now (actually I cannot see any image at all); in fact, it looks like the error arose after the computation of phase blocks. Although this step happens after the Get blob plots, still I'm not able to visualize the output of any prior step...

Now, is this something I can fix; maybe even avoiding to rerun everything from the beginning? Let me know.

For completeness I share the code I used to generate this partial output

!/bin/bash

SBATCH --nodes=1 --tasks-per-node=1 --cpus-per-task=48

SBATCH --time=24:00:00

SBATCH --mem=700000

SBATCH --job-name=blob_plot

SBATCH --output=merqury_parental.out

SBATCH --partition=g100_usr_bmem

SBATCH --account=IscrC_PanSV

source /g100_work/PROJECTS/spack/v0.16/install/0.16.2/linux-centos8-skylake_avx512/gcc-8.3.1/anaconda3-2020.07-l2bohj4adsd6r2oweeytdzrgqmjl64lt/etc/profile.d/conda.sh conda activate /g100_work/IscrC_PanSV/quality_check

module load profile/bioinf module load bedtools/2.30.0 gnu/10.2.0--gcc--8.3.1 samtools/1.13 r/

cd /g100_work/IscrC_PanSV/HG00733_hifi

meryl k=21 count output mat.hapmers.meryl parental/mat* meryl k=21 count output pat.hapmers.meryl parental/pat*

$MERQURY/merqury.sh HG00733.meryl mat.hapmers.meryl pat.hapmers.meryl .hap1.fasta .hap2.fasta test-2

As you can see I loaded some modules, specifically bedtools, samtools and R, since compared to the standard execution without parental data I got a warning message related to those modules. For this reason but also because in my first run there was a mistake, I started over again but this time invoking also those modules.

Is that the source for the error? And if so, why this didn't happen when I was simply using the individual reads set (e.g. only the HG00733.meryl and no parental data).

Overcraft90 commented 2 years ago

Hi I just realised I had an error when loading R,

What I should have done is loading r/4.1.0--gcc--10.2.0-python--3.8.6 and not simply r/. Do you think this might have caused the issue?

In the meantime, I run the script again commenting the meryl.db commands for Illumina parental data and simply launching Merqury.

brianwalenz commented 2 years ago

The error was that R didn't know about the 'argparse' module from https://github.com/trevorld/r-argparse.

In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
  there is no package called ‘argparse’

On my very non-standard system, I installed it manually with install.packages("argparse") (and maybe some other packages as well; I don't remember).

Overcraft90 commented 2 years ago

Thanks @brianwalenz, indeed I thought it was something related to argparse but I was surprised on a HPC cluster they didn't have something so basic as an R package...

It turned out this was the case, so I asked them to install it; however, in the meantime I transferred the db.meryl, the mat.hapmers.meryl and the pat.hapmers.maryl on my machine (in local) where I run the following command:

$MERQURY/merqury.sh 2.HG00733/HG00733.meryl 2.HG00733/mat.hapmers.meryl 2.HG00733/pat.hapmers.meryl 2.HG00733/.hap1.fasta 2.HG00733/.hap2.fasta test-2

But for some reasons the blobs plot is empty... do you have any idea why? I built the maternal and paternal meryl db based on paired-end Illumina reads, as per the command I quoted few posts above. Let me know what do you think, I can attach the plot if that can help.

Overcraft90 commented 2 years ago

Hi again @arangrhie & @brianwalenz,

I realised that in local my machine runs out of memory before even plotting the figures for the blob plots. I've been able to run it on a cluster and got these results:

test-2 hapmers blob

hap1 test-2 HG00733_hifi hic hap1 100_20000 phased_block blob

hap2 test-2 HG00733_hifi hic hap2 100_20000 phased_block blob

Is this normal? I've assembled with both HiFi data and Hi-C information using HiFiasm, so I should have haplotype-resolved assemblies (this looks like more a squashed assembly from Canu or FALCON) for both parental sequences... why does this happen?

Could it be because I generated the mat.hapmers.meryl and the pat.hapmers.meryl with the following command

meryl k=21 count output mat.hapmers.meryl parental/mat* meryl k=21 count output pat.hapmers.meryl parental/pat*

without doing the

meryl union-sum output mat.hapmers.meryl mat.hapmers* meryl union-sum output pat.hapmers.meryl pat.hapmers*

Let me know, thanks!

arangrhie commented 2 years ago

Hello @Overcraft90,

Not sure what parental/mat_* and parental/pat_* are, I assume some sort of parental Illumina fastq files? The mat.hapmers and pat.hapmers are the inherited, parental specific kmers, not the entire parental kmer set.

Do something like

meryl k=21 count output maternal.meryl parental/mat* meryl k=21 count output paternal.meryl parental/pat*

Then

$MERQURY/trio/hapmers.sh maternal.meryl paternal.meryl child.meryl

This will generate the proper *.hapmers.meryl that should be used for this evaluation.

It shouldn't matter to do the count or union-sum. If so, please open an issue on meryl.

More details: https://github.com/marbl/merqury/wiki/3.-Phasing-assessment-with-hap-mers

Overcraft90 commented 2 years ago

Hi there @arangrhie,

Thanks a lot! I realised I did a stupid mistake, I haven't run Merqury in the trio/hapmers.sh mode... I will fix this and let you know.

Yes, parental/ is the directory where I have the forward and reverse paired-end Illumina sequencing for both the mother and the father of the HG00733 individual.

Overcraft90 commented 2 years ago

The execution just finished; however, I'm now left with several files in the form *.hapmers.meryl and I'm not quite sure which one I should use in the command

$MERQURY/merqury.sh HG00733.meryl .hapmers.meryl .hapmers.meryl .hap1.fasta .hap2.fasta test-3

in order to get back proper blob plots. Thanks in advance!

P.S. I attach a screenshot for clarity (my guess would be the _mat.hapmers_not_pat. and the pat.hapmers_not_mat._; nonetheless, I think it is better to ask just to make sure). By the way, should I change also the child.meryl with something different in the command above? Screenshot from 2022-06-01 09-25-28

Overcraft90 commented 2 years ago

Hi there,

In the meantime I had a look at similar Issues in trying to find an answer, but I couldn't see any clear indication on what files to feed in the merqury.sh in order to get the correct blob plots.

Let me know, thanks!

Overcraft90 commented 2 years ago

Hi again,

Just wondering whether you have any advice on what to do.

Thanks!

Overcraft90 commented 2 years ago

Hi,

any news about the input data in order to get the blob plots? I got a bunch of output files from

$MERQURY/trio/hapmers.sh maternal.meryl paternal.meryl child.meryl

so, I was just wondering which ones to use (see pic above). Thanks!

Overcraft90 commented 2 years ago

Good news,

I finally generated the plot/s I needed. I reviewed the documentation and found out what files I had to use to get the phased blob plots. Here the output I was looking for:

hap1 test-3 HG00733_hifi hic hap1 100_20000 phased_block blob

hap2 test-3 HG00733_hifi hic hap2 100_20000 phased_block blob

Well, thanks a lot for all your help. I will be closing this issue for now.

arangrhie commented 2 years ago

Hello, glad you figured this out. Apologies for not getting back in time.