marbl / meryl

A genomic k-mer counter (and sequence utility) with nice features.
115 stars 13 forks source link

How to get haplotype-specific reads? #52

Open DayTimeMouse opened 1 month ago

DayTimeMouse commented 1 month ago

Hi,

Thanks for developing this awesome tools. I'd like to get haplotype-specific HiFi reads via haplotype-specific kmers.

I run script to get haplotype-specific kmers, meryl version is 1.3: meryl count k=21 hap1.fa output hap1.meryl meryl count k=21 hap2.fa output hap2.meryl meryl difference hap1.meryl hap2.meryl output hap1.only.meryl meryl difference hap2.meryl hap1.meryl output hap2.only.meryl meryl intersect hap1.meryl hap2.meryl output hap1_hap2_intersect.meryl

get read-meryl: meryl count k=21 hifi_reads.fq.gz output hifi.meryl

I want to partition HiFi reads(each read) into two haplotypes. Now, I do not know how to do next? meryl can do this? Or, do you have some ideas for this?

Thanks in advance.

arangrhie commented 1 month ago

Hi @DayTimeMouse,

One fundamental question is, how confident you are with your hap1.fa and hap2.fa. Looks like your hap1.meryl and hap2.meryl are reflecting what's unique in each assembly, not the biologically inherited parental specific kmers.

Anyways, you could try meryl-lookup -existence -sequence hifi_reads.fq.gz -mers hap1.meryl hap2.meryl > hifi_reads.hapmer.count. This will provide count of kmers shared (found) in each read sequence.

Small explanation from the help message:

...
     The output file has format:
         <sequence_name> <tab> <kmers_in_sequence> <tab> <kmers_in_db> <tab> <kmers_shared> [...]

     With one input database:
         sequence1 <tab> 8415 <tab> 12856825 <tab> 8145

     With two input databases:
         sequence1 <tab> 8415 <tab> 12856825 <tab> 8145 <tab> 575757256 <tab> 8354

You could bin the reads based on the hap1 and hap2 mers found in each fastq sequence. Usually, it's expected to find kmers from one haplotype and a few from the other haplotype, due to possible noises in the reads. There may be reads with no hapmers if the two haplotypes share a large homozygous block. If there are any switch errors in the assembly, you'd see reads with substantial kmers with both hap kmers.

Alternate option is to use trio-binning implemented in Canu, but this might fit better if you are actually using hapmers inferred from the parental genomes as it normalizes based on the kmers found in the dbs.

/path/to/canu/build/bin/splitHaplotype \
 -cl 1000 -memory 48 -threads 24 \
 -H ../hap1.meryl 1 hap1.fasta.gz \
 -H ../hap2.meryl 1 hap2.fasta.gz \
 -A ./unk.fasta.gz \
 hifi_reads.fq.gz

-cl 1000 will filter out reads < 1kb.

DayTimeMouse commented 1 month ago

But I do not have real parental genomes, I used hifiasm(HiFi + ONT +HiC) to get hap1.fa and hap2.fa. In this case, do you still recommand me to use Canu(trio-binning splitHaplotype)? Canu(trio-binning) and count hapmers to partition reads via meryl-lookup, which method is fit better for me?

arangrhie commented 1 month ago

Then using meryl-lookup might work the best. Use meryl v1.4.1 by the way, there were some improvements / fixes made particularly in meryl-lookup.

DayTimeMouse commented 1 month ago

Many thanks!

DayTimeMouse commented 1 month ago

Hi,

I have another question. How to get haplotype-specific HiC reads(paired-end)? Using haplotype-specific kmers?

Best regards.