marbl / merqury

k-mer based assembly evaluation
Other
272 stars 19 forks source link

ccs.bam, fasta assemblies and meryl dataset #76

Closed Overcraft90 closed 2 years ago

Overcraft90 commented 2 years ago

Hi I have a quick question,

Unfortunately, I'm dealing with some ccs.bam files for which it is not trivial to get back to the fastq counterpart...; however, on S3 they have been made available also the diploid assemblies for this human individual.

So, my question is what would be in this case the best way to go in order to build the db meryl? In fact, one way would be to convert bam to fastq (which is causing me some troubles), on the other hand for the same individual are also available CLR data for which they have been made available fastq files. Would it be correct to use CLR to build the db meryl when I then use assemblies (both maternal and paternal) obtained form HiFi data?

Thanks in advance.

skoren commented 2 years ago

You cannot use CLR reads for merqury evaluations, their k-mer distributions are dominated by errors.

Luckily, you can can convert a bam file to fasta or fastq with samtools and awk:

samtools view reads.ccs.bam |awk '{RQ=-1; for (i = NF; i > 11; i--) { if (match($i, "rq:")) RQ=substr($i, 6, length($i)); } if (RQ >= 0.99) { print "@"$1; print $10; print "+"; print $11; }}'  > reads.ccs.fastq

or

samtools view reads.ccs.bam  |awk '{RQ=-1; for (i = NF; i > 11; i--) { if (match($i, "rq:")) RQ=substr($i, 6, length($i)); } if (RQ >= 0.99) { print ">"$1; print $10; }}' |fold -c > reads.ccs.fasta

This will confirm the read is Q20 (RQ >= 0.99) and dump its sequence.

Overcraft90 commented 2 years ago

Hi @skoren,

Thank you so much! I will try and let you know.

Overcraft90 commented 2 years ago

Thank you so much,

I have been able to retrieve HiFi reads from my noisy CLR data. Thanks a lot for sharing these two short script! I think we can close the issue.