Duplicate reads and read splitting option in MinKNOW

kaanokay commented 1 year ago

Hi,

I obtained fastq and bam files from PrometION device and I realized that ~2.4% of reads were duplicate. They have the same read length and read id. I used MinKNOW software with default where read splitting option was turned on. Is there any relationship between read splitting option in MinKNOW software and getting duplicate reads?

Another reason to getting duplicates maybe related to how merging small bam files into one bigger bam file by using samtools cat or samtools merge, but in both case, I got the same number of duplicates and the same read ids.

Do you have recommendation for this in terms of downstream analysis regarding/regardless of duplicate reads.

Kind regards.

ollenordesjo commented 1 year ago

Hi @kaanokay,

Thanks for raising this. The way read splitting works in minknow is that it'll search for adapter in each read. If it finds a match, it will split out the left/right children and give them new read_id values. The duration of the calls and the length of the reads should be adjusted. The parent_read_id values however will remain the same for both of them. I'm wondering if this is playing a part in where you are finding duplicates?

Could you share the method you used for detecting duplicates? I would be keen to take a look at it to see if there's a simple explanation for it (there should be hopefully!).

Best regards

kaanokay commented 1 year ago

Hi @ollenordesjo,

Thank you for explanation.

I merged bam files by samtools merge (samtools merge -n -c -@ 128 -o merged_files.bam bam_pass/*.bam).
I converted bam files to fastq files by samtools (samtools fastq -TMM,ML --threads 128 merged_files.bam > file.fq) .
I searched duplicates in fastq files by seqkit (seqkit rmdup file.fq -n -D duplicate_statistics.csv -o duplicate_removed_file.fq).

Third step gave me list of duplicate ids. Duplicates were the same in terms of their length, full read id, and sequences.

When I looked into fast5 files, bam files, and fastq files, both bam and fastq files have duplicate reads but all reads in fast5 file are unique. I think, something happens during basecalling to create duplicates.

When I also changed merging step from samtools merge to samtools cat, I got the same results in terms of duplicates.

The most of reads had two copies but I got also duplicates with 3 or 4 copies so if read-splitting creates two subreads, this duplicate read issue does not come from read-splitting.

Recommadation from ONT bioinformatics team was turning off read-splitting option in MinKNOW software and see how duplicate issue looks like.

Thank you for interested.

Bw.

ollenordesjo commented 1 year ago

Thanks for additional information @kaanokay.

Sorry for taking a while to respond as well. I'm currently investigating. I noticed that -o in samtools merge has been deprecated, so wanted to check which version of samtools this was on?

Thanks!

ollenordesjo commented 1 year ago

Also, I can confirm I'm seeing the same effect, triaging this internally. Thanks for spotting!

kaanokay commented 1 year ago

Hi @ollenordesjo,

I used samtools v1.15, but latest is v1.17. I'll upgrade it and will merge bam files again by using latest version.

I'll let you know when I done it again.

Thank you for your interest!

ollenordesjo commented 1 year ago

Thanks for confirming @kaanokay!

It's been reproduced, and we've spotted the cause of the issue. It should be fixed in an upcoming release.

Things to note:

It may happen when writing BAMs from MinKNOW
Original fastqs should not be duplicated by this issue
It should only affect reads which have been split using basespace adapter-splitting

Thus, the recommended solution for now is to deduplicate bam files using for example seqkit rmdup as you have done.

Thanks for reporting this!

kaanokay commented 1 year ago

Hi @ollenordesjo,

I want to re-merge bam files with the latest version of samtools. I'm wondering what will change when I upgrade it.

If these duplicates arise from read-splitting, should I turn it off in MinKNOW? If I turn off this option, I'll have chimeric reads that may align different regions of the genome, so I'm not sure about it.

Thank you so much for your detailed and quick feedback. It's really helpful for me.

Bw.

ollenordesjo commented 1 year ago

Hi @kaanokay,

Sorry for red herring with the suggestion to update samtools, I thought best to make sure there was no issue with an earlier samtools version which was already fixed.

Regarding what to do going forwards, I would recommends sticking with read-splitting on in minknow, it does help with avoiding chimeric reads now that our capture rate is so high that there is barely any down-time between reads (basespace adapter search is one way to address this fortunate and unfortunate effect).

So: 1) Do keep read splitting on in minknow 2) Deduplicate bam files.

I would recommend using samtools markdup (which lets you avoid the fastq round-trip), note it needs sorted input:

samtools markdup -@ 32 merged_files_sorted.bam -r -s merged_files_sorted_dedup.bam

kaanokay commented 1 year ago

@ollenordesjo, thank you for those suggestions. I'll consider them in analysis.

Btw, what is mean of fastq round-trip?

nanoporetech / duplex-tools

Duplicate reads and read splitting option in MinKNOW #39