What pipeline should I use?

Yutang-ETH commented 3 years ago

Hi,

I have 77 gb filtered ONT long reads (30x coverage of my target genome), now I would like to know if there are chimeric reads and if so I would like to either split/scrub them or discard them. So, what pipeline would you recommend in terms of using fpa and yacrd? I am confused what's the difference between split and scrub? To me, it looks like scrub also split chimeric reads and does some extra trimming, is this correct?

Thank you very much.

Best wishes, Yutang

natir commented 3 years ago

Difference between chimeric read splitting and scrubbing is more on target of task than method to reach this target.

yacrd is build on idea when two read are merge, they create a part of sequence not present in genome, this type of error is rare and not systematic so we found this sequence only one time. So chimeric read contain a low covered region in middle. Split low covered region is just a solution to manage chimeric read.

Idea of scrubbing is region with high error rate create many trouble, generate error in analysis increase analysis time. With long read technology error "purely" random, error are rare event and generate a part of sequence not present in genome. So high error rate region have low coverage. And we can remove it.

Chimeric read and high error rate region, can be detect and manage by same method but it's not the same thing, it's same algorithm but with different target and different parameter, and also split not remove low covered region at ends of reads.

looks like scrub also split chimeric reads and does some extra trimming, is this correct?

It hurts a little to admit it but it's true

yacrd output in scrubbing mode or chimeric split mode, use same method to label read (take care result could be different because parameter is different).

Now come back to your trouble.

I'm not sure yacrd is good for your trouble. yacrd memory usage is linear with number of overlap in #41 you indicate you have more than 1.5 TB of overlap. yacrd didn't store all overlap information but it's quite large. yacrd have a on disk mode but this mode take many more time.

I can propose two pipeline, a conservative and a rough

The conservative, we remove sequence, based on overlap we can trust, where coverage is 0.

minimap2 -x ava-ont reads.fasta reads.fasta | fpa drop -l 1000 > overlap.paf
yacrd -i overlap.paf -o reads.yacrd split -i reads.fasta -o split.fasta

on-disk mode version:

mkdir yacrd_tmp/
minimap2 -x ava-ont reads.fasta reads.fasta | fpa drop -l 1000 > overlap.paf
yacrd -i overlap.paf -o reads.yacrd  -d yacrd_tmp/ split -i reads.fasta -o split.fasta

version where we remove region with coverage lower than 2.

yacrd -i overlap.paf -o reads.yacrd -c 2 split -i reads.fasta -o split_c2.fasta

The rough one remove any region look like bad (I reuse best parameter found in this (publication)[https://doi.org/10.1093/bioinformatics/btaa262]) and extremity of read.

minimap2 -x ava-ont -g 500 reads.fasta reads.fasta | fpa drop -l 1000 > overlap.paf
yacrd -i overlap.paf -o reads.yacrd -c 4 -n 0.4  split -i reads.fasta -o split.fasta

on-disk mode version:

mkdir yacrd_tmp/
minimap2 -x ava-ont reads.fasta reads.fasta | fpa drop -l 1000 > overlap.paf
yacrd -i overlap.paf -o reads.yacrd -c 4 -n 0.4 -d yacrd_tmp/ split -i reads.fasta -o split.fasta

You can compress your overlap file if you want. You can also save your minimap2 output and after filter it with fpa.

To reduce paf disk usage you could also rename your read, original nanopore id use space for nothing, and this id is repeat two time for each paf entry.

Did you remove sequencing adapter? yacrd scrubb mode (rough pipeline) perform a pretty good job on this.

I hope I'm clear and I answer all your question, if you have any other question please ask.

natir commented 3 years ago

If you have reference genome or a not too fare reference genome you can try Alvis don't trust yacrd result present in publication they didn't use the good parameter

Yutang-ETH commented 3 years ago

Hi,

Thank you very much for your very detailed answer, it is very helpful. I will try what you proposed. The thing is we don't have a reference genome, we are aiming to create one. However, recently, when I aligned my ONT long reads to the assembly made from the same ONT data, I found some reads were soft clipped. I found some very long reads only half was mapped to the assembly and the other half was soft clipped. This is very strange to me and that's why I want to check if we have chimeric reads. If we do have chimeric reads, split them would be good for me. Thank you very much for your help.

Best wishes, Yutang

natir commented 3 years ago

A solution to check if these half mapped read are chimera faster, you could try this:

# create target.lst a file with id of all your suspicious read, one by line
minimap2 -x ava-ont reads.fasta reads.fasta > all_overlap.paf
grep -f target.lst all_overlap.paf > overlap_of_suspicious.paf
yacrd -i overlap_of_suspicious.paf -o reads.yacrd split -i reads.fasta -o split.fasta

I save all overlap in all_overlap.paf to be allow to play with contains of target.lst, mapping will take many time, but other step could be faster (around minute), so you can play with this step easier.

Yutang-ETH commented 3 years ago

Hi,

Sorry for my late reply. Last week I tried the suggested pipeline but at the first step (all-vs-all alignment), it took too long for my read dataset and also the paf file was too large to contain on my disk, so I gave up.

Anyway, thank you very much for your kind help.

Best wishes, Yutang

natir commented 3 years ago

Yacrd has really problems to work on very large datasets. Unfortunately the only solution would be to change the method, I have ideas for this but I lack time to realize them. Sorry I couldn't help you.

Yutang-ETH commented 3 years ago

No problem. Please don't feel sorry and I believe you will improve it in the future. Cheers, Yutang

natir / yacrd

What pipeline should I use? #45