natir / yacrd

Yet Another Chimeric Read Detector

MIT License

interpreting results #44

Closed colindaven closed 3 years ago

colindaven commented 3 years ago

Hi,

so the tool ran easily - thanks - but I am a little concerned with the results.

wc -l *.yacrd
1964840 iddm_report.yacrd

grep -c Chimeric iddm_report.yacrd
114108

grep -c NotBad iddm_report.yacrd
454940

grep -c NotCov iddm_report.yacrd
1395792

As I understand it, out of 1.9M reads, only 454k are NotBad and can therefore be used in further analyses? From my work so far with the unfiltered data (WGS rat, just genomic alignments), I think most reads are pretty decent.

Or should I be happy with the NotCov reads?
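
For reference, the same tally can be done in one pass instead of three separate greps. This is a minimal sketch in Python, assuming each report line starts with the read label as its first whitespace-separated field (with `grep NotCov` assumed to be matching a `NotCovered` label); the function names `tally_labels` and `fractions` are made up for this example.

```python
from collections import Counter


def tally_labels(lines):
    """Count the first whitespace-separated field of each non-empty line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


def fractions(counts):
    """Turn raw counts into fractions of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


if __name__ == "__main__":
    # Counts reported in this thread (label name for NotCov is an assumption).
    reported = Counter(
        {"Chimeric": 114108, "NotBad": 454940, "NotCovered": 1395792}
    )
    for label, frac in sorted(fractions(reported).items()):
        print(f"{label}\t{reported[label]}\t{frac:.1%}")
```

With the counts above this gives roughly 5.8% Chimeric, 23.2% NotBad and 71.0% NotCov, and the three counts add up to the 1964840 lines of the report.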

Commands:


srun -c 16 minimap2 -t 16 -x ava-ont -g 500 iddm_30kbp_3325_comb.fastq.gz iddm_30kbp_3325_comb.fastq.gz > iddm_overlaps.paf &
yacrd -i iddm_overlaps.paf -o report.yacrd -c 4 -n 0.4 scrubb -i iddm_30kbp_3325_comb.fastq.gz -o iddm_30kbp_3325_comb.fastq.gz.scrubb.fasta

natir commented 3 years ago

Hi, thanks for your interest in yacrd.

> As I understand it, out of 1.9m reads, only 454k are NotBad and can therefore be used in further analyses ?

To a first approximation I would say yes.

But it is possible that the recommended parameters are a bit too strict for your data.

Based on your message, I guess your data is genomic nanopore R9.4 from rat.

What is your coverage? What is your error rate?

colindaven commented 3 years ago

Thanks for that.

Coverage is about 3X; I also have another dataset at about 7X.

It's ONT 9.4.1; from memory, the accuracy is about 92%.

natir commented 3 years ago

OK, it's clearer now. The recommended parameters were determined on datasets with coverage around 30x and 60x. I will add this information to the README; thanks for the bug report.

If you just want to detect chimeras, I think you should run:

minimap2 -x {corresponding preset} {your other parameter} reads.fq reads.fq > overlap.paf
yacrd -i overlap.paf -o reads.yacrd

I don't think scrubbing a read dataset with such low coverage is a good idea. There is already not enough data for an assembly, so removing more information won't help. But if you want to try, I think you should lower the minimum coverage to 1 (-c 1).

colindaven commented 3 years ago

OK, thanks. I'll just do the chimeric read detection. This was just a MinION test anyway; I won't be performing assemblies on these datasets.

natir commented 3 years ago

With coverage this low, I think yacrd can generate some false positives. There is a high chance that a region of the genome was sequenced only once, and yacrd can't tell the difference between that kind of read and a chimera.
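
To put a rough number on this, here is a back-of-the-envelope sketch. It assumes per-base depth follows a Poisson distribution with mean equal to the average coverage (a common simplification, not something yacrd itself computes), and it loosely reads the -c threshold as "fewer than c overlapping reads". The function name `prob_depth_below` is made up for the example.

```python
import math


def prob_depth_below(mean_coverage, min_depth):
    """P(depth < min_depth) when depth ~ Poisson(mean_coverage)."""
    return sum(
        math.exp(-mean_coverage) * mean_coverage**k / math.factorial(k)
        for k in range(min_depth)
    )


if __name__ == "__main__":
    # Coverages mentioned in the thread (3x, 7x) and the 30x used for tuning.
    for cov in (3, 7, 30):
        for c in (1, 4):
            p = prob_depth_below(cov, c)
            print(f"coverage {cov}x, depth < {c}: {p:.3f}")
```

Under this simplified model, roughly 65% of positions have depth below 4 at 3x and about 8% at 7x, while at 30x the fraction is negligible. That is at least consistent with most reads being reported as NotCov here with -c 4.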

If you have a good reference genome, I think mapping the reads to the reference is a better way to detect chimeras. Alvis should help you, but don't trust the yacrd results presented in their publication; they made a small mistake :smiley:.

If you have any other questions, please ask.
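
One way to sketch the reference-based check: map the reads to the reference with minimap2 in PAF format, then flag reads whose alignments cover disjoint parts of the read and land on different reference sequences, or far apart on the same one. This is only a rough heuristic written for illustration, not what Alvis or yacrd do. The function names and the `min_gap` cutoff are assumptions; the code relies only on the standard PAF columns (query name, query start/end, target name, target start/end).

```python
from collections import defaultdict


def parse_paf(lines):
    """Yield (query, q_start, q_end, target, t_start, t_end) from PAF lines."""
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:  # PAF has 12 mandatory columns
            continue
        yield f[0], int(f[2]), int(f[3]), f[5], int(f[7]), int(f[8])


def candidate_chimeras(lines, min_gap=100_000):
    """Return names of reads with two alignments on disjoint read segments
    that hit different targets, or the same target more than min_gap bases
    apart (min_gap is a hypothetical cutoff, tune it for your data)."""
    by_read = defaultdict(list)
    for q, qs, qe, t, ts, te in parse_paf(lines):
        by_read[q].append((qs, qe, t, ts, te))

    flagged = set()
    for read, alns in by_read.items():
        alns.sort()  # order alignments by position on the read
        for i in range(len(alns)):
            for j in range(i + 1, len(alns)):
                a, b = alns[i], alns[j]
                if a[1] > b[0]:  # segments overlap on the read
                    continue
                different_target = a[2] != b[2]
                # Distance between the two target intervals (<= 0 if they overlap).
                far_apart = max(a[3], b[3]) - min(a[4], b[4]) > min_gap
                if different_target or far_apart:
                    flagged.add(read)
    return flagged
```

Real structural variants produce the same split-alignment signature, so flagged reads are candidates to inspect, not confirmed chimeras.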