mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs
Other
791 stars 168 forks source link

Assembled genome contains variants not supported by reads #387

Open elcortegano opened 3 years ago

elcortegano commented 3 years ago

Hi, we have noted that assemblies generated with Flye very frequently introduce variants that are not supported by the reads, particularly at highly repetitive regions (but also in others as well). One example is show below, were no reads support the G -> A variant.

Screenshot from 2021-05-25 17-07-19

The first grey line is the assembly obtained with Flye, and the second one is an assembly obtained with a different assembler. The following in the bottom are the reads.

Using assembly-to-reference variant calling with paftools and reviewing calls against the assemblies and the reads, we observe a false positive rate that exceeds the 80%. These false positive calls will not be called by variant callers based on read-to-reference assemblies.

We wonder why could this happen, and if it can be solve some way,

Our reads come from PacBio HiFi. Genomes were assembled with:

flye --pacbio-hifi <file.fastq> -t 64 -g 111.1M -o <sample id>

Postprocessing was limited to purge_dups to remove duplicates following https://github.com/dfguan/purge_dups. Scaffolds were also broken into contigs. No polishing (e.g. arrow or pilon) was used for our data.

mikolmogorov commented 3 years ago

Hi,

Very likely related to the repetitive nature of these regions. Flye is using secondary alignment to error-correct repetitive regions, but in this case the reads from different repeat copy might have outvoted the true reads. If you have bam alignment with secondary reads and enable them in IGV, you should be able to see reads supporting the alternative nucleotide.

But otherwise this is indeed an issue. We have recently made an update in alignment selection logic, which I hope should fix this issue. Would you mind trying the latest version from the flye github branch? You can restart your assembly from the polishing phase.

elcortegano commented 3 years ago

Hi @fenderglass , thank you for your reply,

We used version 2.8.2-b1689 for the assembly. Just for clarity, do you mean using 2.8.3 instead? or a develop branch? Thanks.

mikolmogorov commented 3 years ago

You need to get and compile the latest code from github, flye branch (which is default). This code is not a part of a release yet. Instructions are here: https://github.com/fenderglass/Flye/blob/flye/docs/INSTALL.md#local-building-without-installation

elcortegano commented 3 years ago

I can confirm that some variants disappear in the assemblies generated with the latest version (including the one shown as example above), but it is only a small fraction, and most other false positives remain.

Is there a way to avoid the use of secondary reads here? there would be any drawbacks in doing so? We usually remove these from read-to-reference alignments.

Thanks

mikolmogorov commented 3 years ago

@elcortegano with the latest code Flye should only use lower confidence alignment, if the coverage of "reliable" alignments (that is, non-secondary, non-supplementary, MAPQ>30) is low.

Do you use MAPQ filter for your alignments? In repetitive regions, some reads that map ambiguously to multiple repeat copies will be arbitrarily assigned to one copy as "primary", so it is also necessary to filter out alignments with low MAPQ (e.g. <10, <20 etc; Flye is using a conservative <30 cutoff).

Could you post a couple examples with IGV view of the problematic regions? Preferably zoomed out (e.g. several kb region), coverage track and with secondary alignments. This should tell us if there is any issue in the alignment selection logic.

elcortegano commented 3 years ago

We do not use MAPQ filter. Alignments for the reads are done with the recommended PacBio tool pbmm2:

pbmm2 align --num-threads 32 --preset HIFI --sort -c 0 -y 70 --sample sample_name reference.fa reads.fastq output.bam

Next, alignments are sorted with samtools sort, and filtered to remove secondary alignments (with samtools view -bh -F 256). I have re-aligned one of our samples without the step to remove secondary alignments. Stil, cannot see reads that support the variant generated (see below). Wonder if its because pbmm2 is already disregarding these reads anyway.

Snapshots for different levels of zooming below. The top line is the assembly with version 2.8.3, and below are reads mapped with pbmm2, as before, and now without using the -F 256 flag on samtools view:

Screenshot from 2021-05-31 11-17-22

Screenshot from 2021-05-31 11-15-11

Screenshot from 2021-05-31 11-18-31

Is that ok? Do you have any tool preference for generating these alignments including secondary reads?

There are other examples in non-repeated regions. Should I submit one of these as well?

Thanks

mikolmogorov commented 3 years ago

@elcortegano yes, according to documentation pbmm2 discards all supplementary alignments. I suggest doing the same analysis with minimap2 with -ax map-pb -p 0.5 -z 1000.

elcortegano commented 3 years ago

Done, but it only minimally affects the appearance of the alignments.

Screenshot from 2021-06-01 00-49-46

mikolmogorov commented 3 years ago

Well, I guess one last thing to check is if showing secondary (and supplementary) alignments is enabled in your IGV settings. Check both "alignment" and "third gen" tabs. It seems suspicious to me that there are no secondary alignments within repetitive regions, provided that the copies are similar enough.

Otherwise I'm out of ideas. If you want, you can share the assembly / reads and I can take a look as well.

elcortegano commented 3 years ago

IGV is set to allow secondary and supplementary alignments.

For our purposes, we will use variant callers on the reads, and calls from the assembly as one step to confirm these calls. So it is not a big deal to get these false positives. Hopefully this is something that will eventually be fixed in future versions.

Thanks for all the feedback and support.

WeichenZhou commented 6 months ago

We met the same issue using Flye thus it introduces many false positive SNV calls from contig based method.