mritchielab / FLAMES

A framework for performing single-cell and bulk read full-length analysis of mutations and splicing.
https://mritchielab.github.io/FLAMES/
GNU General Public License v3.0

UMI usage in genotyping #40

Closed maxim-h closed 4 months ago

maxim-h commented 4 months ago

Hi,

I have a question about the genotyping of single cells with the sc_mutations function. Are UMIs used there to collapse duplicates or correct sequencing errors? If not, is it a planned feature?

I see that the UMI is read in the underlying variant_count_matrix_cpp functions, but I don't see that information being used anywhere downstream.

ChangqingW commented 4 months ago

The plan is for flexiplex (https://github.com/DavidsonGroup/flexiplex) to do consensus read calling and collapse the UMIs, so nothing will need to change in sc_mutations. (FLAMES uses a built-in Rcpp version of flexiplex for de-multiplexing when barcode lists are given.)

maxim-h commented 4 months ago

Thank you for the clarification. A few follow-up questions:

  1. At the moment in FLAMES I use align2genome.bam to retrieve genotypes. Are UMIs used in any way to adjust the counts, e.g. by skipping reads whose UMI has already been observed?
  2. flexiplex creates a FASTQ file, so another re-alignment step would be needed, right? Currently the built-in version creates a matched_reads_dedup.fastq file in the output directory. Are UMIs already collapsed there? And it's not used for sc_mutations yet, right?
  3. Would flexiplex be able to correct sequencing errors in UMIs, or at least collapse them within a certain edit distance? And more importantly, would it be able to correct errors in the nanopore reads themselves to improve confidence in the called alleles?

ChangqingW commented 4 months ago

  1. Currently, the only step where UMIs are used is gene quantification: among reads with the same UMI, it keeps the longest read and discards the rest. The deduplicated reads are saved as matched_reads_dedup.fastq and used in later steps. This could obviously be done in a more sophisticated way.
  2. Flexiplex is planned to do reference-free consensus read calling using only the FASTQs. Once that is implemented, we will have the de-multiplexed and de-duplicated reads before the initial alignment, and align2genome.bam will be built from the de-duplicated reads, so nothing needs to be updated in sc_mutations.
  3. I am not sure, but I would assume they will allow some edit distance. Yes, they will use majority voting (possibly weighted by quality score) to correct sequencing errors.
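
To make points 1 and 3 concrete, here is a minimal Python sketch of the two ideas: keeping the longest read per UMI, and majority-vote consensus. It is not the FLAMES or flexiplex code. The `Read` type and the `dedup_keep_longest` and `majority_vote` names are hypothetical, and two assumptions are made that the thread does not confirm: reads are grouped by (cell barcode, UMI) pair, and the sequences passed to the voting step are already aligned to equal length.

```python
from collections import Counter
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical record for a de-multiplexed read (not a FLAMES type)."""
    barcode: str  # cell barcode
    umi: str      # unique molecular identifier
    seq: str      # read sequence


def dedup_keep_longest(reads):
    """Keep one read per (barcode, UMI) pair: the longest one.

    Ties are resolved in favour of the read seen first.
    Grouping by barcode as well as UMI is an assumption of this sketch.
    """
    best = {}
    for read in reads:
        key = (read.barcode, read.umi)
        if key not in best or len(read.seq) > len(best[key].seq):
            best[key] = read
    return list(best.values())


def majority_vote(aligned_seqs):
    """Column-wise majority vote over equal-length, pre-aligned sequences.

    Unweighted: every read counts once. Weighting by base quality,
    as speculated above, is not implemented here.
    """
    if len({len(s) for s in aligned_seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_seqs)
    )
```

Real nanopore reads with the same UMI differ in length and contain indels, so an actual consensus caller would need an alignment step before any vote; the sketch skips that on purpose.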