mritchielab / FLAMES

A framework for performing single-cell and bulk read full-length analysis of mutations and splicing.
https://mritchielab.github.io/FLAMES/
GNU General Public License v3.0
17 stars 9 forks source link

Include gene name in mutation calling output file #19

Closed HenriettaHolze closed 5 months ago

HenriettaHolze commented 9 months ago

Hi, I ran the mutation calling with FLAMES but would like to get an output file where SNP positions are matched to the gene ID or gene symbol. This information should be available since every read has been mapped to a gene by minimap2. E.g. as additional column in the allele_stat.csv.gz output file.

I ran the mutation calling using the python functions as follows:

conda activate /home/hholze/.cache/R/basilisk/1.8.1/FLAMES/1.7.4/flames_env
cd ~/R/x86_64-pc-linux-gnu-library/4.2/FLAMES/python/
python3

import os
from sc_mutations import sc_mutations
from bam_mutations import parse_gff_tree, get_gene_flat, get_gene_blocks, get_all_SNV_table

sc_mutations(
    fa_f="/data/reference/cellranger/refdata-cellranger-GRCh38-3.0.0/fasta/genome.fa",
    bam_short=False,
    out_dir="preprocessed/DV_0427_3_flames",
    barcode_tsv="preprocessed/barcodes_DV_0427_3_trimmed.tsv",
    gff_f="/data/reference/cellranger/refdata-cellranger-GRCh38-3.0.0/genes/genes.gtf",
    report_pct=(0.1, 0.9)
)

Lastly, it would be useful to have the gene symbol in the gene_count.csv output file of FLAMES, next to the gene ID.

Cheers!

ChangqingW commented 6 months ago

Hi Henrietta,

I reimplemented the mutations stuff and it is now split into two parts: a mutation identification function find_variants that treat the long-reads as bulk and find mutations, returning a tibble with gene name, position, allele, read counts and a local homopolymer percentage; a mutation caller at the single-cell level, sc_mutations that takes positions and count the number of reads supporting each allele at the positions for each single-cell barcode, also returning a (massive) tibble that have the cell barcode, allele, read counts and depth columns. Apart from native multithreading in R, I think this gives more control and information for the variant calling part, where you can filter potential mutation with the tibble yourself before counting reads at single-cell level, and call mutation status for single-cells with your own rules instead of some arbitrary simplistic threshold. I wonder if you think this is over-complicating things or actually helpful as a user. These functions are currently only available on Github, you can install with BiocManager::install("mritchielab/FLAMES") to try them out.