velocyto-team / velocyto.py

RNA velocity estimation in Python
http://velocyto.org/velocyto.py/
BSD 2-Clause "Simplified" License
159 stars 83 forks source link

Artificial gene detection problem #233

Open valnymartin opened 4 years ago

valnymartin commented 4 years ago

Hello, I ran into the problem with detection of tdTomato in created .loom file. I added tdTomato into my ref genome. I am able to detect it in raw output from cellRanger. However, after running velocyto and using Seurat wraper to convert output .loom file into seurat object I have 0 expression of tdTomato in all cells.

Do you have any idea what can be wrong?

Thank you

Martin

JonETJakobsson commented 4 years ago

Hi! I run into the same issue with ERCC genes. I've appended the ERCC information to the main GTF file including; Chromosome, Source, Feature, Start, End, Score, Strand, Frame, gene_id, transcript_id and exon_number.

After genome indexing and read align, I have reads mapping to the ERCC genes (using quantMode GeneCounts in STAR). However with:

velocyto run-smartseq2 -o velocyto -e $file_name "$wells_path"*/*.bam "$genome_dir"

The ERCC genes have 0 values in all layers of the loom file. Other genes works just fine.

Are there any other obligatory fields needed in the GTF file to make velocyto accept them?

Kindly, Jon

mliiv commented 4 years ago

Hi,

this might help some here: http://velocyto.org/velocyto.py/tutorial/cli.html#preparation

Requirements on the input files

But even when following this, I have a similar issue with an opposite outcome. After merging two gtf files, I am getting read counts for just one of them after the velocyto run. Don't know if the developers are any more actively replying to issues.

Cheers, Maria

JonETJakobsson commented 4 years ago

I've been experimenting a bit more, and I can get velocyto to recognize some genes that I added manually. These were located in a fasta file containing a full virus genome, so the annotated genes were located with flanking non coding sequences. Maybe velocyto use the flanking sequences for something, and is not happy when the genes are covering a whole chromosome?

This is the case for all other genes I have added. they start with ATG and the gene covers the full sequence.

Kindly, Jon

mliiv commented 4 years ago

Hi Jon,

thanks for your advice! Do you run the genes you add manually in a separate gtf and then catenate different gene sets across your cells later or did you manage to combine all transcripts of interest into one gtf ref for velocyto? Atm, I am able to run my two separate gtf files independently (one is the hg19 reference gtf and another one a custom ncRNA transcript set for hg19), and then catenate the gene sets; when I combine the two gtfs, I only get counts for one of them (the custom ncRNA transcript set for hg19).

I don't know if flanking regions have anything to do with read counting here. As I understand from the tutorial page, only exon rows in the gtf are considered in the run (although I don't know then how introns are added, actually). So I only used exons in my custom gtf. But maybe you do have a point and I should test adding some other fields into my custom gtf.

Cheers, Maria

JonETJakobsson commented 4 years ago

Hi

Exactly, only the exon feature is listed as obligatory, and the genes I got to work only had this feature in the GTF file, so that is promising. How did you merge the two GTF files? It sound like one of the gtf files is correct after merging, but the other got corrupted. Did they have the exact same format before you merge them?

I used pyranges to load the species genom GTF into a pandas data frame. but I also had to remove some strange columns that should not be in the GTF file. df = pr.read_gtf(gtf_file, output_df=True) df.drop(columns=["(assigned", "previous"], inplace=True)

I could then load my manually created csv with exon features, start, stop etc. for the artificial genes into a pandas data frame and merge it with the species data frame. In this way, I can make sure the columns of the two grf files match, and that obligatory data is present in for instance gene_id, gene_name, transcript_id and transcript_name fields.

other_df = pd.read_csv(other, sep=";") new_df = df.append(other_df)

I then turned the merge data frame to a pyranges object and saved it to a gtf file. new_gtf = pr.PyRanges(new_df) new_gtf.to_gtf("../mmusculus/all_genes_mmusculus.gtf")

This is only how I did it, and mind you, it is my first time merging gtf files 😃. must be better ways of doing it. but hope it can help!

Kindly, Jon

mliiv commented 4 years ago

Hi Jon,

thanks for this, very useful! I have basically just been using the hg19 gtf that came with cellranger references, and then pretty much made my own ncRNA transcript gtf manually since just some 100 transcripts (which is not the best way, I do agree). But both of them worked separately, so I didn't suspect any bugs (although I had some in the beginning). I also used them to run kallisto - cat together as a ref for hg19 indexing; indexing and counting went without errors/warnings.

In my velocyto verbose, I get this after reading the gtf: Generated 2095490 features corresponding to 166292 transcript models from hg19_genes_ncrnas.gtf

But then later that 0 introns have been validated: Validated 0 introns (of which unique intervals 0) out of 964599 total possible introns (considering each possible transcript models).

There has been an issue about this but no reply yet from the authors: https://github.com/velocyto-team/velocyto.py/issues/242

So I can maybe re-save my gtf with your code and try to re-run once again, cheers, Maria

JBreunig commented 4 years ago

Could it be related to this issue: https://groups.google.com/forum/#!searchin/kallisto-and-applications/transgene%7Csort:date/kallisto-and-applications/BFX5mnzDNQM/NMg2o8ulBAAJ

I was able to get my transgene annotated in the loom with kallisto/bustools but I'm not having luck with velocyto.

RoganGrant commented 1 year ago

I ran into the same issue when working with a hybrid genome (host and pathogen). My solution ultimately was to use the host genome only as IAV did not follow the expected GTF format and then readjust each of the mouse columns to reflect the genome prefix that cellranger adds when running mkref on multiple genomes. In R, this was my general solution:

library(rtracklayer)
mask_gtf = readGFF("mm10_rmsk.gtf")
mask_gtf$seqid = factor(ifelse(is.na(mask_gtf$seqid),
                               yes = NA_character_,
                               no = paste0("GRCm38______________", mask_gtf$seqid)))
mask_gtf$gene_id = factor(ifelse(is.na(mask_gtf$gene_id),
                                 yes = NA_character_,
                                 no = paste0("GRCm38______________", mask_gtf$gene_id)))
mask_gtf$transcript_id =  factor(ifelse(is.na(mask_gtf$transcript_id),
                                        yes = NA_character_,
                                        no = paste0("GRCm38______________", mask_gtf$transcript_id)))
export(object = mask_gtf, con = "mask.gtf", format = "gtf")

genes_gtf = readGFF("refdata-gex-mm10-2020-A/genes/genes.gtf")
genes_gtf$seqid = factor(ifelse(is.na(genes_gtf$seqid),
                                yes = NA_character_,
                                no = paste0("GRCm38______________", genes_gtf$seqid)))
genes_gtf$gene_id = factor(ifelse(is.na(genes_gtf$gene_id),
                                yes = NA_character_,
                                no = paste0("GRCm38______________", genes_gtf$gene_id)))
genes_gtf$transcript_id =  factor(ifelse(is.na(genes_gtf$transcript_id),
                                         yes = NA_character_,
                                         no = paste0("GRCm38______________", genes_gtf$transcript_id)))
genes_gtf$gene_name =  factor(ifelse(is.na(genes_gtf$gene_name),
                                     yes = NA_character_,
                                     no = paste0("GRCm38______________", genes_gtf$gene_name)))
export(object = genes_gtf, con = "refdata-gex-mm10-2020-A/genes_edited.gtf", format = "gtf")