stuart-lab / signac

R toolkit for the analysis of single-cell chromatin data
https://stuartlab.org/signac/
Other
323 stars 86 forks source link

Error in annotation[annotation$type == "body", ] : incorrect number of dimensions #1531

Closed leecon30 closed 10 months ago

leecon30 commented 10 months ago

Hi I've been attempting to run through your vignette for multiomodal data and all was going well until I got to coverage plot. I've been troubleshooting it for 5 days now and i've almost given up. What I know so far.

  1. I am able to link peaks - it runs and detects the gene names and whether or not there are peaks linked. so its not likely a problem with the GTF
  2. I can get it to work as long as I specify a region that does NOT contain a gene (see Rplot05) and the output is your standard track with peaks and obviously genes and links are blank. And this is reliable for any coordinates in the provided BSgenome that do not contain a gene. So theres nothing wrong with the BSgenome. e.g.

p1 <- CoveragePlot(seurat, region = "chr1-99128000-99130000", features = "LOC115705239", extend.upstream = 1000, extend.downstream = 1000)

Rplot05

4.However, when I want to investigate a gene of interest by putting in either the gene_name (yes i've ensure this is properly formatted in the GTF and in UCSC style), or if i provide the coordinates corresponding to any locus with a gene I get the following error.

p1 <- CoveragePlot(seurat, region = "LOC115705239", features = "LOC115705239", extend.upstream = 1000, extend.downstream = 1000)

Error in annotation[annotation$type == "body", ] : incorrect number of dimensions

or when I use the coordinates for that gene -

p1 <- CoveragePlot(seurat, region = "chr1-99131488-99139130", features = "LOC115705239", extend.upstream = 1000, extend.downstream = 1000)

Error in annotation[annotation$type == "body", ] : incorrect number of dimensions

Any ideas how I can fix this issue? I cant make sense of that error.

Please if you could shine light on the problem that would be a great help.

Thank you, Lee ```

leecon30 commented 10 months ago

Update, I've now been able to narrow down the issue even further - if i set annotation to false i am able to map the genes now and show linkage at the expense of discarding the gene portion of the plot. Rplot04

So what do you think could be the problem with the annotation? specifically any ideas where I can look to what annotation its trying to pull to inspect it? im not familiar with the error code.

Getting close though im happy i can look at the region now at least haha

leecon30 commented 10 months ago

I think it maybe because I'm using an NCBI GTF annotation and following the steps as

gene.coords <- gtf[gtf$type == 'gene']

Subletting only the gene type and not including transcripts, exon, CDS etc.

Any idea how to subset to include these?

timoast commented 10 months ago

You should skip that step gene.coords <- gtf[gtf$type == 'gene']

I now realize this is on our FAQ page for using a GTF, but it hasn't been updated for a long time. Will update it now

leecon30 commented 10 months ago

Hi timoast, Yeah i figured that would fix it and I tried that but TSS enrichment doesnt like my GTF i get The TSSenrichment error: NA value not accepted. Thats why I was wondering if there was some kind of gene.coords <- gtf[gtf$type == 'gene'] code that i code simply include additional types e.g. transcript, exon, CDS. I tried gene.coords <- gtf[gtf$type == 'gene','transcript','exon','CDS'] but that doesnt work. Perhaps you know an easy work around for what im trying to achieve.

In the mean time ill compare the output from gene.coords <- gtf[gtf$type == 'gene'] with my GTF to try see what NA values are causing the TSS enrichment error.

Really glad i've gotten to the core of the problem now though :D 5 days scratching my head haha. The joys of working with plant species with limited annotation resources.

timoast commented 10 months ago

if you want to only keep some of the annotations you can run something like this:

gene.coords <- gtf[gtf$type %in% c('gene','transcript','exon','CDS')]
leecon30 commented 10 months ago

Hey Timoast,

I just managed to do something similar. gene.coords <- gtf[gtf$type == c('gene', 'transcript', 'exon', 'CDS', 'start_codon', 'stop_codon')]

and that does indeed pull them out. but it's also pulling out NAs in the logical subsript causing the TSS enrichment conflict.

Are there any work arounds for these NA values?

Extracting TSS positions Error: logical subscript contains NAs?

timoast commented 10 months ago

I just managed to do something similar. gene.coords <- gtf[gtf$type == c('gene', 'transcript', 'exon', 'CDS', 'start_codon', 'stop_codon')]

You need to use %in% rather than == otherwise you're going to subset it wrong

Are there any work arounds for these NA values?

If you want to remove rows with NA:

gene.coords <- gtf[!is.na(gtf$type)]
leecon30 commented 10 months ago

ok ill give this a try. Thanks lets see how it goes

leecon30 commented 10 months ago

unfortunately the same issue. Extracting TSS positions Error: logical subscript contains NAs ill keep trying to see if i can sort out the NA problem. Ill let you know if i can fix it

leecon30 commented 10 months ago

Ok i think i've found the issue - NCBI gtf files only associate the gene_biotype tag with the gene type row resulting in NA values in place of the gene_biotype tage for the other types e.g. CDS, Transcript, Exon.

So to get it to work I have to somehow add the corresponding gene_biotype tag to each row of the other types hmmmmmmm

timoast commented 10 months ago

if you have a link to the gtf you're using I can take a quick look

leecon30 commented 10 months ago

Thank you very much thats very kind of you. Just to note there an ensembl version of this genome but it is the outdate .1 suffix which is drastically different and therefore not suitable to use - whole chromosome renumbering etc. So im stuck with NCBI GTF

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/626/175/GCF_900626175.2_cs10/GCF_900626175.2_cs10_genomic.gtf.gz

leecon30 commented 10 months ago

Sorry just to note, i have modified that file i sent you on my local device. first Ive renumbered the chromosomes (UCSC) to chr1, etc etc. I've also added and additional category to column 9 namely gene_name; where the gene_name i designated as the same characters for the gene ID.

leecon30 commented 10 months ago

I deleted the previous comment because that work around causes its own problems hmmmm

leecon30 commented 10 months ago

update - I ran TSS enrich using the tss.positions function to specify gene.coords <- gtf[gtf$type == 'gene'], while the chromatin assay used the full GTF annotation with exons, transcripts etc.

This gave me a new problem with coverage plot downstream, fortunately this new problem provided new insights and was reported before [1159] which drew my attention to the transcript_id column in my GTF. I used notebook and the replace function to change transcript_id with tx_id.

This column name was causing the problem with TSS enrichment Error: logical subscript contains NAs. Its now processing TSS enrichment with all the types e.g. exons, CDS, gene, transcript, start codon etc. Albeit its taking much longer now than with just the gene type on my local computer. But i think this is a good sign.

Hopefully it should work now and we will see.

If it does work, I think it would be very helpful to add in your custom GTF FAQ section to make sure if using a refseq annotation to rename the transcript_id column to tx_id. after importing the GTF with Rtracklayer and then its ready to go.

Ill keep you posted as it goes. Fingers crossed :D

leecon30 commented 10 months ago

Ok I've solved the problem. I can work on the analysis now haha. Thanks :D

Rplot05