sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0
273 stars 67 forks source link

errors when running zUMIs using non-reference genome annotation #209

Closed olechnwin closed 4 years ago

olechnwin commented 4 years ago

Hi,

I am helping my colleague to run zUMIs using our custom assembled genome. In the beginning we encountered the following error:

Error in .make_splicings(exons, cds, stop_codons) : 
  some CDS cannot be mapped to an exon
Calls: .get_tx_lengths
Execution halted

As the error mentioned, this error was due to some CDS annotation in the gtf file that does not have a corresponding exon annotations. The error came from the R function makeTxDbFromGFF in your runfeatureCountFUN.R. This I believe is due to our custom assembled genome not having the specific region annotated as exons. So, I went ahead and remove around 9 transcripts which give this error. Athough I fixed CDs error, I now have a different error from a different R function called within makeTxDbFromGFF;

Make the TxDb object ... Error in .makeTxDb_normarg_genes(genes, transcripts$tx_id, transcripts_tx_name = transcripts$tx_name) : 
  'genes$gene_id' must be a character vector (or factor) with no NAs

At this point, I am starting to think that trying to fix all these errors may not be the way to go. As a side note, I was able to successfully make txdb object if I used a gff file instead of gtf file. However, the next line of code which is: exon <- GenomicFeatures::exonsBy(txdb, by="gene") will throw an error, as there was no gene annotation in our annotation file.

The genome annotation we are using was hg19 reference genome annotation lifted to convert the genomic location to our assembled genome coordinate. I am attaching our assembled genome annotation for your reference.

cns_p.phased.0.lifted_cleaned_sort_name.gtf.gz

My colleague also noted that she was able to run previous version of zUMIs with our assembled genome.

Any suggestions on how to fix these? Thank you in advance for your help.

cziegenhain commented 4 years ago

Hi,

As you mention, the parsing of annotations from the GTF file seems to be the issue here. Since you mention liftover was performed, I could imagine the transcript length calculation (first error message above) failing due to impossible coordinates?

The function makeTxDbFromGFF is a standard of the popular GenomicFeatures bioconductor package. If there are errors with lines not conforming to the GTF spec, these errors can of course occur. The second error you mention is a lack of or a problem with the gene_id attribute in the 9th field of the GTF spec.

Importantly, zUMIs does NOT support transcript-level counts. You must have gene identifiers. Similarly, we also do not support GFF files as input.

Please carefully inspect your gtf file and make sure it conforms to the definition https://www.ensembl.org/info/website/upload/gff.html

I cannot speak to mentioning of the previous runs going through - just that there was not an error message doesn't mean an out of spec GTF file would have produced useful results.

Best, Christoph

olechnwin commented 4 years ago

Thanks for your quick reply. The first error message above was failing due to (I think) non-existent exon sequences in our assembled genome. I am not saying there is something wrong with the makeTxDbFromGFF function. I am asking whether there is a better way to do what makeTxDbFromGFF function does for assembled genome which are missing some regions because I think that's what happened. The link you provided above describes what the format of gtf file should be and our gtf file does follow this format. Just that some annotation are missing since this is after all assembled genome.

cziegenhain commented 4 years ago

Feel free to reopen if further help is needed.

/C